Requirements

mHUB is currently available for these operating systems:

  • Windows (XP, Vista, 7, 10, 2008, 2012);
  • Linux (including RHEL 5-7)*;
  • Solaris (10 or 11)*.

mHUB can be provided for other operating systems on request.

(*) Note that mHUB requires GCC 4.8.x or newer, which isn't available out-of-the-box on Solaris and some older Linux distributions. Please refer to the Runtime Requirements section for further details.

 

CPU

mHUB is fully multithreaded and highly scalable. By default, it will use all available processor cores, but it can be restricted to fewer if necessary. The more cores mHUB runs on, the faster it can process data.

Clients typically provision a VM with enough cores to process their data volume in a timely manner. Typical sizings we have seen are:

  • 500 million to a billion+ records: a 64-core machine*
  • 200 million+ records: a 32-core machine
  • 80 million+ records: a 16-core machine
  • 30 million+ records: an 8-core machine
  • we suggest a minimum of 4 cores; even with 4 cores, you can process millions of records an hour

* If you are looking to process a billion records in minutes rather than hours, we suggest distributed processing with Hub for Apache Spark.
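
As a rough illustration of the guidance above, the sketch below maps an expected record volume to a suggested core count. It is a hypothetical capacity-planning helper, not part of mHUB or its API.

    # Hypothetical capacity-planning helper based on the sizing guidance above;
    # not part of mHUB. Figures assume a single (non-distributed) engine.
    def suggested_cores(record_count: int) -> int:
        if record_count >= 500_000_000:
            return 64   # 500 million to a billion+ records
        if record_count >= 200_000_000:
            return 32
        if record_count >= 80_000_000:
            return 16
        if record_count >= 30_000_000:
            return 8
        return 4        # suggested minimum; still processes millions of records an hour

    print(suggested_cores(250_000_000))  # -> 32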

RAM

mHUB can run entirely in-memory. As the volume of data increases, memory requirements also increase. It is highly recommended that mHUB runs on a machine with enough memory to process the data without falling back to disk storage.

As a rough guideline:

  • a machine with 8 GB of RAM should comfortably process 15 million rows;
  • a machine with 16 GB of RAM should comfortably process 30 million rows;
  • a machine with 32 GB of RAM should comfortably process 60 million rows;
  • a machine with 48 GB of RAM should comfortably process 80 million rows.

If overlapping two sources of data, use their summed row counts with these guidelines (for example, overlapping 100 million records against 20 million records would require roughly 80 GB of RAM).

Note that these figures are highly dependent on factors such as:

  • the average size of each row (these figures assume an average row size of 150 bytes);
  • which match keys are used (refer to the Configuration Guide for details on match keys);
  • the amount of duplication in the data.
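
For a first-pass estimate, the guideline figures above work out to roughly 0.5 to 0.6 GB of RAM per million rows (at a 150-byte average row size), i.e. several times the raw data size once match keys and indexing are accounted for. The sketch below is illustrative only and not part of mHUB; round up generously in practice.

    # Illustrative RAM estimate derived from the guideline table above; not an mHUB API.
    # Assumes an average row size of about 150 bytes.
    GB_PER_MILLION_ROWS = 0.6  # 48 GB / 80 million rows, the most conservative ratio above

    def estimated_ram_gb(total_rows: int) -> float:
        """Rough in-memory estimate; for an overlap, pass the summed row count of both sources."""
        return (total_rows / 1_000_000) * GB_PER_MILLION_ROWS

    # Overlapping 100 million records against 20 million (120 million rows in total):
    print(estimated_ram_gb(120_000_000))  # -> 72.0, which the guidance above rounds up to ~80 GB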

To work with such high volumes of data, it is necessary to use the 64-bit edition of the mHUB component. The 32-bit edition can typically allocate only 2 GB of RAM, which limits the amount of data that can be held in memory and processed without resorting to disk for overflow storage (and disk overflow has a significant impact on performance). In practice, the 32-bit limit is 1 or 2 million rows of data, but this depends entirely on factors such as the quality of the data and which matching levels and match keys are in use.

Normalization: Note that when an engine is configured for normalization, a row of data added to the engine is discarded immediately after it is processed and output, rather than being retained in RAM. The RAM requirements above therefore do not apply, and memory usage is minimal.

 

Disk

mHUB can fall back to storing data on disk, for example if memory usage exceeds a predetermined threshold. This can significantly impact performance, but allows greater volumes of data to be processed. If disk usage is necessary, fast disks (such as SSDs) are highly recommended.

Performance

Due to its in-memory architecture, matchIT Hub is many times faster than any other specialist contact data matching solution. It scales automatically across multiple processors, efficiently processing very high data volumes. Performance depends principally on hardware and match rate (duplication or overlap rate), but examples include:

  • Finds the overlap of 100,000 records against 50 million preloaded records in 11 seconds (uses 13 GB RAM, 20% match rate)*
  • Matches 1 million records in 12 seconds (uses 500 MB RAM, 11% match rate)*
  • Matches 50 million records in 52 minutes (uses 15 GB RAM, 12% match rate)*
  • Matches 1 billion records in 15 minutes (using Apache Spark, 10% match rate)†
    * Using a 10-core hyper-threaded Windows PC with 64 GB RAM
    † Using a cluster of 20 machines on AWS, each with 192 GB RAM and 48 cores (3.1 GHz Intel Xeon® Platinum 8175 CPUs). Starting up the machine cluster takes an additional 6 minutes.
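
To put these figures in perspective, the quoted times correspond roughly to the throughputs below. This is a simple back-of-the-envelope calculation over the published examples above, not a benchmark tool.

    # Back-of-the-envelope throughput from the published examples above.
    examples = [
        ("1 million records, single machine", 1_000_000, 12),          # 12 seconds
        ("50 million records, single machine", 50_000_000, 52 * 60),   # 52 minutes
        ("1 billion records, Spark cluster", 1_000_000_000, 15 * 60),  # 15 minutes
    ]
    for name, records, seconds in examples:
        print(f"{name}: ~{records / seconds:,.0f} records/second")
    # -> ~83,333; ~16,026; ~1,111,111 records/second respectively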

 

 
