Common warnings: Large Clusters

The selection of match keys is crucial to the performance and resource requirements of mHUB.

All incoming data is stored in memory in 'clusters' of similar records (i.e. candidates for comparison). These clusters are determined by the match keys in use. When incoming data is added to an existing cluster, the new data is compared to the existing data in that cluster.
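The clustering idea can be illustrated with a minimal Python sketch. This is not mHUB's implementation: the record layout, the postcode_key function and the add_record helper are all hypothetical, chosen only to show how a match key decides which records become candidates for comparison.

```python
from collections import defaultdict

def postcode_key(record):
    """Hypothetical loose match key: the postcode on its own."""
    return record["postcode"].replace(" ", "").upper()

def add_record(clusters, record, key_fn):
    """File a record under its key value; return the records it must be compared with."""
    cluster = clusters[key_fn(record)]
    candidates = list(cluster)  # existing members of the same cluster
    cluster.append(record)
    return candidates

clusters = defaultdict(list)
incoming = [
    {"id": 1, "name": "A Smith", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Alan Smith", "postcode": "ab12cd"},
    {"id": 3, "name": "B Jones", "postcode": "XY9 8ZW"},
]
for record in incoming:
    candidates = add_record(clusters, record, postcode_key)
    print(record["id"], "compared with", [c["id"] for c in candidates])
```

Records 1 and 2 share a key value and are compared with each other; record 3 lands in its own cluster and is compared with nothing.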

A match key that is too loose will result in large clusters. This increases the number of comparisons required and reduces performance. Cluster size is limited by the maximumClusterSize setting (see Other Settings in the mHUB Configuration Guide). The default maximum cluster size is 200.
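To see why cluster size matters so much, consider a rough estimate of the work involved. The sketch below assumes that every new record is compared with every record already in its cluster, which is a simplification of the behaviour described here and not a measurement of mHUB itself. The function name is hypothetical.

```python
def comparisons_to_build(cluster_size):
    """Comparisons made while a cluster grows to cluster_size, one record at a time.

    The k-th record to arrive is compared with the k - 1 records already
    present, so the total is 0 + 1 + ... + (cluster_size - 1).
    """
    return cluster_size * (cluster_size - 1) // 2

for size in (50, 200, 800):
    print(size, comparisons_to_build(size))
# 50 1225
# 200 19900
# 800 319600
```

Under this assumption the cost grows with the square of the cluster size: a cluster four times larger needs roughly sixteen times as many comparisons.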

Once a cluster grows to the maximum cluster size, mHUB will stop comparing records in that cluster, potentially missing matches. Whenever this happens, a warning message is logged, for example:

Large cluster: key=2 record=23840 records=237

Large cluster: key=2 record=21704 records=232

Large cluster: key=2 record=21348 records=268

This means that more than 200 records (the default maximumClusterSize) share a particular key value. If you see these warnings and you are using a loose key, such as Postcode on its own, you should consider tightening it.

The warning can be ignored if you are confident that the records not being compared on the key in use will be compared on another key. However, if other match keys are also producing large cluster warnings, you should either tighten the match keys to produce smaller clusters or increase the maximumClusterSize.

To check, on a dataset that you are using for testing, that other match keys are picking up the missed records for comparison, you can do this:

  • remove all the matched records picked up by any of the keys
  • increase the maximumClusterSize to a value greater than the counts reported in the large cluster warnings; for most clients we find that around 800 is sufficient to let in matches while minimizing the loss of performance.
  • rerun the matching using the same keys on the deduped dataset and see whether it detects additional matches.
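
To find the counts that the new maximumClusterSize needs to exceed, the large cluster warnings can be tallied per key. The following is a minimal Python sketch and not part of mHUB. It assumes the log lines have exactly the format of the examples shown earlier, that key identifies the match key, and that records is the number of records in the cluster. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches warning lines of the form shown in the examples.
LARGE_CLUSTER = re.compile(r"Large cluster: key=(\d+) record=(\d+) records=(\d+)")

def summarise_large_clusters(lines):
    """Return {key: (number_of_warnings, largest_records_value)}."""
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LARGE_CLUSTER.search(line)
        if match is None:
            continue  # not a large cluster warning
        key, _record, records = (int(group) for group in match.groups())
        summary[key][0] += 1
        summary[key][1] = max(summary[key][1], records)
    return {key: tuple(values) for key, values in summary.items()}

sample_log = [
    "Large cluster: key=2 record=23840 records=237",
    "Large cluster: key=2 record=21704 records=232",
    "Large cluster: key=2 record=21348 records=268",
]
print(summarise_large_clusters(sample_log))
# {2: (3, 268)}
```

For the sample lines, key 2 produced three warnings and the largest count was 268, so any test value of maximumClusterSize would need to be above 268. Keys that never appear in the summary are not producing large clusters.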