mHUB - mHUB Integration Guide - Basics

 

Passing Data to mHub

See the Getting Started | Using the API | Loading Data for full details of how to load data into mHub.

The API doesn’t support quoted fields, so we recommend picking a delimiter that is guaranteed not to appear within the fields, such as 0x01 or another non-printable character. Because the delimiter is specified as the first character of the data, you can use a different delimiter for each call to AddData().
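As a minimal sketch of the delimiter advice above: the helper below builds one record string with the delimiter as its first character. The name buildRecord is hypothetical (it is not part of the mHub API), and the layout it produces (delimiter, then the fields separated by that same delimiter) is an assumption based on the description in this section.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of the mHub API.
// Assumed layout: the delimiter comes first, then the fields,
// separated by that same delimiter.
std::string buildRecord(char delimiter, const std::vector<std::string>& fields) {
    std::string record(1, delimiter);
    for (std::size_t i = 0; i < fields.size(); ++i) {
        // Quoted fields are not supported, so a field that contains the
        // delimiter cannot be represented; reject it instead of corrupting the record.
        if (fields[i].find(delimiter) != std::string::npos) {
            throw std::invalid_argument("field contains the delimiter");
        }
        if (i > 0) {
            record += delimiter;
        }
        record += fields[i];
    }
    return record;
}
```

A caller might build a record with buildRecord('\x01', fields) and pass the result's c_str() to the const char* overload of AddData(); check the exact calling convention against Getting Started | Using the API | Loading Data.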

See the Configuration Guide | Appendix A – Column Types for a list of the keywords that can be used in the input columns definition. Pass only the fields that the matchIT engine actually needs in order to produce match keys.

Encoding

mHub fully supports Unicode encoding. There are three overloaded versions of the AddData() method, which take UTF-8, ANSI and UTF-16 data respectively:

            int AddData(int table, const utf8char_t* data, int timeoutInMS);

            int AddData(int table, const char* data, int timeoutInMS);

            int AddData(int table, const wchar_t* data, int timeoutInMS);

…just pass the appropriate data type and it will be handled accordingly.

For languages such as Chinese that don’t use the Latin-1 character set, mHub transliterates the characters into Latin-1 so that it can still generate phonetic keys for names, e.g. 续函 becomes Xù Hán.

Outputting Results

See the Getting Started | Using the API | Retrieving Results for full details of how to fetch results from mHub.

One or more output types must be enabled. By default, only matching groups are output. Each output type is a delimited string, with the first character being the delimiter.

  • MP - matching pairs: two matching records, plus the score for each level they match at;
    • "|MP|record1|record2|levels|ind|fam|add|bus|cus"
  • GP - grouped matching pairs: as per a matching pair, plus the group that contains this pair;
    • "|GP|level|record1|record2|score|group|basescore"
  • MG - matching groups: a single record, plus the group that contains it;
    • "|MG|level|group|basescore|table|record"
  • DE - deduped data: a copy of the input data minus all duplicate records (retaining the master/best record in each group);
    • "|DE|level|data"
  • DU - duplicate data: all duplicate records from the input data.
    • "|DU|level|data"
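Because every output string starts with its own delimiter, a consumer can read the delimiter from the first character and split on it. The function below is a sketch of that idea only; splitOutput is a hypothetical name, not an mHub API call. Note that for DE and DU the data field may itself contain delimited input data, in which case you would stop splitting after the level field.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of the mHub API.
// Splits an output string whose first character is the delimiter,
// e.g. "|MG|level|group|basescore|table|record".
std::vector<std::string> splitOutput(const std::string& line) {
    std::vector<std::string> fields;
    if (line.empty()) {
        return fields;
    }
    const char delimiter = line[0];
    std::size_t start = 1;  // skip the leading delimiter
    while (true) {
        const std::size_t end = line.find(delimiter, start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}
```

The first element of the result is the output type code (MP, GP, MG, DE or DU), which tells the caller how to interpret the remaining fields.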

See the Configuration Guide | Matching Settings for a detailed explanation of the above output types.

As an illustration of the control you can offer in your user interface (GUI), your application could present the end user with output options such as:

  1. Output deduplicated records
    • Unique records and a master record from each group of duplicates are kept
    • Duplicates could be sent to a secondary output or discarded
  2. Output unique records only
    • Grouped duplicates could be sent to a secondary output or discarded
  3. Output all records with duplicates flagged.

The first two options can be implemented using the deduped data and matching groups output types. The third option can be implemented using the deduped data and duplicate data output types.

Basic User Settings

Ideally, core options are controlled via the GUI so that end users don’t have to edit an XML configuration file for basic functionality; most of the time (perhaps more than 90%) your application should auto-generate the XML. However, we don’t recommend presenting advanced configuration options in the GUI: these can be left at their defaults for the vast majority of use cases, and over-complicating the GUI detracts from ease of use. It’s also best to steer end users away from changing the matching weights, as tuning them successfully requires training and time to experiment methodically.

We recommend that the following basic settings be built into the GUI:

  • Matching level
  • Minimum threshold score
  • Match keys.

What Makes Us Special explains how mHub’s matching engine works. Reading it will help you understand the context of the matching advice in the following sections of this set of articles.

Matching Levels

There are several different standard levels of matching, as follows:

  • Individual e.g. matching Bill Smith and William Smith at the same address.
  • Family e.g. matching Bill Smith and Sheila Smith at the same address.
  • Household (also known as Address) e.g. matching Bill Smith and Susan Jones at the same address.
  • Company e.g. matching Bill Smith at J Sainsbury PLC and Susan Jones at Sainsburys at the same address.
  • Name only e.g. matching Bill Smith and William Smith at the same or different addresses.
  • Company only e.g. matching ABC Life Assurance and A.B.C. Life at the same or different addresses.

You can also set up custom matching rules, which would include matching on email address or telephone number.

You can let the user decide which level(s) to match to via e.g. a set of checkboxes in your GUI.

Alternatively you can automatically choose which levels to use based on whether or not the input layout contains people's names or organization names etc. as follows:

Fields available                        Appropriate matching levels
Name fields                             Name Only Level
Address                                 Address Level
Address + Name fields                   Individual Level (or Family Level)
Organization                            Company Only Level
Address + Organization                  Business Level
Address + Organization + Name fields    Individual Level

 

Or you could use the automatic selection of level to set default checkboxes that the user can then change.
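The field-to-level table above could be encoded as a small lookup that your GUI uses to pre-tick its checkboxes. This is only a sketch: AvailableFields and defaultLevels are hypothetical names, and the handling of field combinations that the table does not list is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical description of which field groups the input layout contains.
struct AvailableFields {
    bool name = false;
    bool address = false;
    bool organization = false;
};

// Returns the default matching level(s) for a layout, following the table above.
std::vector<std::string> defaultLevels(const AvailableFields& f) {
    if (f.address && f.organization && f.name) return {"Individual"};
    if (f.address && f.organization) return {"Business"};
    // The table also allows Family here; Individual is used as the default.
    if (f.address && f.name) return {"Individual"};
    // ASSUMPTION: combinations the table does not list (e.g. Organization +
    // Name fields without Address) fall through to the single-field rows below.
    if (f.organization) return {"Company Only"};
    if (f.address) return {"Address"};
    if (f.name) return {"Name Only"};
    return {};  // nothing to match on
}
```

The returned level names are plain labels for the GUI; map them to the actual configuration values described in the Configuration Guide | Matching Settings.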

See the Configuration Guide | Matching Settings for more guidelines on choosing matching levels.

Minimum Threshold Scores

Minimum threshold score settings could be presented to the end user as Loose, Medium or Tight. With the standard matching weights, each level has its own maximum score and minimum threshold score:

Matching level        Maximum score    Minimum threshold score
Individual Level      120              80
Name Only Level       60               30
Family Level          120              80
Address Level         60               40
Business Level        120              80
Company Only Level    60               30
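One way to turn a Loose / Medium / Tight choice into a per-level score is sketched below. The maximum and minimum values are copied from the table above, but the mapping itself is purely an illustrative assumption: this guide does not say which scores Loose, Medium and Tight should correspond to, so the fractions used here (Loose = the table minimum, Medium = a quarter of the way towards the maximum, Tight = halfway) are placeholders to replace with values you have tested on your own data. All names are hypothetical.

```cpp
#include <map>
#include <string>

struct ScoreRange {
    int maximum;           // maximum score for the level
    int minimumThreshold;  // minimum threshold score for the level
};

// Values copied from the table above (standard matching weights).
const std::map<std::string, ScoreRange>& standardScoreRanges() {
    static const std::map<std::string, ScoreRange> ranges = {
        {"Individual", {120, 80}}, {"Name Only", {60, 30}},
        {"Family", {120, 80}},     {"Address", {60, 40}},
        {"Business", {120, 80}},   {"Company Only", {60, 30}},
    };
    return ranges;
}

enum class Tightness { Loose, Medium, Tight };

// ASSUMPTION: the fractions below are illustrative only, not mHub guidance.
// Throws std::out_of_range if the level name is unknown.
int thresholdFor(const std::string& level, Tightness tightness) {
    const ScoreRange& range = standardScoreRanges().at(level);
    const int span = range.maximum - range.minimumThreshold;
    switch (tightness) {
        case Tightness::Loose:  return range.minimumThreshold;
        case Tightness::Medium: return range.minimumThreshold + span / 4;
        case Tightness::Tight:  return range.minimumThreshold + span / 2;
    }
    return range.minimumThreshold;
}
```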

 

Match Keys

Refer to the mHub Configuration Guide | Appendix B - Match Keys for a list of keywords that represent normalized (parsed, phoneticized, standardized) forms of the input data. These are ideal for use as match keys.

Additionally, any of the input column types can be used as match keys. However, because this is the input data in its raw form, subtle differences in the data will prevent matches. The mHub Configuration Guide | Appendix C - Match Key Functions lists functions that can be applied to match keys; these offer some basic normalization, such as UPPER() and TRIM().
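To see why raw input columns make poor match keys, consider what upper-casing and trimming do to a value. The C++ functions below are a stand-alone illustration of that kind of normalization; they are not mHub's implementation of UPPER() or TRIM(), and they handle ASCII text only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Removes leading and trailing ASCII whitespace.
std::string trim(const std::string& s) {
    const char* whitespace = " \t\r\n";
    const std::size_t first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return "";  // empty or all whitespace
    }
    const std::size_t last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Upper-cases ASCII letters; other bytes are left unchanged in the "C" locale.
std::string upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}
```

Without normalization, "  Smith " and "SMITH" are different key values, so the two records would never be compared; after upper(trim(...)) both produce the same key.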

The selection of match keys depends on the data (residential or business, US, UK or international) and the volumes.

Loose keys are fine for low-volume data, but tight keys are required for high-volume data or performance will degrade significantly. “High volume” typically means more than roughly 5 million records, but this depends on the concentration of the data, e.g. geographical concentration if address is part of the matching criteria.

You can set appropriate defaults for matching on name (individual or organization) and address from the following tables:

Medium Data Volume Match Keys

The US and other data with high-level postal codes

Matching level        Default keys
Person level          PostOut + PhoneticLastName
                      PhoneticLastName + PhoneticStreet
                      AddressKey + Premise
Family level          Same as Person level
Address level         PostOut + PhoneticLastName[1]
                      PostOut + PhoneticStreet + Premise
                      AddressKey + Premise
Organization level    PostOut + PhoneticOrganizationName1
                      PhoneticOrganizationName1 + PhoneticStreet
                      AddressKey + Premise

 

The UK and other data with low-level postal codes

Matching level        Default keys
Person level          PostOut + PhoneticLastName
                      PhoneticLastName + PhoneticStreet
                      Postcode
Family level          Same as Person level
Address level         PostOut + PhoneticLastName
                      Postcode + PhoneticStreet + Premise
                      Postcode
Organization level    PostOut + PhoneticOrganizationName1
                      PhoneticOrganizationName1 + PhoneticStreet
                      Postcode

 

 

High Data Volume Match Keys

The US and other data with high-level postal codes

Matching level        Default keys
Person level          AddressKey + PhoneticLastName
                      PostOut + PhoneticStreet + Premise[2]
                      Postcode[3] + PhoneticLastName
Family level          Same as Person level
Address level         AddressKey + PhoneticLastName
                      PostOut + PhoneticStreet + Premise[2]
                      Postcode[3] + Premise[2][4]
Organization level    AddressKey + PhoneticOrganizationName1
                      Postcode + PhoneticOrganizationName1 + PhoneticOrganizationName2
                      Postcode + PhoneticStreet + Premise[2]

 

The UK and other data with low-level postal codes

Matching level        Default keys
Person level          Postcode + PhoneticLastName
                      AddressKey + PhoneticLastName
                      Postcode + Premise[4]
Family level          Same as Person level
Address level         AddressKey + PhoneticLastName
                      Postcode + Premise[4]
Organization level    Postcode + PhoneticOrganizationName1
                      AddressKey + PhoneticOrganizationName1
                      Postcode + Premise[4]

 

Note that even when matching at address level, if the data includes names, we recommend that one of the match keys includes name (individual or organization). This is because the match key is used only to identify candidate pairs of records for comparison, not to determine whether the records match. By using name in conjunction with a smaller part of the address or postal code data, additional candidates can be identified for comparison.
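The role of match keys as a candidate filter can be illustrated with a small stand-alone sketch. It assumes each record already carries one value per match key and returns every pair of records that share a value for at least one key; scoring those pairs is a separate step that is not shown. This illustrates the general idea only, not mHub's internal algorithm, and KeyedRecord and candidatePairs are hypothetical names.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: an id plus one already-computed value per match key.
struct KeyedRecord {
    int id;
    std::vector<std::string> matchKeys;
};

// Returns the pairs of record ids that share a value for at least one key.
std::set<std::pair<int, int>> candidatePairs(const std::vector<KeyedRecord>& records) {
    // Group record ids by (key position, key value) so that values from
    // different key definitions never collide.
    std::map<std::pair<std::size_t, std::string>, std::vector<int>> blocks;
    for (const KeyedRecord& record : records) {
        for (std::size_t k = 0; k < record.matchKeys.size(); ++k) {
            if (!record.matchKeys[k].empty()) {  // a blank key selects nothing
                blocks[std::make_pair(k, record.matchKeys[k])].push_back(record.id);
            }
        }
    }
    std::set<std::pair<int, int>> pairs;  // the set removes repeated pairs
    for (const auto& block : blocks) {
        const std::vector<int>& ids = block.second;
        for (std::size_t i = 0; i < ids.size(); ++i) {
            for (std::size_t j = i + 1; j < ids.size(); ++j) {
                pairs.emplace(std::min(ids[i], ids[j]), std::max(ids[i], ids[j]));
            }
        }
    }
    return pairs;
}
```

Each additional key can only add candidate pairs, never remove them. A name-based key therefore catches records whose address keys differ (for example because of a mistyped premise), while the level's scoring still decides whether the pair is a match.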

 

[1] Omit this key if last name is not available

[2] Where Premise is non-blank

[3] Where Postcode is at street or lower level

[4] Or Delivery Point code/suffix if available

 

Reporting

A basic matching report (with graphs) based on matches found and scores should be core functionality.

The main statistic to report is the number of potential duplicates found.

The end user might also like to see a breakdown of matches by score.
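A breakdown by score could be produced by counting matches in fixed-width score bands, as in the sketch below. It assumes you have already collected the scores as integers (for example from the score field of the GP output) and that scores are non-negative; scoreBreakdown is a hypothetical name.

```cpp
#include <map>
#include <vector>

// Counts matches per score band. The map key is the lower bound of each
// band, so with bandWidth 10 the key 80 covers scores 80 to 89.
// Assumes non-negative scores.
std::map<int, int> scoreBreakdown(const std::vector<int>& scores, int bandWidth) {
    std::map<int, int> bands;
    if (bandWidth <= 0) {
        return bands;  // no sensible banding is possible
    }
    for (int score : scores) {
        bands[(score / bandWidth) * bandWidth] += 1;
    }
    return bands;
}
```

The resulting map is ordered by band, so it can be fed directly to a bar chart in the matching report.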
