mHUB - mHUB Appendices

Appendix A - Column Types

The following column types can be used when defining the input data columns. Refer to Data Settings, above, for further details including how to map multiple elements by prefixing with "First", "Second", "Third", "Fourth" and "Fifth".

Unique Ref

  • UniqueRef.

A unique ref is typically mapped to a table's primary key column, if available. For certain operations, such as when calling the engine's GetData() method, a unique ref is required and must therefore be mapped to the input data.

Name

  • FullName;

  • Prefix;

  • FirstNames;

  • Forenames - synonym of FirstNames;

  • MiddleNames;

  • Initials;

  • LastName;

  • Surname - synonym of LastName;

  • Qualifications;

  • Suffix.

For best results specify either FullName or a combination of the other fields (Prefix, FirstNames, MiddleNames, Initials, LastName, Qualifications, and Suffix). If FullName is specified then the name elements will be extracted (and can be output when the engine is configured for normalization mode).

Organization

  • Organization;

  • Company - synonym for Organization;

  • Department;

  • JobTitle.

Address

  • Address1;

  • ...

  • Address9;

  • Town;

  • City - synonym for Town;

  • Region;

  • State - synonym for Region;

  • County - synonym for Region;

  • Postcode;

  • ZIP - synonym for Postcode;

  • ZipPlus4 - synonym for Postcode; (From version 2.0.3)

  • PostOut;

  • Zip5 - synonym for PostOut; (From version 2.0.3)

  • PostIn;

  • Plus4 - synonym for PostIn;

  • DeliveryPoint;

  • DPS - synonym for DeliveryPoint;

  • Country.

Telephone

  • Telephone;

  • Fax.

Email

  • Email.

Date of Birth

  • DateOfBirth;

  • DOB - synonym for DateOfBirth.

Website

  • Website;

Custom Fields

  • CustomField1;

  • ...

  • CustomField9.

Other

  • Other;

  • Other1;

  • ...

  • Other32; (before version 2.0.3)

  • Other200. (from version 2.0.3)

Other-type columns cause data in that column to be ignored. This can be useful when outputting a copy of the input data (matching mode) or for passing unmodified data straight into the output (normalization mode).

 

Appendix B - Match Keys

The following match keys are available for use; refer to Matching -> Match Keys for further details.

Name

  • NormalizedName;
  • NameKey;
  • PhoneticLastName;
  • Name1 - synonym for PhoneticLastName;
  • PhoneticFirstName;
  • Name2 - synonym for PhoneticFirstName;
  • PhoneticMiddleName;
  • Name3 - synonym for PhoneticMiddleName;
  • FirstNameFound;
  • Gender.

NameKey: The first five characters from PhoneticLastName, plus the first character from the first name.

NormalizedName: A normalization of the name in upper case, of the form "LastName,FirstName,MiddleName" - for example, "SMITH,JOHN,R."

PhoneticLastName/Name1: A phoneticization of the last name, up to eight characters in length by default.

PhoneticFirstName/Name2: A phoneticization of the first name, up to eight characters in length by default.

PhoneticMiddleName/Name3: A phoneticization of the middle name, up to eight characters in length by default.

FirstNameFound: Either Y (yes) or N (no) to indicate whether a firstname has been recognized.

Gender: Either M (male), F (female), E (either), or X (inconsistent).

Organization

  • NormalizedOrganization;
  • OrganizationKey;
  • PhoneticOrganizationName1;
  • OrgName1 - synonym for PhoneticOrganizationName1;
  • PhoneticOrganizationName2;
  • OrgName2 - synonym for PhoneticOrganizationName2;
  • PhoneticOrganizationName3;
  • OrgName3 - synonym for PhoneticOrganizationName3.

OrganizationKey: The first five characters from PhoneticOrganizationName1 plus the first five characters from PhoneticOrganizationName2.

NormalizedOrganization: A normalization of the company name in upper case, of the form "FirstWord,SecondWord,ThirdWord" - for example, "HELPIT,SYSTEMS,LTD"

PhoneticOrganizationName1/OrgName1: A phoneticization of the first word of the company name, up to eight characters in length by default.

PhoneticOrganizationName2/OrgName2: A phoneticization of the second word of the company name, up to eight characters in length by default.

PhoneticOrganizationName3/OrgName3: A phoneticization of the third word of the company name, up to eight characters in length by default.

Address

  • AddressKey;
  • PhoneticStreet;
  • PhoneticTown;
  • PostOut;
  • PostIn;
  • Premise.

PhoneticStreet: A phoneticization of the address's street, up to eight characters in length by default.

PhoneticTown: A phoneticization of the address's town, up to eight characters in length by default.

AddressKey: The first four characters from PhoneticTown plus the first four characters from PhoneticStreet. This will always be eight characters in length, and should never be trimmed because spaces are meaningful.

PostOut: The first half of a postcode (UK, Canada) or the first five digits of a US ZIP code; for all other countries this will be a copy of the postcode.

PostIn: The second half of a postcode (UK, Canada) or the last four digits of a US ZIP+4 code; blank for all other countries.

Premise: A copy of the premise that's been extracted from the address. Often used in conjunction with AddressKey to reduce the number of records in each cluster (e.g. AddressKey + Premise).

Telephone

  • TelAreaCode;
  • TelLocalNumber;
  • FaxAreaCode;
  • FaxLocalNumber.

TelAreaCode: Specifies the area code at the start of the telephone number.

TelLocalNumber: Specifies the local part of the telephone number. For example, the area code for the UK telephone number 01372360070 is 01372 whilst the local number is 360070.

FaxAreaCode: Specifies the area code at the start of the fax number.

FaxLocalNumber: Specifies the local part of the fax number.

Input

From version 2.0.3

Any mapped input fields can be used in keys, simply prepend the column type with "Input".

It is not recommended to use input fields in fuzzy keys. Since the values of input fields are not normalized in any way, it is recommended to apply the Match Key Functions PUNTRIM and UPPER.

  • InputFullName;
  • InputPrefix;
  • InputFirstNames;
  • InputForenames - synonym for InputFirstNames;
  • InputMiddleNames;
  • InputInitials;
  • InputLastName;
  • InputSurname - synonym for InputLastName;
  • InputQualifications;
  • InputSuffix.
  • InputOrganization;
  • InputCompany - synonym for InputOrganization;
  • InputDepartment;
  • InputJobTitle.
  • InputAddress1;
  • ...
  • InputAddress9;
  • InputTown;
  • InputCity - synonym for InputTown;
  • InputRegion;
  • InputState - synonym for InputRegion;
  • InputCounty - synonym for InputRegion;
  • InputPostcode;
  • InputZIP - synonym for InputPostcode;
  • InputPostOut;
  • InputPostIn;
  • InputPlus4 - synonym for InputPostIn;
  • InputDeliveryPoint;
  • InputDPS - synonym for InputDeliveryPoint;
  • InputCountry.
  • InputTelephone;
  • InputFax.
  • InputEmail.
  • InputDateOfBirth;
  • InputDOB - synonym for InputDateOfBirth.
  • InputWebsite;
  • InputCustomField1;
  • ...
  • InputCustomField9.

 

Appendix C - Match Key Functions

The following functions can be applied to any one match key (e.g. TRIM(Address1)) or to multiple match keys (e.g. TRIM(Address1+Address2)), or can be combined (e.g. UPPER(TRIM(Address1))).

  • TRIM(field) - Removes leading and trailing whitespace from the field;
  • LTRIM(field) - Removes leading whitespace only from the field;
  • RTRIM(field) - Removes trailing whitespace only from the field;
  • PUNTRIM(field) - Removes all non-alphanumeric characters from the field;
  • UPPER(field) - Uppercases the field;
  • LOWER(field) - Lowercases the field;
  • LEFT(field,count) - Extracts characters from the left of the field; if count exceeds the length then the field is returned as-is;
  • RIGHT(field,count) - Extracts characters from the right of the field; if count exceeds the length then the field is returned as-is;
  • SUBSTRING(field,start,count) - Extracts characters from within the field (start is 0-based, so use 0 for the first character); the data returned is not padded if there are insufficient characters.

Functions are best used with raw input data (names, address lines, postcodes, etc.) rather than with the key fields generated by the mHUB engine (NameKey, AddressKey, etc.).

 

Appendix D - Normalization Outputs

The following field types can be used when specifying the outputs when the engine is running in normalization mode.

All input fields are equivalent to the column types listed in Appendix A - Column Types.

Refer to Appendix B - Match Keys for further details on all key fields.

Unique Ref

  • UniqueRef.

Name

  • FullName;
  • Prefix;
  • FirstNames;
  • Forenames - synonym for FirstNames;
  • MiddleNames;
  • Initials;
  • LastName;
  • Surname - synonym for LastName;
  • Qualifications;
  • Suffix;
  • Salutation;
  • Contact;
  • Prefix2;
  • FirstNames2;
  • Forenames2 - synonym for FirstNames2;
  • Initials2;
  • MiddleNames2;
  • LastName2;
  • Surname2 - synonym for LastName2.

Input fields:

  • InputFullName;
  • InputPrefix;
  • InputFirstNames;
  • InputForenames - synonym for InputFirstNames;
  • InputMiddleNames;
  • InputInitials;
  • InputLastName;
  • InputSurname - synonym for InputLastName;
  • InputQualifications;
  • InputSuffix.

Key fields:

  • NormalizedName;
  • NameKey;
  • PhoneticLastName;
  • Name1 - synonym for PhoneticLastName;
  • PhoneticFirstName;
  • Name2 - synonym for PhoneticFirstName;
  • PhoneticMiddleName;
  • Name3 - synonym for PhoneticMiddleName;
  • FirstNameFound;
  • Gender;

Organization

  • Organization;
  • Company - synonym for Organization;
  • Department;
  • JobTitle.

Input fields:

  • InputOrganization;
  • InputCompany - synonym for InputOrganization;
  • InputDepartment;
  • InputJobTitle.

Key fields:

  • NormalizedOrganization;
  • OrganizationKey;
  • PhoneticOrganizationName1;
  • OrgName1 - synonym for PhoneticOrganizationName1;
  • PhoneticOrganizationName2;
  • OrgName2 - synonym for PhoneticOrganizationName2;
  • PhoneticOrganizationName3;
  • OrgName3 - synonym for PhoneticOrganizationName3.

Address

  • Address1;
  • ...
  • Address9;
  • Town;
  • City - synonym for Town;
  • Region;
  • State - synonym for Region;
  • County - synonym for Region;
  • Postcode;
  • ZIP - synonym for Postcode;
  • PostOut;
  • PostIn;
  • Plus4 - synonym for PostIn;
  • DeliveryPoint;
  • DPS - synonym for DeliveryPoint;
  • Country;
  • Premise;
  • FlatNo;
  • SubPremise - synonym for FlatNo;
  • Thoroughfare;
  • Street - synonym for Thoroughfare.

Input fields:

  • InputAddress1;
  • ...
  • InputAddress9;
  • InputTown;
  • InputCity - synonym for InputTown;
  • InputRegion;
  • InputState - synonym for InputRegion;
  • InputCounty - synonym for InputRegion;
  • InputPostcode;
  • InputZIP - synonym for InputPostcode;
  • InputPostOut;
  • InputPostIn;
  • InputPlus4 - synonym for InputPostIn;
  • InputDeliveryPoint;
  • InputDPS - synonym for InputDeliveryPoint;
  • InputCountry.

Key fields:

  • AddressKey;
  • PhoneticStreet;
  • PhoneticTown.

Telephone

  • Telephone;
  • TelAreaCode;
  • TelLocalNumber;
  • Fax;
  • FaxAreaCode;
  • FaxLocalNumber.

Input fields:

  • InputTelephone;
  • InputFax.

Email

  • Email;
  • Username;
  • Domain;
  • TLD.

Input fields:

  • InputEmail.

Date of Birth

  • DateOfBirth;
  • DOB - synonym for DateOfBirth.

Input fields:

  • InputDateOfBirth;
  • InputDOB - synonym for InputDateOfBirth.

Custom Fields

  • CustomField1;
  • ...
  • CustomField9.

Input fields:

  • InputCustomField1;
  • ...
  • InputCustomField9.

Quality Scores

  • NameQuality;
  • AddressQuality;
  • EmailQuality.

These are numeric scores that indicate the quality of an item. 0 tyically indicates missing, badly-formatted, or nonsense data, increasing up to 9 for perfect-quality data. Note that quality scoring is not enabled by default (refer to the Advanced Configuration Guide for further details).

Other

  • Other;
  • Other1;
  • ...
  • Other32.

Miscellaneous

  • DataFlags - for diagnostic and testing purposes.

 

Appendix E - Advanced Post-Matching Rule Functions

The following functions can be used in Advanced Post-Matching Rules.

Functions returning True/false

  • LEVEL(matching_level) - Returns true if the match is at the specified level (see below for a definition of matching_level);
  • KEY(integer_value) - Returns true if the specified key was used to find the match (keys are numbered sequentially in the order they are defined in the config starting from 0);
  • ACRONYMMATCH() - Returns true if acronym matching was used;
  • CONTAINS(field1,field2) - Returns true if field1 contains field2;
  • CONTAINS(field,string_value) - Returns true if field contains the specified string literal;
  • NAMEMATRIX(matrix_index1,matrix_index2,matrix_index3) - Returns true if the specified name matrix entry was used (see below for a definition of matrix_index);
  • ORGANIZATIONMATRIX(matrix_index1,matrix_index2,matrix_index3) - Returns true if the specified organization matrix entry was used;
  • RECORD(logical_expression) - Returns the Boolean value of the logical expression when applied to a single record. For example, “Record(ForeNames==Lastname)” returns true if ForeNames matches Lastname in either record.;
  • MATCHES(field) - Returns true if the specified field is the same within the two matched records. For example, “Matches(Town)” returns true if the town is the same in both records;
  • BOTHEMPTY(field) - Returns true if the specified field is blank within both records;
  • ONEEMPTY(field) - Returns true if the specified field is blank within one (but not both) of the records;
  • SOUNDSEQUAL(field) - Returns true if the specified field sounds equal;
  • APPROX(field) - Returns true if the specified field matches approximately;
  • SOUNDSAPPROX(field) - Returns true if the specified field sounds approximately;
  • LOCATIONMATCH - Returns true if the location is matched; (From version 2.0.3)
  • PREMISELOOSE - Returns true if the premise is loosely matched; (From version 2.0.3)
  • PREMISETIGHT - Returns true if the premise is tightly matched; (From version 2.0.3)
  • PREMISEONEEMPTY - Returns true if one of the records in the match has an empty premise; (From version 2.0.3)
  • DIRECTIONALUNMATCHED - Returns true if the directional is unmatched; (From version 2.0.3)
  • NUMERICSTREETUNMATCHED - Returns true if a numeric street is unmatched; (From version 2.0.3)
  • BUILDINGMATCH - Returns true if the building is matched; (From version 2.0.3)
  • BUILDINGONEEMPTY - Returns true if one of the records in the match has an empty building; (From version 2.0.3)

Functions returning a string

N.B. The string functions RECORD1() and RECORD2() are really only meaningful in overlap mode, where record 1 comes from the main table and record 2 comes from the overlap table.

  • RECORD1(field) - Return the value of the field from record 1;
  • RECORD2(field) - Return the value of the field from record 2;

Functions returning an integer

  • LENGTH(field) - Returns the length of a field;

Where:

  • matching_level is one of:
    • IndividualLevel
    • FamilyLevel
    • AddressLevel
    • BusinessLevel
    • Custom
  • matrix_index is one of:
    • EqualIndex
    • SoundsEqualIndex
    • BothEmptyIndex
    • OneEmptyIndex
    • ApproxIndex
    • SoundsApproxIndex
    • ContainsIndex
    • UnequalIndex

Logical operations involving the LENGTH() or CONTAINS() functions are true if the condition applies to both records. If you want to test the length of either record use the RECORD() function - Record(Length(field))

 

 

 

Appendix F - Nationalities

mHUB must be initialized with one from the following nationalities:

  • Other;
  • UK;
  • USA;
  • Ireland;
  • France;
  • Germany;
  • Spain;
  • Portugal;
  • Sweden;
  • Denmark;
  • Norway;
  • Australia;
  • NewZealand;
  • Austria;
  • Switzerland;
  • Netherlands.

If the data being processed is predominantly of a nationality that isn't listed, choose Other.

 

Appendix G - Between Overlaps

The following settings can be changed between overlaps.

  • layout of table 2, data options;
  • matching outputs and output options;
  • matching advanced post-matching rules and bridging prevention;
  • matching weights and constraints;
  • debug;
  • threads.

I.e.

<settings>

<data>
<input table="2" columns="|UniqueRef|FullName|Address1|Address2|Address3|Postcode" />
<options>
<trimAllData>false</trimAllData>
<verifyInputColumns>true</verifyInputColumns>
</options>
</data>

<matching>
<outputs>
<types>
<matchingPairs enabled="" />
<groupedMatchingPairs enabled="" />
<matchingGroups enabled="true" />
<dedupedData enabled="" />
<duplicateData enabled="" />
</types>
<options>
<outputUniqueRefsOnly enabled="true" />
<outputComponentScores enabled="false" />
<outputExactMatchScores enabled="false" />
<outputAllExactMatches enabled="false" />
</options>
</outputs>
<advanced>
<postMatchingRules>
<rule condition="" action="" />
<rule condition="" action="" />
...
</postMatchingRules>
<bridgingPrevention>
<nameBridgingPrevention enabled="true" />
<prefixBridgingPrevention enabled="true" />
<companyBridgingPrevention enabled="true" />
<aggressiveSplitting enabled="false" />
</bridgingPrevention>
<options>
<outputDeletedMatches enabled="" />
</options>
<associations>
<associate element1="" element2="" />
<associate element1="" element2="" />
...
</associations>
</advanced>
</matching>

<threads>0</threads>

<debug enabled="false" pathname="C:\Temp\matchIT_Hub_debug.log" truncate="false" />
</settings>

An attempt to modify any other settings between overlaps will result in the error "Attempt to change invariant settings when already running".

 

Appendix H - Phonetic Algorithms

soundIT

The API provides a unique phonetic algorithm for name matching, called soundIT. soundIT takes account of vowel sounds and syllables in the name, and, more importantly, determines the stressed syllable in the word. This means that "Batten" and "Batton" sound the same according to soundIT, as the different letters fall in the unstressed syllable, whilst "Batton" and "Button" sound different, as it is the stressed syllable which differs. Another advantage of soundIT is that it can recognize groups of vowels and consonants that form vowel sounds - thus it can equate "Shaw" and "Shore", "Wight" and "White", "Naughton" and "Norton", and "Leighton" and "Layton" (which are all reasonably common English last names).

This algorithm was developed with extensive testing on a large table of the most common last names in the UK. Therefore, it is specifically designed to be used with English names. If a file with mostly non-English names is processed through the mAPI, then you may want to try the 'Loose' soundIT or Soundex algorithms instead. For US data we recommend that you use soundIT, because it is proven to work well also with Spanish, German and other names that occur commonly in the US. soundIT has been designed with foreign language versions in mind (i.e. for data collected in countries where foreign languages are spoken). These could quite easily be developed, according to demand. Please contact your supplier if you are interested in this.

Note that the keys that the mAPI generates are 'Loose' soundIT keys, where all vowel sounds are equated, together with some consonants, such as 'm' and 'n', 'd' and 't', 's' and 'f'. This is so that potential matches are not missed from candidate match groups based on the phonetic keys; The API uses the 'full' soundIT algorithm at the scoring stage, for matching accuracy.

Loose soundIT

This option is effectively the same as the soundIT option, except that the API uses the 'Loose' soundIT algorithm as described above at the scoring stage. This is for use mainly with non-English names, on which soundIT works less effectively, and can miss true matches. This option should not be used on files with mainly English names, as it can potentially lead to more false matches.

Dynamic soundIT

This is a hybrid of the soundIT and Loose soundIT phonetic algorithms. Firstly, the loose algorithm is used to generate the phonetic form of a word. By default, if it contains only one vowel sound, then the standard soundIT algorithm is used instead. This can improve accuracy when matching mono-syllabic words and can help to reduce the number of false matches.

See EngineSettings.Compare.Phonetic.LooseThresholdForDynamicSoundIT below.

Soundex

Soundex is a widely-used algorithm (patented just after the First World War!), which constructs a crude non-phonetic key by keeping the initial letter of the name, then removing all vowels, plus the letters H, W and Y, and translating the remaining letters to numbers. It gives the same number to letters that can be confused e.g. 'm' and 'n' both become 5. It also drops repeated consonants and consecutive letters that give the same number e.g. S and C. It only takes the first four characters of the result, or pads it out with zeroes if it is less than four long. Thus all the common spellings and misspellings of the name "Tootill" equate to the same Soundex key: Tootill, Toothill, Tootil, Tootal, Tootle, Tuthill, Totill are all translated to "T340".

The algorithm that the mAPI uses is an enhanced version of Soundex, and is for use mainly with non-English names. This option should not be used on files with mainly English names, as it can lead to False matches e.g. Brady, Beard and Broad get the same Soundex key.

 

None

This option constructs a non-phonetic version of the supplied name fields as match keys and allows only non-phonetic name matching.

 

Appendix I - Fuzzy Algorithms

matchIT Fuzzy

The default matchIT algorithm for fuzzy string matching allows for only 1 error of one of the following types: insertions, deletions, substitutions, transpositions, and wide transpositions. Where: a transposition is where two adjacent characters are swapped, and a wide transposition is where two non-adjacent characters are swapped.

The matchIT Fuzzy compare algorithm is equivalent to Damerau-Levenshtein with an edit distance of 1, except that Damerau-Levenshtein doesn't allow for wide transpositions.

Damerau-Levenshtein

This is an extension to the original Levenshtein distance algorithm that handles transpositions as well as insertions, deletions, and substitutions. Both algorithms are in the public domain.

The algorithms produce a string metric ("edit distance") for measuring the similarity between two strings (i.e. the number of changes to get from one string to the other). matchIT then produces a fuzzy score between 0 and 1, which is a measure of the edit distance in relation to the length of the longer string. Two configuration settings provide fine tuning of fuzzy matching Compare | Fuzzy | MaximumEditDistance and Compare | Fuzzy | MinumumScore.

Each string comparison must satisfy both constraints. See below for some examples that provide a rational for both constraints.

Examples

Using the default settings, these strings will be matched:

Type First word Second word Edit distance Score
Identical JO JO 0 1
Identical JOHN JOHN 0 1
Identical JONATHAN JONATHAN 0 1
Fuzzy (reversal) JO OJ 1 0.5
Fuzzy (insertion) AB ABC 1 0.667
Fuzzy (substitution) (#) A B 1 0.5
Fuzzy (substitution) JOHN JAHN 1 0.75
Fuzzy (transposition) JOHN JHON 1 0.75
Fuzzy (insertion) REBECCA REBECCCA 1 0.875

An exception is made for single-letter matches. matchIT only allows single-letter matches if they’re similar (i.e. G and J, S and F, M and N), and does not compare them using Damerau-Levenshtein.

Using the default settings, the following strings will not be matched – either because the edit distance exceeds the maximum (1) or the score hasn’t reached the minimum (0.5):

Type First word Second word Edit distance Score
Fuzzy (insertions) (*) ANTON ANTHONY 2 0.714
Fuzzy (insertions) (*) REBECCA REBBECCCA 2 0.778
Fuzzy (substitutions) (*) LEONARD LOENRAD 2 0.857
Fuzzy (reversal) (*) JON NOJ 2 0.333
Fuzzy (reversal) JOHN NHOJ 3 0.25
Different (+) JO ED 2 0
Different DAVE ED 3 0.25
Different JOHN DAVE 4 0
Different NOTTINGHAM GLOUCESTER 10 0
Containment NOTTINGHAM NOTTINGHAMSHIRE 5 0.667

By increasing the MinimumEditDistance to 2, the three fuzzy matches marked (*) will be found. However, the row marked with (+) still won’t be matched, even though its edit distance is 2, because its score is below the minimum of 0.5. Hence the need for both configuration options.

(Of course, some of these matches are going to be found because they’re phonetically identical or similar (e.g. Rebecca and Rebbeccca), these examples are just to illustrate what can be achieve by using Damerau-Levenshtein regardless of phoneticization.)