Processing Modes


A mHUB engine must be configured for a processing mode.

Currently, mHUB provides three processing modes: Matching, Lookup, and Normalization. (Other processing modes will be added to mHUB in future versions.)

 

Matching

The Matching mode is used for finding matching records in one or two data sources.

When one data source is used, matching records are identified within that source. This is primarily for data cleansing: for example, removing duplicate contacts from a master table.

When two data sources are used, matching records that intersect the two sources are identified. This is useful, for example, for identifying contacts to be suppressed from a mailing, for updating contacts' details, or for merging new contacts into a master table. Matching records are not identified within each table, but duplicates within a table can be removed beforehand if necessary. For example, when merging new contacts from an update table into a master table, duplicate contacts can first be removed from the update table, and the cleaned update table can then be overlapped with the master table. These two steps must be performed separately, using either a single mHUB engine or two chained engines.
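The difference between the two cases can be sketched in a few lines of Python. This is an illustration only, not mHUB's matching algorithm or API: the `match_key` function is a hypothetical stand-in (a lowercased, whitespace-collapsed name) for whatever matching the engine is configured to perform, and the record layout is assumed.

```python
from itertools import combinations, product


def match_key(record):
    """Hypothetical stand-in for real matching: lowercased name, whitespace collapsed."""
    return " ".join(record["name"].lower().split())


def match_within(table):
    """One data source: pairs of matching records inside the same table."""
    return [(a["id"], b["id"])
            for a, b in combinations(table, 2)
            if match_key(a) == match_key(b)]


def match_across(table1, table2):
    """Two data sources: pairs that span both tables, never pairs within one table."""
    return [(a["id"], b["id"])
            for a, b in product(table1, table2)
            if match_key(a) == match_key(b)]


master = [
    {"id": "M1", "name": "John Smith"},
    {"id": "M2", "name": "Ann Jones"},
    {"id": "M3", "name": "john  smith"},
]
update = [
    {"id": "U1", "name": "ANN JONES"},
    {"id": "U2", "name": "Raj Patel"},
]

within_pairs = match_within(master)          # duplicates inside the master table
across_pairs = match_across(master, update)  # overlap between master and update
```

Note that `match_across` does not report the M1/M3 duplicate inside the master table, which mirrors the point that matching records are not identified within each table when two sources are used.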

The output of the Matching mode is configurable. There are five output types, and any combination of one or more types can be specified. Each result is prefixed with a two-letter code that identifies the type of result:

  • MP - matching pairs: two matching records, plus the score for each level they match at;
  • GP - grouped matching pairs: as per a matching pair, plus the group that contains the pair;
  • MG - matching groups: a single record, plus the group that contains it;
  • DE - deduped data: a copy of the input data minus all duplicate records (retaining the master/best record in each group);
  • DU - duplicate data: all duplicate records from the input data.

Note that all DE and DU results, added together, give the original input data.
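As a rough illustration of how the DE and DU results partition the input, consider the following Python sketch. The record layout, the `groups` structure, and the choice of master record are assumptions made for the example; this section does not show mHUB's actual output format.

```python
def split_deduped(records, groups):
    """Split records into DE (kept) and DU (duplicate) lists.

    `groups` maps a master record id to the ids of its duplicates.
    This structure is an assumption made for the example.
    """
    duplicate_ids = {dup for dups in groups.values() for dup in dups}
    deduped = [r for r in records if r["id"] not in duplicate_ids]
    duplicates = [r for r in records if r["id"] in duplicate_ids]
    return deduped, duplicates


records = [
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "J. Smith"},
    {"id": 3, "name": "Ann Jones"},
    {"id": 4, "name": "Jon Smith"},
]
groups = {1: [2, 4]}  # record 1 is the master; 2 and 4 are its duplicates

deduped, duplicates = split_deduped(records, groups)

# Tag each result with its two-letter type code.
results = [("DE", r) for r in deduped] + [("DU", r) for r in duplicates]
```

Here the DE list holds records 1 and 3 (the master plus the unmatched record), the DU list holds records 2 and 4, and every input record appears in exactly one of the two lists.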

Most mHUB applications enable the MG (matching groups) output type. Each matching group is output in turn, together with every record in it (always two or more). One record in each group is identified as the master, or best, record.

Alternatively, to output a cleaned ("deduped") version of the input data, the DE (deduped data) output type can be enabled. The output can be fed directly into a new database table, for example.

 

Lookup

The Lookup processing mode is primarily intended for use in online duplicate prevention applications, in which reference data is searched for candidate records that match the lookup data (e.g. a single record).

When a mHUB engine is configured for Lookup, two data sources must be specified. The first data source (table 1) contains the reference data. Once loaded, the engine retains this in memory and waits for lookup requests.

When a new lookup is to be made, the data is added to the second data source (table 2). Fundamentally, the underlying process performs an overlap, finding the intersection of the two tables (i.e. the reference data and the single added record). Any results are returned to the caller.

Multiple lookups can be performed concurrently by separate threads - for example, within a web server that processes simultaneous lookup requests from multiple users. When a lookup is added to table 2, the calling thread is associated with that lookup; when the same thread retrieves the results from the engine, only the results that correspond to the lookup are actually returned.
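The thread association described above can be sketched as follows. `ToyLookupEngine`, `add_lookup`, and `get_results` are hypothetical names invented for this illustration and are not the mHUB API; the exact-match comparison on a lowercased email is likewise only a placeholder for real matching.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyLookupEngine:
    """Hypothetical stand-in for an engine in Lookup mode (not the mHUB API)."""

    def __init__(self, reference):
        self._reference = reference   # table 1: reference data held in memory
        self._pending = {}            # table 2: one lookup per calling thread
        self._lock = threading.Lock()

    def add_lookup(self, record):
        # Associate the lookup with the calling thread.
        with self._lock:
            self._pending[threading.get_ident()] = record

    def get_results(self):
        # Return only the results for the calling thread's own lookup.
        with self._lock:
            record = self._pending.pop(threading.get_ident(), None)
        if record is None:
            return []
        key = record["email"].strip().lower()
        return [ref for ref in self._reference
                if ref["email"].strip().lower() == key]


engine = ToyLookupEngine([
    {"id": 1, "email": "john@example.com"},
    {"id": 2, "email": "ann@example.com"},
])


def handle_request(email):
    """Simulates one web request: add a lookup, then fetch its results."""
    engine.add_lookup({"email": email})
    return [ref["id"] for ref in engine.get_results()]


with ThreadPoolExecutor(max_workers=3) as pool:
    answers = list(pool.map(
        handle_request,
        ["ANN@example.com", "john@example.com ", "new@example.com"],
    ))
```

Keying pending lookups by thread identifier is one simple way to ensure that a thread only ever sees the results of its own lookup; how the engine does this internally is not described here.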

 

Normalization

The Normalization processing mode outputs a normalized copy of all input data. Normalization processes include:

  • casing (for example, "mr john smith" becomes "Mr John Smith");
  • standardization (such as trimming of redundant whitespace and correctly formatting postcodes/ZIPs);
  • extraction of name elements (including prefixes, firstnames, lastnames, suffixes, and qualifications);
  • determination of gender (male, female, or either);
  • generation of salutations (for example, "Dear Mr Smith");
  • extraction of address elements (including premises, towns/cities, and postcodes/ZIPs);
  • splitting of email addresses into usernames and domains.
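A few of the processes listed above can be illustrated with a minimal Python sketch. It is deliberately naive and is not how mHUB implements normalization: `normalize_contact` is a hypothetical function, it assumes the name always has the layout "prefix firstname lastname", and its title-casing would mishandle names such as "McDonald".

```python
def normalize_contact(name, email):
    """Toy normalization: whitespace, casing, name elements, salutation, email parts."""
    # Standardization: trim and collapse redundant whitespace.
    clean = " ".join(name.split())
    # Casing: naive title-casing; real names need more careful rules.
    cased = clean.title()
    parts = cased.split(" ")
    if len(parts) != 3:
        raise ValueError("this sketch only handles 'prefix firstname lastname'")
    prefix, firstname, lastname = parts
    # Email: split on the last "@"; only the domain is lowercased.
    username, _, domain = email.strip().rpartition("@")
    return {
        "name": cased,
        "prefix": prefix,
        "firstname": firstname,
        "lastname": lastname,
        "salutation": f"Dear {prefix} {lastname}",
        "email_username": username,
        "email_domain": domain.lower(),
    }


contact = normalize_contact("  mr   john smith ", "John.Smith@Example.COM")
```

For this input the sketch yields the name "Mr John Smith", the salutation "Dear Mr Smith", and the email parts "John.Smith" and "example.com".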

Full control over the output fields is available. Any combination of fields can be output (all fields, several fields, or even a single field), including a copy of the input data. Phonetic match keys can also be output, for advanced matching and diagnostic purposes.

Alternatively, if no output fields have been configured, mHUB automatically determines which fields to output, depending on the type of data that is input to the engine. In this case, the engine outputs a header row before the output data. The header row indicates which fields are being output, and can be used to dynamically create a destination table in a database before the normalized data is imported.
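One way a header row might be used to create a destination table is sketched below with Python's standard `csv` and `sqlite3` modules. The comma-separated layout, the field names, and the table name `normalized` are all assumptions made for the example; the actual format of the engine's output is not specified in this section.

```python
import csv
import io
import sqlite3

# Assumed output: comma-separated text whose first row names the fields.
output = io.StringIO(
    "Prefix,Firstname,Lastname,Salutation\n"
    "Mr,John,Smith,Dear Mr Smith\n"
    "Ms,Ann,Jones,Dear Ms Jones\n"
)

reader = csv.reader(output)
header = next(reader)  # the header row precedes all output data


def quote_identifier(name):
    """Quote a column name for SQLite, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


columns = ", ".join(f"{quote_identifier(name)} TEXT" for name in header)
placeholders = ", ".join("?" for _ in header)

conn = sqlite3.connect(":memory:")
# Create the destination table from the header, then import the remaining rows.
conn.execute(f"CREATE TABLE normalized ({columns})")
conn.executemany(f"INSERT INTO normalized VALUES ({placeholders})", reader)
conn.commit()

rows = conn.execute(
    'SELECT "Lastname", "Salutation" FROM normalized ORDER BY rowid'
).fetchall()
```

Every column is created as `TEXT` here for simplicity; a real import would probably choose column types to suit the data.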

In addition, data can be modified as it is output. This is done via simple replacements: for example, replacing "Apartment" with "Apt" in address lines, or replacing "USD" with "$" in a custom field.
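The effect of such replacements can be approximated in Python as shown below. This is only a sketch: `apply_replacements` is a hypothetical helper, and matching on whole words is an assumption, since the exact replacement rules mHUB applies are not described here.

```python
import re


def apply_replacements(value, replacements):
    """Replace whole-word occurrences of each key with its mapped value."""
    for old, new in replacements.items():
        pattern = rf"\b{re.escape(old)}\b"
        # A function is used as the replacement so that `new` is inserted literally.
        value = re.sub(pattern, lambda _match: new, value)
    return value


address = apply_replacements("Apartment 4, 12 High Street", {"Apartment": "Apt"})
price = apply_replacements("USD 25.00", {"USD": "$"})
untouched = apply_replacements("Apartments to let", {"Apartment": "Apt"})
```

Because the sketch matches whole words only, "Apartments" in the last call is left unchanged.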