This is a compulsory setting. The processing mode must be configured before any data can be added to the mHUB engine.
Currently supported processing modes:
- Matching - for finding matching records within a single data source, or for finding the overlap of two data sources (i.e. records that appear in both);
- Lookup - primarily for online duplicate-prevention applications, in which individual records are looked up in a reference data source;
- Normalization - for outputting a 'normalized' (cased, standardized, parsed, extracted, modified) copy of the input data.
(Additional processing modes will be added in future releases of the product.)
Data settings specify the layout of the data to be added to mHUB and options for pre-processing the incoming data.
<input columns="|UniqueRef|FullName|Address1|Address2|Address3|Postcode" />
This is a compulsory setting. The input data definition must be configured before any data can be added to the mHUB engine.
The input data definition is a delimited string, containing one or more column types. The delimiter is indicated by the first character of the string; this must be a non-alphanumeric character. Refer to Appendix A for a list of all available column types. Each type (except for Other) can appear only once in the definition.
When overlapping two data sources that have different column layouts, each source's definition can be specified separately:
<input table="1" columns="|UniqueRef|FullName|Address1|Address2|Postcode" />
<input table="2" columns="|UniqueRef|FirstNames|LastName|Address1|Address2|Postcode" />
Optionally, each data source can be tagged with a descriptive name by including a 'name' attribute. The name will be written to the statistics XML that's output by the GetStats/getStats engine method. For example:
<input table="1" columns="..." name="Master Customer Table" />
<input table="2" columns="..." name="Incoming Feed" />
When a record contains multiple elements, additional names, organizations, addresses (including postcodes), telephones, faxes, emails, and jobs can be mapped in the data source definitions by prefixing column names with Second, Third, Fourth, or Fifth.
- Map two names using FullName and SecondFullName.
- Map two addresses using Address1-9 and SecondAddress1-9.
- Map three postcodes using Postcode, SecondPostcode, and ThirdPostcode.
- Map five emails using Email, SecondEmail, ThirdEmail, FourthEmail, and FifthEmail.
A prefix of First is permitted, but is not necessary; FirstEmail is the same as Email. Elements can be mapped in any order.
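For example, a definition that maps two names and two emails, with the elements interleaved rather than grouped, might look like this (an illustrative layout built from the column types above; any combination of valid column types is possible):

```xml
<input columns="|UniqueRef|FullName|Email|SecondFullName|Address1|Postcode|SecondEmail" />
```

Note that SecondFullName and SecondEmail need not immediately follow their First counterparts.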
For further details on matching records that contain multiple elements, see Associations.
trimAllData: When enabled, leading and trailing whitespace is trimmed from all input data. Because added data is otherwise stored unmodified, enabling this setting can reduce memory usage.
verifyInputColumns: All added data is parsed according to the configured input data definition, and processing is aborted if any data is encountered that doesn't conform to the definition. Disabling this setting allows all data to be processed, whether or not it conforms to the definition. Only disable it when necessary, for example when the data can't easily be corrected before it's added.
textQualifier (From version 220.127.116.11): The fields of an input data record may contain embedded delimiters, in which case the field is wrapped in double quote characters. If textQualifier is false, the record parser does not check for quotes or embedded delimiters; in that case the delimiter must be a character that never appears in the data.
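For illustration, given the pipe-delimited layout shown earlier (|UniqueRef|FullName|Address1|Address2|Address3|Postcode), a record whose Address1 field contains the delimiter would be quoted like this (sample data, not taken from the product):

```
0001|John Smith|"Unit 1|The Old Mill"|High Street|London|SW1A 1AA
```

With textQualifier enabled the quoted field is read as a single Address1 value; with it disabled, the embedded pipe would be treated as a field separator and the record would not parse as six columns.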
abortOnSchemaError (From version 2.0.3): When enabled (or prior to version 2.0.3), if an input data record doesn't match the defined column layout, the engine goes into an aborted state. When disabled (the default), the malformed record is simply logged and ignored.
abortOnDataError (From version 18.104.22.168): When enabled (or prior to version 22.214.171.124), if an input data record contains invalid UTF-8, the engine goes into an aborted state. When disabled (the default), the record containing invalid data is simply logged and ignored.