Alteryx - Advanced Settings

The advanced settings dialog is a popup tabbed dialog.

 

Matching tab

 

Minimum scores

Specifies a matching threshold score for each matching level enabled.

Maximum cluster size

All processed data is added to clusters. When a record is added to an existing cluster, it's then compared to each record already in the cluster, provided the maximum cluster size has not been exceeded. If the cluster has reached the maximum size, then no more comparisons will be performed on that cluster and it will be logged as a large cluster.
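
The behaviour described above can be sketched roughly as follows. This is an illustrative model only, not the engine's implementation: the function name, the dictionary layout of a cluster, and the `compare` callback are all hypothetical, and it assumes "reached the maximum size" means the cluster already holds `max_size` records.

```python
def add_record(cluster, record, max_size, compare, large_cluster_log):
    """Add `record` to `cluster`, comparing it to existing members if there is room."""
    results = []
    if len(cluster["records"]) >= max_size:
        # Cluster is full: skip comparisons and note it as a large cluster.
        large_cluster_log.add(cluster["key"])
    else:
        for existing in cluster["records"]:
            results.append((existing, record, compare(existing, record)))
    # All processed data is still added to the cluster.
    cluster["records"].append(record)
    return results
```

In this sketch a larger maximum cluster size means more pairwise comparisons per cluster, which is the trade-off the setting controls.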

 

 

 

Output Options

Output unique refs only

If enabled, then only unique refs are output. If disabled (default), then the output contains a copy of the input data, which can include the unique ref.

Output component scores

If enabled, then scores for mapped components are output for each matching level in addition to total scores. If disabled (the default), then only the total score for each matching level is output.

Output exact match scores

If enabled, a total score is output for exact matches, equal to the sum of the sure score settings for all mapped components plus one. Otherwise, the score field is blank for exact matches. Regardless of this setting, the component scores for exact matches are always blank.
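
As a worked example of the arithmetic, the short sketch below computes the exact-match total from a set of sure scores. The component names and score values are made up for illustration; only the "sum of sure scores plus one" rule comes from the description above.

```python
def exact_match_total(sure_scores):
    """Total score reported for an exact match: sum of the sure scores, plus one."""
    return sum(sure_scores.values()) + 1

# Hypothetical sure-score settings for two mapped components.
example = {"name": 40, "address": 30}
total = exact_match_total(example)  # 40 + 30 + 1
```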

Output all exact matches

When disabled (the default), matching pairs are only output if a record exactly matches the first record of a cluster. If enabled, then all matching pairs are output.

Output highest scores

If enabled, highest scores are also output to the Grouped Matching Pairs and Matching Groups output types. This is the highest score achieved by any matching pair within each group. (Conversely, the base score is the lowest score achieved by the pairs within a group and is always output.)

Output duplicates count

If enabled, the number of duplicates in each group is output to the Matching Groups output type. This is one less than the number of records in the group.
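
The relationship between the base score, the highest score, and the duplicates count can be illustrated with a small sketch. The function and field names are hypothetical; the inputs are simply the record identifiers in a group and the scores of its matching pairs.

```python
def group_summary(record_ids, pair_scores):
    """Summarise a matching group from its records and its pair scores."""
    return {
        "base_score": min(pair_scores),      # lowest pair score, always output
        "highest_score": max(pair_scores),   # output only if 'Output highest scores' is enabled
        "duplicates": len(record_ids) - 1,   # one less than the number of records
    }
```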

Output compare results

If enabled, the matching matrices indices and acronym match flag are included in the Matching Pairs output type.

 

Grouping options

 

Name bridging prevention

Prevents records with different forenames being grouped together because they all match to a record that is missing forename.

Prefix bridging prevention

Prevents records with ‘Miss’ and ‘Mrs’ being grouped together because they match to a record with ‘Ms’.

Company bridging prevention

Prevents records with different company names being grouped together because they all match to a record with an acronym. (e.g. IBM matches “International Business Machines” and “Injection Blow Moulding”).

Aggressive splitting

If enabled, bridging records will be disassociated from all matching records. If disabled (the default), bridging records will remain matched to one sub-group of non-bridging records.
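
The following sketch illustrates name bridging prevention and the effect of aggressive splitting on a single group. It is a simplification under stated assumptions: forenames are compared by exact equality (the real engine matches fuzzily), a record with a blank forename is treated as the bridging record, and when aggressive splitting is off the bridge is attached to the first sub-group, which is an arbitrary choice made for the example.

```python
from collections import defaultdict

def split_group(group, aggressive=False):
    """Split a group whose members were only linked through blank-forename records."""
    by_forename = defaultdict(list)
    bridges = []
    for rec in group:
        if rec["forename"]:
            by_forename[rec["forename"]].append(rec)
        else:
            bridges.append(rec)
    subgroups = list(by_forename.values())
    if len(subgroups) < 2:
        return [list(group)]  # only one forename present, so nothing is bridged
    if aggressive:
        # Bridging records are disassociated from all matching records.
        subgroups.extend([b] for b in bridges)
    else:
        # Bridging records stay with one sub-group (assumption: the first).
        subgroups[0].extend(bridges)
    return subgroups
```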

Master record identification

If enabled (default), the master record in each group is chosen according to: Master Priorities rules, then address length, then lowest UniqueRef. If disabled, the master record in each group is simply the record with the lowest UniqueRef.
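
The selection order can be expressed as a sort key. In this sketch the record fields are hypothetical, and two assumptions are made that the text above does not state: a higher `priority` value (standing in for the outcome of the Master Priorities rules) is better, and a longer address is preferred.

```python
def pick_master(group, use_priorities=True):
    """Choose the master record of a group."""
    if not use_priorities:
        # Setting disabled: simply the lowest UniqueRef.
        return min(group, key=lambda r: r["unique_ref"])
    # Priority first, then address length, then lowest UniqueRef as the tie-breaker.
    return min(group, key=lambda r: (-r["priority"], -len(r["address"]), r["unique_ref"]))
```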

 

 

Match Keys

Match keys determine how records are clustered. Clusters are used to group potentially matching records. When a new record is added to an existing cluster (containing one or more existing records), the record is compared to each of those existing records.

 

Keys

Lists the keys that will be used to cluster records for matching.

Key types

Keys are grouped into ‘exact keys’ and ‘fuzzy keys’. All the records in a fuzzy key cluster are compared to one another. All the records in an exact cluster are automatically considered matching, without needing to compare.

Key fields

Each key is a combination of key fields, e.g. Address Key + Premise.

Key functions

Functions (such as UPPER, TRIM, etc.) can be applied to key fields. Functions are best used with raw input data (names, address lines, postcodes, etc.) rather than with the key fields generated by the Hub engine (NameKey, AddressKey, etc.).

Allow blank keys

Key fields can be marked as 'optional' by enclosing them within square brackets. Alternatively, enabling ‘allow blank keys’ makes all key fields optional. The two methods cannot be used at the same time.
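
To make the idea of keys, key functions, and optional key fields concrete, here is a minimal sketch of clustering records by a composite key. It does not use the product's key syntax: a key is represented as a list of hypothetical `(field, optional)` pairs, UPPER and TRIM are applied to every field, and it assumes that a record with a blank required field is simply not clustered on that key.

```python
from collections import defaultdict

def build_key(record, fields, allow_blank=False):
    """Build a composite key, or return None if a required field is blank."""
    parts = []
    for name, optional in fields:
        value = (record.get(name) or "").strip().upper()  # TRIM then UPPER
        if not value and not (optional or allow_blank):
            return None
        parts.append(value)
    return "|".join(parts)

def cluster(records, fields, allow_blank=False):
    """Group records that share the same key value."""
    clusters = defaultdict(list)
    for rec in records:
        key = build_key(rec, fields, allow_blank)
        if key is not None:
            clusters[key].append(rec)
    return dict(clusters)
```

For example, with fields `[("AddressKey", False), ("Premise", True)]` a record with no premise still receives a key, whereas with both fields required it would be left out of that key's clusters.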

Dynamic keys

In overlap mode (and lookup mode) enabling this option instructs Matching to dynamically choose which keys to use (from those defined), on a record-by-record basis, depending on which input columns are populated.
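
One way to picture dynamic keys is as a per-record filter over the defined keys. The sketch below is an assumption about the selection logic, not the engine's actual rule: a key is considered usable for a record when every field it needs is populated. The key names and field names are hypothetical.

```python
def usable_keys(record, key_definitions):
    """Return the names of keys whose fields are all populated in this record."""
    return [
        name
        for name, fields in key_definitions.items()
        if all((record.get(f) or "").strip() for f in fields)
    ]
```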

 

Matching Rules

 

Matching levels

The Matching Rules dialog has one tab for each matching level enabled (Individual, Name only, Family, Address, Business, Company only).

Weights

Weights are used when compared records are scored. Weights are configured automatically when the basic configuration settings are specified (nationality, tightness), so there is no need to configure them manually unless customizations are being made.

Thresholds

Scoring thresholds can be applied to provide further matching requirements when two records are compared. It is not recommended to change these settings.

 

Constraints

Must match gender

When enabled, potential matches will be disregarded if their genders differ. However, if the gender is unknown in one or both of the records, the records will potentially be classed as a match.

Must match suffix

When enabled, potential matches will be disregarded if their suffixes differ. However, if the suffix is unknown in one or both of the records, the records will potentially be classed as a match.
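
The gender and suffix constraints share the same shape: differing values block a match, but an unknown value on either side does not. A minimal sketch, with a hypothetical function name and simple case-insensitive comparison:

```python
def passes_must_match(value1, value2):
    """True if the pair may still match under a 'must match' constraint."""
    a = (value1 or "").strip().upper()
    b = (value2 or "").strip().upper()
    if not a or not b:
        return True  # unknown in one or both records: do not disregard
    return a == b
```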

Must match joint names

When enabled, potential matches will be disregarded if one record has a joint name but the other doesn't. For example, normal behaviour will match "Mr and Mrs J Smith" with "Mr J Smith"; enabling must match joint names will prevent such matches.

Address constraints

The address matching constraints (must match location, premise, directional, etc) are now implemented via post-matching rules, so do not need to be configured here.

 

Matching Matrices

Three-dimensional matching matrices are used to decide the level of match that records should achieve. In the name matching matrix, the three dimensions represent the individual name fields: last name, first name, middle name. In the company matching matrix, the three dimensions are name1, name2, and name3. The matrix maps the match type for these individual name fields (equal, both_empty, one_empty, sounds_equal, etc.) to an overall match level (sure, likely, possible, etc.).
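
Conceptually, a matching matrix is a lookup from three per-field match types to one overall level. The sketch below shows that shape only. The four entries and the `"none"` default are invented for illustration and are not the product's actual matrix values.

```python
# Hypothetical entries: (last name, first name, middle name) -> overall level.
NAME_MATRIX = {
    ("equal", "equal", "equal"): "sure",
    ("equal", "equal", "both_empty"): "sure",
    ("equal", "sounds_equal", "one_empty"): "likely",
    ("sounds_equal", "one_empty", "both_empty"): "possible",
}

def match_level(last, first, middle, default="none"):
    """Look up the overall match level for three per-field match types."""
    return NAME_MATRIX.get((last, first, middle), default)
```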

 

 

Post-Matching Rules

 

Advanced Post-Matching rules are applied to matching pairs prior to grouping, and only to fuzzy compared matches. Each rule specifies a condition, written in a SQL-like syntax, and an action that determines what happens when the condition is satisfied.

 

Conditions

Rule conditions are logical expressions that result in a Boolean (true or false). An expression can be a function, such as “matches(city)”, or a logical operation, such as “AddressScore >= 30” or “City == ‘RALEIGH’”. Conditions can consist of a single logical expression or of multiple expressions (combined using “and” or “or”).

Actions

Rule actions are either "Keep" or "Delete". If any successful rule specifies a Keep action, then the match is kept. If any successful rule specifies a Delete action, then the match is deleted, but only if the match isn’t being kept.
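
The precedence of Keep over Delete can be sketched as below. Python callables stand in for the SQL-like rule conditions, and it is assumed that a match with no successful rule is left as it was (kept); the function name is hypothetical.

```python
def apply_rules(match, rules):
    """Return True if the match is kept after applying (condition, action) rules."""
    actions = {action for condition, action in rules if condition(match)}
    if "Keep" in actions:
        return True   # any successful Keep rule wins
    if "Delete" in actions:
        return False  # deleted only because nothing is keeping it
    return True       # no rule fired (assumption: match is unaffected)

rules = [
    (lambda m: m["AddressScore"] < 30, "Delete"),
    (lambda m: m["City"] == "RALEIGH", "Keep"),
]
```

With these two example rules, a low-scoring pair in RALEIGH is kept, while a low-scoring pair elsewhere is deleted.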

 

Master Priorities

 

Master priority rules are used to determine which record in a matching group should be marked as the master record (i.e. the best record).

 

Word Lookup

The Names and Words tables (NAMES.DAT & NAMES2.DAT) control:

  • the matching equivalent of words e.g. Tony = Anthony
  • the gender of forenames e.g. John = Male, Susan = Female, Chris = Either
  • casing rules e.g. PO Box, IBM, 360Science
  • expansion/contraction of abbreviations and correction of typing errors e.g. Svcs = Services, Finacial = Financial
  • attributing type to these and other words e.g. Mr = Prefix, Ltd = Business, FL = State, The = Noise.
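
The roles listed above can be pictured as attributes attached to each word. The sketch below uses an in-memory dictionary populated with the examples from this list; it does not read or reflect the actual NAMES.DAT file format, and the function names are hypothetical.

```python
WORD_TABLE = {
    "TONY": {"equivalent": "ANTHONY"},
    "JOHN": {"gender": "Male"},
    "SUSAN": {"gender": "Female"},
    "CHRIS": {"gender": "Either"},
    "SVCS": {"equivalent": "SERVICES"},
    "FINACIAL": {"equivalent": "FINANCIAL"},
    "MR": {"type": "Prefix"},
    "LTD": {"type": "Business"},
}

def lookup(word, attribute, default=None):
    """Return one attribute of a word, or `default` if the word or attribute is absent."""
    return WORD_TABLE.get(word.strip().upper(), {}).get(attribute, default)

def equivalent(word):
    """Return the matching equivalent of a word, or the word itself (upper-cased)."""
    return lookup(word, "equivalent", word.strip().upper())
```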


 

 

Generate tab

 


 

 

Generate name options

Default gender

The Default Gender property is the gender to assume when the matchIT API can't determine whether the name is male or female e.g. Chris Smith, C Smith.

Use equivalent name

If enabled, the input first name is replaced with its equivalent from the NAMES.DAT file.

Enhanced double barrelled lookup

When enabled, this setting will cause an unrecognised middle name to be considered part of a non-hyphenated double-barrelled last name.

Process blank last name

With this setting enabled, a blank lastname will cause extra processing to be performed on other input data to help detect typographical errors.

Parse name elements

When enabled, this will cause input name elements (including prefix, firstnames, and lastname) to be parsed.

Detect inverse names

Attempt to identify addressee names that have been specified with the lastname preceding the firstnames, provided a comma delimiter follows the lastname (for example, "Smith, John" where Smith is the lastname).
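
A minimal sketch of the comma-based detection described above. The function name and the returned field names are hypothetical, and real name parsing would need to handle prefixes, suffixes, and multiple commas.

```python
def split_inverse_name(addressee):
    """Split 'Lastname, Firstnames' into parts; return None if no comma delimiter is found."""
    if "," not in addressee:
        return None
    last, first = addressee.split(",", 1)
    last, first = last.strip(), first.strip()
    if not last or not first:
        return None
    return {"lastname": last, "firstnames": first}
```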

Parse as normalized name

Addressee names are assumed to be in a delimited normalized format.

 

Generate company options

Use equivalent name

If enabled, the equivalents (according to the NAMES.DAT file) of words indicating a business name, such as "Motors" or "Services", are included in the normalized organization name and the corresponding phonetic keys.

Normalization truncation

If enabled (non-zero), and the organization consists of more than three words, then the third element of the normalized organization name will be truncated to the first N characters of each word after the first two (where N is the value of this setting).
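
The truncation rule can be illustrated with a small sketch. It assumes the normalized organization name has three elements (the first word, the second word, and everything after), that upper-casing is the only other normalization, and that the truncated words are joined with spaces; none of these details are confirmed by the description above.

```python
def normalize_company(name, truncation=0):
    """Split an organization name into three elements, truncating the third if configured."""
    words = name.upper().split()
    name1 = words[0] if len(words) > 0 else ""
    name2 = words[1] if len(words) > 1 else ""
    rest = words[2:]
    if truncation > 0 and len(words) > 3:
        # Keep only the first N characters of each word after the first two.
        rest = [w[:truncation] for w in rest]
    return name1, name2, " ".join(rest)
```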

Ignore parentheses

With this property enabled, any words that are enclosed with parentheses within an organization name will be excluded from the phonetic organization keys.

Ignore trailing post town

Exclude any trailing post town from the phonetic organization keys.

 

Generate address options

Verify postcode

If enabled, verifies and corrects the format of the postcode.

Default street line

This property is used when generating a phonetic address key, to indicate the position of the street and the town in the address.

Lines to scan

This property enables personal names to be extracted from address lines. It can be set to 1 or 2.

Premise first

Indicates whether to expect the premise or flat number to come first in address lines.

 

Generate options

Drop excluded words

When enabled, flag any records that contain exclusion words in any of the key fields.

Consider casing

When enabled, consider the casing of the incoming data when splitting the data up for extracting keys, proper casing, and so forth.

Variable keys max length

Specifies the maximum length of various variable-length phonetic keys.

 

Compare tab

 


 

 

 

Phonetic compare options

Algorithm

The phonetic algorithm used when scoring. There are five choices available: soundIT, Loose_SoundIT, Dynamic_SoundIT, Soundex, None.

(Algorithm) for first name

Optionally specify a different algorithm to compare first names (‘none’ means use the same as the main Algorithm setting).

Loose threshold

When the Dynamic_SoundIT algorithm is in use, this property controls the threshold at which soundIT is switched to Loose soundIT.

Fuzzy compare options

Algorithm

The fuzzy algorithm used when scoring. There are two choices available: matchIT_Fuzzy, Damerau_Levenshtein.

Maximum edit distance

The maximum number of differences between the two strings. (Applicable to Damerau_Levenshtein only)

Minimum score

The minimum fuzzy score. (Applicable to Damerau_Levenshtein only)
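
For readers unfamiliar with Damerau-Levenshtein, the sketch below implements the common restricted variant (optimal string alignment), which counts insertions, deletions, substitutions, and transpositions of adjacent characters. Whether the product uses this variant or the unrestricted one is not stated here, and how it converts a distance into a fuzzy score is not shown, so only the maximum edit distance check is illustrated.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]

def within_max_distance(a, b, max_edit_distance):
    """True if the two strings differ by no more than the configured number of edits."""
    return osa_distance(a, b) <= max_edit_distance
```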

 

Address compare options

Match box number and postcode

When enabled, two compared addresses score Sure if they contain matching postal box numbers and postcodes.

User premise range

When enabled, this will allow addresses to contain premise ranges.

Loose fuzzy premise match

When enabled, additional fuzzy premise matching is performed.

Match delivery points

When enabled, this will prevent two addresses from matching when both contain two postal codes but different delivery point codes and the addresses score below the minimum threshold.

Match DP threshold

See match delivery points, above.

Default DPSs

See match delivery points, above.

Ignore premise suffix

When enabled, this will allow two premises to match regardless of whether one or both has an apartment- or flat-type suffix (for example, 12 and 12a).

 

Name compare options

Prevent mrs matching miss

When enabled, two compared names will not match if one has a title of Mrs and the other a title of Miss.

Fuzzy match non-normalized names

When enabled (the default), this will cause additional matching checks to be performed on names using the non-normalized name matching fields.

Blank name company matching

When two records contain no addressee names, this setting will allow the names to achieve a score depending on what's available in the job title and company name fields.

  • 0 - Off
  • 1 - On if either name is blank
  • 2 - On when both names are blank

Initial match equivalent

Controls how an initial matches a name that's equivalent to the given firstname. For example, when comparing Rebecca Smith and B Smith, then the B could be considered a match for Becky, which is a common abbreviation (or equivalent) of Rebecca.

Cross match initial to name

When enabled (the default), and the first letter of a firstname matches the middle initial (for example, "Richard Smith" and "John R Smith") then the names will be considered a possible match.

Fuzzy match initials

Controls how similar-sounding initials (M/N, S/F, and G/J) can be matched. When set to 'full' (the default), then one name's initial is permitted to match the first letter of the other name's firstname (for example, "M Smith" versus "Neil Smith"). When set to 'initialsOnly', then only initials are permitted ("M Smith" versus "N Smith"). A setting of 'noMatch' disables such matches.
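
The three modes can be sketched as follows. This is an interpretation of the description above, not the engine's code: the function name is hypothetical, a one-character firstname is treated as an initial, and 'full' is assumed to also permit the initial-versus-initial case.

```python
SIMILAR_INITIALS = {frozenset("MN"), frozenset("SF"), frozenset("GJ")}

def fuzzy_initials_match(first1, first2, mode="full"):
    """True if similar-sounding initials may be matched under the given mode."""
    if mode == "noMatch":
        return False
    a, b = first1.strip().upper(), first2.strip().upper()
    if not a or not b:
        return False
    if frozenset((a[0], b[0])) not in SIMILAR_INITIALS:
        return False
    if mode == "initialsOnly":
        return len(a) == 1 and len(b) == 1  # both sides must be initials
    return len(a) == 1 or len(b) == 1       # 'full': at least one side is an initial
```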

Initial match forename

Controls the result achieved when an initial matches the first letter of a firstname. This defaults to 'equal', so that B Smith versus Bob Smith will achieve the same result as Bob Smith versus Bob Smith (i.e. 'equal' for the firstnames). Reducing this setting to 'approx' or 'contains' will reduce the resultant name score in order to distinguish such matches.

Fuzzy match forename

Used to prevent different recognized firstnames from fuzzy matching. For example, ordinarily Ron and Roy will fuzzy match.

 

Resources tab

 


 

Threads

By default, or if 0 is specified for the number of threads, each engine will use all available processor cores.
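
The "0 means all cores" convention is common, and can be sketched in a few lines. The function name is hypothetical, and `os.cpu_count()` is used here only as a stand-in for however the engine detects available cores.

```python
import os

def effective_threads(configured=0):
    """Resolve a thread setting: 0 (the default) means use every available core."""
    if configured > 0:
        return configured
    return os.cpu_count() or 1  # cpu_count() can return None on some platforms
```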

Debug log

Should a process that uses matchIT Hub unexpectedly crash, the engine can be configured to create a debug log of all data loaded and all operations performed on the data.

Memory usage

Input buffer

All data added is initially stored in the input buffer. The processing threads remove this data from the input buffer for processing.

Output buffer

Similarly, all results are written to the output buffer. The application must remove results from the buffer to prevent it from becoming full.

Block size

Every item of data, once it has been removed from the input buffer and acquired by a processing thread, is stored internally in blocks along with other items of data.

Cache limit

When a block is full, it's added to the fast cache. When the cache becomes full (if the cacheLimit is not 0) then archiving will begin.

Threshold

When the memory usage of the running process exceeds the threshold, blocks will be moved from memory to the temporary disk paging file.

Compression level

The compression level can be 0 for disabled, or 1 (fastest compression) to 9 (slowest/best compression).

Encryption

The encryption key size can be 128, 192, or 256. Encryption of memory-resident data should not normally be required, but can be enabled if necessary.

 

Disk usage

Location

Specifies the directory in which a temporary disk paging file will be created, should the process's memory usage exceed the threshold (refer to Memory usage, above).

Limit

A nonzero limit can be specified, in which case the process will be aborted should the disk file's size exceed the limit.

Compression level

The compression level can be 0 for disabled, or 1 (fastest compression) to 9 (slowest/best compression).

Encryption

The encryption key size can be 128, 192, or 256. A key size of 256, for maximum security, is highly recommended.