Keys and Scores - An Introduction to Matching

We split our matching into a two-step process: we use keys to create candidate groups, then we use scoring to control what we present to you. At its core, what does this mean for you?

More good matches, fewer false positives.

 

Match Keys

The mAPI allows you to generate phonetic “match keys” from names and addresses, as well as non-phonetic standardized data.  In this context, a match key is something that groups of records within the database have in common, which indicates that detailed comparison of the records is worthwhile to see how well the records match each other in other respects.

In an ideal world, a user would be able to compare every single record in a database with every inquiry, or with data entered on a data capture screen, or with every other existing record in the database. For internal matching on a 10,000-record database, this would mean roughly 10,000 x 5,000, i.e. 50 million, comparisons. Even on this fairly small database, the processing time would be far too long. For inquiries or data entry, looking at every record on the existing database would take far too long for a real-time application.
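The estimate above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name is made up for this example and is not part of the mAPI.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct record pairs in a file of n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(pairwise_comparisons(10_000))  # 49995000, i.e. roughly 50 million
```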

To reduce the number of comparisons, we can look at a field and say that only records that have the same value in that field are potential matches. For example, we could select the Lastname field (if the last name is held separately from the title and first names or initials), and only compare records where people have the same last name. This would dramatically reduce the number of comparisons we would have to do because, for example, we would not compare a record for Mr Smith to a record for Mr Jones. We would compare all the Smiths, because two records for Mr Smith might be for the same person. In this approach, we have used the Lastname field as a match key.
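As a rough sketch of this idea, the snippet below groups a handful of invented records by last name and only pairs up records within the same group. The record layout and the candidate_pairs helper are assumptions made for illustration; they do not represent the mAPI interface.

```python
from collections import defaultdict
from itertools import combinations

# Invented sample data; field names are hypothetical.
records = [
    {"id": 1, "lastname": "Smith", "firstname": "John"},
    {"id": 2, "lastname": "Jones", "firstname": "Mary"},
    {"id": 3, "lastname": "Smith", "firstname": "Jon"},
    {"id": 4, "lastname": "Smith", "firstname": "Alice"},
]


def candidate_pairs(records, key_func):
    """Group records by match key and return the pairs within each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_func(rec)].append(rec["id"])
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Use the (normalized) Lastname field as the match key.
pairs = candidate_pairs(records, lambda r: r["lastname"].strip().upper())
print(pairs)  # [(1, 3), (1, 4), (3, 4)]
```

The Jones record is never compared with any Smith record, so four records produce three comparisons rather than six.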

This approach is an improvement on our original idea, but it has its limitations. For example, take the name Deighton: people may expect it to be spelt Dayton, as this spelling is far more common than Deighton. The two spellings sound the same, but our solution above would not compare Mr Dayton and Mr Deighton, because the values in the Lastname field are different. The obvious solution to this is to use a "sounds like" version of the Lastname field as a phonetic match key. The mAPI takes important fields (such as name, address, company), and generates phonetic versions of the key elements in those fields.
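To make the "sounds like" idea concrete, here is a deliberately simple toy key. It is not the algorithm the mAPI uses (which is not described here); the rules below (keep the first letter, treat "gh" as silent, drop vowels and Y/H/W, collapse doubled letters) are assumptions chosen only so that the example names come out the same.

```python
def toy_phonetic_key(name: str) -> str:
    """Toy "sounds like" key for illustration only; NOT the mAPI algorithm."""
    letters = "".join(ch for ch in name.upper() if ch.isalpha())
    if not letters:
        return ""
    first = letters[0]
    rest = letters[1:].replace("GH", "")  # treat "gh" as silent, as in Deighton
    key = first
    previous = first
    for ch in rest:
        is_repeat = ch == previous  # doubled letter, e.g. the second T in "Mattson"
        previous = ch
        if is_repeat or ch in "AEIOUYHW":
            continue
        key += ch
    return key


print(toy_phonetic_key("Deighton"))  # DTN
print(toy_phonetic_key("Dayton"))    # DTN
print(toy_phonetic_key("Jones"))     # JNS
```

With a key like this, the Deighton and Dayton records fall into the same candidate group even though the Lastname values differ.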

However, even this approach still gives us too many records to compare, most of the time. For example, if you are comparing records across the whole of the US, there are a lot of Smiths – in a file of 100,000 records there would on average be nearly 750 Smiths, which would involve over a quarter of a million comparisons if processing internal matching of the database, or 750 real time comparisons for inquiry or data entry. To get round this, we use combinations of fields to narrow the search – Mr Smith in Boston is obviously not the same person as Mr Smith in Dallas. In the example above, we may choose something like phonetic key of last name plus part or all of the postal code – so now we only compare two records for Smith if they live in the same area. This is a more explicit match key than just phonetic last name.
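The sketch below checks the Smith arithmetic and shows one possible way of building a combined key from a phonetic last name and part of the postal code. The composite_key function, the three-character prefix and the "SMT" key value are all hypothetical choices for this example, not mAPI defaults.

```python
def composite_key(phonetic_lastname: str, postal_code: str, prefix_len: int = 3) -> str:
    """Combine a phonetic last-name key with the leading part of the postal code."""
    area = postal_code.replace(" ", "").upper()[:prefix_len]
    return f"{phonetic_lastname}|{area}"


smiths = 750
print(smiths * (smiths - 1) // 2)     # 280875 pairs among 750 Smiths

# "SMT" stands in for a precomputed phonetic key of Smith (assumed value).
print(composite_key("SMT", "02108"))  # SMT|021  (a Boston ZIP code)
print(composite_key("SMT", "75201"))  # SMT|752  (a Dallas ZIP code)
```

Because the two keys differ, the Boston and Dallas records are never compared.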

The last point to consider is that, when looking for matches, you can never rely on any single item of data always being the same in order to find all the matching records there are. You need to do at least two scans of the database using different match keys. If, for example, two records were identical except that one record did not have a postal code, you would miss the match if you relied just on phonetic key of last name plus the postal code, because the records would not be compared unless the postal code as well as the phonetic last name were the same. However, if the street in both the records is the same (which is quite likely, if they are really matching records), then a second match key of phonetic last name plus the phonetic key of the most significant word(s) in the street will find the match on the second scan of the database. For data entry, this may mean that a potential duplicate of a new record would not be detected until the address has been entered, if the postal code held on the record on the database is different from that being entered, but most duplicates would be detected before entering the address.
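A minimal sketch of the two-scan approach follows. The records, the precomputed key values and the pairs_for_key helper are invented for illustration, and skipping records with a blank key component is an assumption of this sketch rather than documented mAPI behavior.

```python
from collections import defaultdict
from itertools import combinations

# Invented records with precomputed (hypothetical) phonetic keys.
records = [
    {"id": 1, "name_key": "DTN", "postal": "02108", "street_key": "BCN"},
    {"id": 2, "name_key": "DTN", "postal": "",      "street_key": "BCN"},
    {"id": 3, "name_key": "DTN", "postal": "75201", "street_key": "ELM"},
]


def pairs_for_key(records, fields):
    """One scan: pair up records that share the same value in every key field."""
    groups = defaultdict(list)
    for rec in records:
        values = tuple(rec[f] for f in fields)
        if all(values):  # skip records missing any part of this key
            groups[values].append(rec["id"])
    found = set()
    for ids in groups.values():
        found.update(combinations(sorted(ids), 2))
    return found


pass_1 = pairs_for_key(records, ("name_key", "postal"))      # name + postal code
pass_2 = pairs_for_key(records, ("name_key", "street_key"))  # name + street
candidates = pass_1 | pass_2

print(pass_1)      # set()
print(candidates)  # {(1, 2)}
```

Record 2 has no postal code, so the first scan misses it; the second scan, keyed on name plus street, brings records 1 and 2 together for detailed comparison.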

For most data files, we recommend three match keys - for US data, our third default key is a phonetic key of city and street together with the street or apartment number. For UK data, it is the postal code on its own. These keys usually give reasonably small groups of records to compare and allow non-phonetic last name matches to be detected. For example, you could pick up a match of Wilson and Wislon, as long as the postal code in the two records is identical and the address is effectively the same.

 

Match Scores

Having established that two records have the same match key, or that a record being entered or sought has the same match key as an existing record on the database, you can use the mAPI to go through the data, field by field, and work out how similar they are. Each field can contribute to a match score, depending on how similar those fields are between the two records, or between the new data and the existing record. At the end, we have an overall score that tells us how alike two records are – by default, the higher the score, the more similar the records are.

For batch processing, when flagging duplicates or merging files, you can ask the user to enter a threshold score, above which your system will automatically flag one record from each matching pair. With most data files, all pairs scoring above a particular score will be genuine duplicates ("true matches") and anything below a lower score will be "false matches". This leaves a "gray area" between these two scores where most of the pairs are true matches but some are false. For a cautious approach ("underkill"), you can therefore enter a higher score as the threshold score for deletion. For overkill, you can enter a lower score. Alternatively, you can go for underkill and then visually inspect the matching pairs in the gray area; there are usually relatively few in this area.
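The two-threshold idea might be sketched as follows. The threshold values (60 and 80), the scores and the classify function are invented for this example; suitable thresholds depend on the data file and on how the scoring has been tuned.

```python
def classify(score: float, lower: float, upper: float) -> str:
    """Sort a scored pair into one of the three bands described above."""
    if score > upper:
        return "true match"   # safe to flag automatically
    if score < lower:
        return "false match"  # ignore
    return "gray area"        # candidate for visual inspection


# Hypothetical scored pairs: ((record id, record id), match score).
scored_pairs = [((1, 2), 92), ((3, 4), 71), ((5, 6), 40)]

for pair, score in scored_pairs:
    print(pair, score, classify(score, lower=60, upper=80))
# (1, 2) 92 true match
# (3, 4) 71 gray area
# (5, 6) 40 false match
```

Under the cautious approach, only the "true match" band is flagged automatically and the gray-area pairs are left for review.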

This process works well, but it is essential that system designers can tune it themselves. This is because matching requirements can vary from company to company, file to file and even job to job. For example, sometimes you want to match at individual level, sometimes at company or family or address. In addition, data files vary widely in their structure and in the overall "shape" of the data: sometimes postal codes are reliable and low level, sometimes they are unreliable, or they only indicate the town/city and not the street.

Because we know that everyone's data is different, we have allowed the way that two records are compared to be customized, using a parameter table that tells us how much each field contributes to the overall matching score. Using this table, we can tell the mAPI how important each field is in the matching process. We call this the Weights table, as it reflects the relative weighting that each field has towards the total match score. For name and address matching, the mAPI compares the elements of the name (or address) as a whole, rather than just comparing them element by element - this allows it to match names where some of the components are omitted or in a different order in one record (e.g. John Michael Smith and Mike Smith) or addresses which have a house or building name in one record but not in another.
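To illustrate how a table of weights could drive the overall score, here is a small sketch. The weight values, field names and similarity functions are assumptions made for this example and do not reproduce the mAPI Weights table or its comparison logic. In particular, this toy name comparison handles reordered and omitted words but knows nothing about nicknames such as Mike for Michael.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to the total score.
WEIGHTS = {"name": 50, "address": 30, "postal": 20}


def text_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0; a blank value never counts as a match."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def name_similarity(a: str, b: str) -> float:
    """Compare names as a whole: each word of the shorter name is paired with
    its best match in the longer name, so word order and omitted words
    do not reduce the result."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    best = [max(text_similarity(t, u) for u in longer) for t in shorter]
    return sum(best) / len(best)


def match_score(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-field similarities."""
    score = 0.0
    for field, weight in weights.items():
        compare = name_similarity if field == "name" else text_similarity
        score += weight * compare(rec_a.get(field, ""), rec_b.get(field, ""))
    return score


a = {"name": "John Michael Smith", "address": "12 Beacon Street", "postal": "02108"}
b = {"name": "Smith John", "address": "12 Beacon Street", "postal": "02108"}
c = {"name": "Mary Jones", "address": "400 Elm Avenue", "postal": "75201"}

print(match_score(a, b))                    # 100.0
print(match_score(a, c) < match_score(a, b))  # True
```

Changing the values in WEIGHTS changes how much a disagreement in each field costs, which is the kind of tuning the Weights table is meant to allow.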