mSQL - SSIS Tutorial - Find Matches / findoverlap more in depth, weights and constraints

Previously we set up a find matches with the defaults, all of below also applies to the findoverlap task

Let's jump into that setting and some advanced options, such as getting more granularity in our results, using non-default keys, changing the minimum score, and adding a weight to an additional field like telephone and email.

Before you can incorporate telephone or email into a match for example, you need to map it back in generatekeys, so if you haven't already, then go back and map those fields before getting started. If you're using our example1 data, you can just map telephone.

 

1. open up find matches – click show advanced options if its not already ticked.

The first thing we recommend is clicking component scores, for whatever level you’re on, we suggest breaking out the scores for the individual elements
for individual or family, it's name, address, postcode
for household, it's just address and postcode

For business, it's organization, address and postcode

You can also show the score for any additional fields as well, such as telephone or email or a customfield if you plan on incorporating it into your match.

 

 

2. Next, let's go to the keys tab, there’s two things you could do

A. Tighten up existing keys – if you find matching is running too slow, it could be because you have inefficient keys, you could make the keys more stringent by forcing the zipcode to start out the same for them to be compared in the first place, in our example below we added left(mkPostOut,3) to the 2nd and 3rd keys.

alternatively, the simplest approach may be going back to the previous screen, clicking high data volume, then returning to the keys screen and refreshing, then we'll fill in some tighter default keys for you and overwrite the existing ones.

B. Alternatively, you could add more keys if you think we’re missing matches, by adding a key like email you would force us to compare all records with the same email

with telephone – use the mkfields, as those are already standardized.



 

The default keys work for our clients about 85% of the time, otherwise normally a subtle change like adding one or two keys, or modifying an existing key is all that’s necessary.

 

Remember, our matching works in a two step process

First we line it up on the keys, the keys only control what we compare in the first place, because comparing every record to every single other record is inefficient. Then as a second step we score the matches and test them against our constraints before we present it to you.

 

3. Now lets go to the advanced tab

Here you’ll see the weight for whichever level you chose previously. It is possible to run multiple levels at the same time, although for simplicity we will suggest running one level at a time, there are other considerations then that you may need to go back and adjust your exact matching keys as well if you do want to run multiple levels at the same time.

 

 

By default, something has to score 80 in order to count as a match on individual, business, or family level, or at least 50 for a household match, or 60 for a name only/company only match

 

So if you want to get less fuzzy matches, you could raise the minimum score to 86 or 95 for example, wheras if you wanted more fuzzy matches you could lower the score to 71  (we wouldn’t suggest lowering the minimum to 70 or below as that will likely be too fuzzy)

for household, you may raise the minimum to 55, or could lower it to 40 for fuzzier matches.

for name only/company only(which would be used instead of individual/business), 60 is the default high score, raising the minimum any higher will prevent all fuzzy matches from showing up, but you could lower it to 40 or 25 to let in fuzzier matches.

 

4. The next thing is weights/ constraints (in below screenshot we're using business level )

Lets start out with the 3 most common constraints –

Here you’ll see constraints along the top - Must match location, must match premise, and no one empty premise

 

 

Example 1 – must match location
So let's say you have 2 records from Starbucks in New York City

Company Address City State zip
Starbucks 1 main street New York NY  10001
Starbucks PO BOX 123 New York NY 10003

By default – we’re going to keep these separate – that’s the ‘must match location’  setting – this says mainly – "don’t cross match streets to PO Box’s"


If you want matches like that – then you could turn off must match location (but that may let in false matches , we leave it on by default for most matching levels)

 

Example 2 – must match premise

Let's say you have two Starbucks records

Company Address City State zip
Starbucks 1 main street New York NY  10001
Starbucks 42 main street New York NY 10001

 


Starbucks , 1 main street , New York NY  10001
Starbucks ,  , New York NY  10001

This would make it past ‘must match location’ – because they’re both on the same street, and maybe someone typo’d this.

 

If your data is highly concentrated although, this likely isn’t a typo

When we generate keys, we extract the premise (or house number) – so the premise for these records would be 1 and 42 respectively.

 

By clicking must match premise – we will not record this as a match – as 1 doesn’t not equal 42

 

This is normally the easiest way to reduce false matches if you want to automate your process, we’d suggest this as the first change to make if you see matches like above and don’t want them.

  

Example 3 – no one empty premise

 

Company Address City State zip
Starbucks 1 main street New York NY  10001
Starbucks main street New York NY 10001
Starbucks 42 main street New York NY 10001

 

Here – the main street record matches 1 main street, and the main street record matches 42 main street, even though we won’t link the  1 main street to 42 main street directly because of the previous constraints, they are bridged together into one group by the record missing the premise information.

If you see matches like this, we suggest enabling the no one empty premise

These are the 3 most common changes users will make, its much easier than playing with the keys.  This all applies to individual level as well if you were to use a person's name instead of a company name.

For family level, we'd suggest enabling must match premise as well in most cases.

For household level, we enable more constraints by default since we remove the emphasis on name when it comes to scoring.

 

 

5. Modifying the Weights

Just as an intro, we’ll add a small weight to telephone and email for example.

We suggest leaving the weights mostly alone, or only making subtle changes until you understand them more . We’ll get into the scoring more later on, or feel free to reach out to support for more advanced questions.

So, in this case, I’m going to add a weight of 9 to the sure score for both email and telephone.

By adding a weight to the sure score, they will get those points if both records have that field exactly the same, this is a cumulative score so the email score would be added to what it scored on name/company/address/zip.

If you add a weight to the likely or possible scores, that will allow matches that have one or two transpositions respectively. At this point we would recommend not adding a likely or possible weight to telephone, as that could lead to false matches possibly as phones are less unique than email frequently.

The one empty and both empty scores are if the telephone or email are not populated in one or both records.  Some people like to identify that kind of information pattern and may put a small weight of 1 or 2 points in one empty/both empty. You can only see this information if you broke out the component scores like advised at the beginning.

As a note, you can also apply a negative weight if you wanted records to not match based on some field being the same.

 

 

Click ok and save when done.

 

 



Was this article helpful?
0 out of 0 found this helpful

have a question or not finding what you're looking for?

Submit a ticket to get some help