mSQL - SSIS Tutorial - findoverlap vs findmatches

Find Overlap is for matching two separate data sources.  So you would be setting up two generate keys, pointing to your two separate tables.

 

First, let's go over the difference between an overlap vs. an internal dedupe (the findmatches/groupmatches tasks) as this is a common point of confusion.

 

When you’re matching internally (findmatches), you have these 4 records for example

 

1

W R Dayton Jr

Palmer Air Charters

Suite 106

7350 Airport Rd

Wilmington

28401-4273

2

Bill Deighton

PAC

735 Airport Rd #106

Wilmington

NC

28401

3

Mr. W.Datyon

Palmr Air

7350 Airport Road

Ste 106

Wilmington

28410

4

William Deighton

Palmer Air Charters, Inc.

7350 Airport Rd

Wilmington

NC

 

 

These 4 records may not link directly, but if we know 1 matches 2, and 2 matches 3, and 3 matches 4,  then we can determine that all four go together.  And when we link them we give a base score – which is the weakest link in the set.  When you’re doing an internal dedupe its all about gaining that single customer view.

Now that is good for an internal dedupe.

 

Although when overlapping (findoverlap) it’s a different approach, as opposed to linking records together and choosing the weakest link as the baseline, you want to instead choose the best match

So instead let's say we had two tables

Table1 lets say is your existing customer table, and has 3 separate records (this may indicate that it should be deduped, but that may be a separate project)

W R Dayton Jr

Palmer Air Charters

Suite 106

7350 Airport Rd

 

 

Bill Deighton

PAC

735 Airport Rd #106

Wilmington

NC

Mr. W.Datyon

Palmr Air

7350 Airport Road

Ste 106

Wilmington

And we have table 2 – lets say this is a feed of new sales

William Deighton Palmer Air Charters, Inc. 7350 Airport Rd  Wilmington  
John smith LeDroit Park 416 FLORIDA AVE NW WASHINGTON, DC 20001  
Smith, John Park Office at ledroit 416 FLORIDA AVENUE NORTHWEST WASHINGTON, DC 20001

Now, we can only link that incoming record to one of our three existing customers. 

So instead what we’ll do is findoverlap , it will compare the 3 customer records against our incoming table.  In this case – we may have found 2 or 3 separate matches, but in this case – it’s about picking the best match, that way we don’t make a new customer record when we shouldn’t.

But we still have the issue that we’re attempting to link 3 new sales, but only caught 1.  So John Smith placed 2 orders, but really should only have 1 account.   So, in this case, you should take the records that were not matched in your overlap, and possibly run them through a dedupe internally as well.   That way we only create one new customer account, instead of 2.   

You shouldn’t rely on just overlap, or just findmatches necessarily.   Ideally a combination of both will get you the best end result. It's just a matter of figuring out the order that works best for you.

We do recommend before running an overlap, that at least one of the tables would ideally be deduped internally.  Otherwise, you can end up with many to many relationships