mSQL - SSIS Tutorial - Incorporating Exact Overlap

Why should I run an exact overlap match?

  • You have a high duplication rate – exact matching is a way of getting ‘easy’ matches out of the way
  • more granularity in results – exact matching provides an extra tier of scoring, so even if a match scored 130 (the highest score with our defaults) – an exact match would be an even better match than that, instead of being mixed in
  • you want to match on a single field, such as an exact email match, or an account number or invoice

 

Why shouldn’t I run an exact match

  • exact matching selects all the columns and their data into memory, this may be slow depending how large your table is and how much tempdb space is allocated on your SQL instance . Or if there’s a large difference in table QTY – such as matching a 120 million record file against an incoming feed of only 20,000 records, you’re likely better going straight to fuzzy matching (find overlap).
  • you have lots of blank fields or junk in your data, in which case the exact matching will skip over most matching due to the columns not being ‘optional’ by default, or may over report matches that normally would be skipped, we expect the data to be well populated by default.
  • There are inconsistent table structures
    most clients prefer to do an exact match on the source data, so if one table has first/last , and the other has fullname, they won’t line up well. Similarly, if one table has 2 address lines, vs 4 address lines – unless you run it through the gencorrectedaddresses first, it won’t line up.

    You can use the default exact key in that case, or a mixture of source and key fields, although that would only be suggested if you have a high duplication rate.

 

Exact matching isn’t always necessary but is normally suggested as it’s an easy way of improving performance in most cases.

Ideally at least one of the tables are deduped internally first.

Let's add exact matching to a process we set up previously.

 

1. Exact matching should be run between the generate keys(x2) and findoverlap – there are two tasks that would need to be added

Findexactoverlap
Groupexactoverlap

If you're adding exact matching to an overlap process that was already set up, then please follow all steps below. Otherwise if you're building the package from the first time then just ignore steps 2 and 10.

 

2. Delete the connector between generatekeys and findmatches if you're editing an existing task

 

 

3. Drag and drop the tasks and connect them, it should be like below

 

 

you can align/rearrange using the format menu above if you highlight the bottom 4 tasks, make them the same width then center them if you want to make it look nicer

 

4. Now open up the exact overlap task

 

By default, we’re going to choose the larger table as the source, and the smaller table as the overlap table. 

 

 

If your tables are empty or the same qty, you may need to choose the main and overlap data source yourself, or you can switch the main and overlap at this point, once you choose it here, it will then flow down through the other tasks.

 

5. When running an exact overlap by default we’re only going to have one to one or one to many relationships written out. This is different behavior from the findoverlap/findmatches tasks as those will write out many to many relationships, depending on your process, this may be something you should be aware of.

 

6. Now let's look at the keys by clicking next or clicking keys up top or next on the bottom.

 

When you click the key up top, that lists the selected key in the bottom half.

 

 

Think of the list of keys within a single key as an AND  statement, all those fields have to line up exactly for it to count as a match.  It’s a simple yes/no when it comes to exact matching.

Whereas if you were to click up top to add a key, that would be like an OR  statement.

 

7. If your source data is different between the two tables such as with our example tables, then you’re likely best off sticking with the default key for an overlap. 

Otherwise, if your tables are both standardized in the same manner, then we’d suggest using the source fields, similar to previously suggested in the findexactmatching tutorial.

You can use a combination of key fields (anything that starts with an MK*)  and source data, but that will be less efficient as we’ll have to select across multiple tables.

So for example if you ran both of our example tables through address validation, then you could use the address lines and city/state/zip from the corrected tables where they'd line up, but then use key fields for the name such as mknormalizedname and mkgender.

 

8. Once you have your key set up, then you can save the task.

 

9. Now simply open up the groupexactoverlap and save it.  It automatically reads in from the previous task

 

 

10. If  you added exact overlap to an existing task, then we suggest checking the findoverlap and setting it to read in from the groupexactoverlap – also ensure that  ‘exclude exact matches’  is ticked,  otherwise any efficiency gains will be lost.

otherwise continue to set up the findoverlap/groupoverlap as covered previously if this is a new one.



 

 

 

Was this article helpful?
0 out of 0 found this helpful

have a question or not finding what you're looking for?

Submit a ticket to get some help