mSQL - SSIS Tutorial - Incorporating exact matching

Exact matching isn’t always necessary but is normally suggested as it’s an easy way of improving performance in most cases.

 

When should I incorporate exact matching?

  • You have a high duplication rate – exact matching is a way of getting ‘easy’ matches out of the way
  • more granularity in results – exact matching provides an extra tier of scoring, so even if a match scored 130 (the highest score with our defaults) – an exact match would be an even better match than that, instead of being mixed in
  • you want to match on a single field, such as an exact email match, or an account number or invoice that would normally not qualify as a match from a fuzzy perspective

 

Why shouldn’t I run an exact match

  • exact matching selects all the columns and their data into memory, this may be slow depending how large your table is and how much tempdb space is allocated on your SQL instance
  • you have lots of blank fields or junk in your data, in which case the exact matching will skip over most matching due to the columns not being ‘optional’ by default, or may over report matches that normally would be skipped, we expect the data to be well populated by default.

As a note, anything found by an exact match is removed for processing for the fuzzy match, so if you're too aggressive with the exact match by say doing an exact match on email, you could just end up introducing false positives if your data isn't clean. We normally recommend a single exact match key, that encompasses all the significant data points.

Let's add exact matching to a process we set up previously.

 

1. Exact matching should be run between generate keys and findmatches – there are two tasks that would need to be added

Findexactmatches
groupexactmatches


 

2. Assuming you're editing the package from the previous article, the first thing you'll need to do is delete the connector between generatekeys and findmatches

 

 

3. Drag and drop the tasks, connect them in the order like below. You can highlight and arrange using the format menu above, you make them the same width then center them, although if you want to rename the tasks you should do that before.

 

 

4. Open up findexactmatches – notice it reads in the task from the previous source

 

5. If you want to use the default key, you can just save and skip to step 12,

but if your goal is more granularity, I would suggest altering the existing key

 

 

Notice up top that a single key is listed, unlike the fuzzy keys where its only one or two fields, this concatenates a majority of the key fields. 

Anything that starts out with an mk  is a matchkey field, and has already been standardized by mSQL.  You can think of the list of columns within a single key as ‘AND’ statements,  whereas having multiple keys up top are like “OR”  statements

 

6. In this case we’ll stick with the single key, but we’re going to alter the default key to use your input data instead (all those columns you mapped during generatekeys)

 

Go through and alter the default key to use a majority of you input fields, when you click on the key up top, it lists the individual fields on the bottom, you can then replace those mkfields with your input fields.

 

NOTE: IF YOU SEE TOWN/REGION POSTCODE – those are equivalent to CITY/STATE/ZIP


If you’re doing business level matching, you only need to map organization and any address lines and town/region/postcode

If you’re doing individual level matching – you only need to match fullname (and/or  prefix/first/initials/last/suffix )  and any address lines/town/region postcode.

 

7. Once you’ve chosen all your keys, delete any extra ‘mk’  fields, then apply the PUNTRIM function to all the keys

this helps with simple things like 

123 MAIN ST APT 1-A
matching to 
123 main st. apt1a

which would be missed by standard SQL query



 

8. Your input data isn’t standardized which is why we suggest applying the PUNTRIM function, whereas an ‘mk’ field has already been standardized by mSQL and shouldn't need a function applied to it.

 

9. You can also set the ‘optional’ selection to yes or no, this is basically stating whether it's optional that the field is populated, so in this case I would consider address2  optional – as that isn’t always filled in because not every record has a suite,  but fields like the name and zip are not optional, as those are required to count as a match.

 

10. If you wanted to do an exact email match or additional exact match keys, you could also add that – but we’d suggest you QA your results if that’s the case, or at least not choosing  yes on the ‘optional section’, having a match of email only may cause false positives if you had say two different people with an info@360Science.com , so you'd either stick to the one exact match key and more easily identify that during fuzzy matching(reccomended), or make the key more concise by adding name in addition to email as part of the exact key.



 

11. Save the task

 

12. Open up groupexacmatches, then save it, no changes needed

 

13.  Open up findmatches – change the source to group exact matches (this isn’t needed if you are starting a new find matches), also ensure that ‘exclude exact matches’ is ticked

 

 

14. Open up groupmatches

 

set Merge Exact matches to ‘all matches’, save (always set it to all if you're not using the default exact key)

 

 

Exact matching is now incorporated into your process.

 

 

 

 

Was this article helpful?
0 out of 0 found this helpful

have a question or not finding what you're looking for?

Submit a ticket to get some help