Fact Three - #DataGymnastics and #RegExHell Sucks

Conventional customer data matching processes 

Labor-intensive #DataGymnastics and #RegExHell



Matching algorithms are too limited in their inherent ability to locate matches accurately, so they rely on a ‘matchkey’. And matchkeys, in turn, rely on clean, extracted, parsed, transformed, normalized, standardized, and enriched data.

To create a matchkey, conventional matching solutions require analytics-ready data. What does analytics-ready mean?

It means you must have a data structure in place with standardized key fields and normalized values. In other words, you MUST extract, transform, standardize, clean and enrich data BEFORE you match the data. For instance, in a data type like an ‘Address’, the Premise Number, Street Name, Suite/Apt, City, State and Zip need to be extracted into individual fields and given correct, standardized values (e.g., Texas = TX and Avenue = Ave).
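To make that concrete, here is a minimal sketch of what extracting and standardizing a single address string can involve. Everything in it is an assumption for illustration only: the regular expression, the two tiny lookup tables, the field names and the parse_address helper are hypothetical and do not come from any particular matching product. It also only handles one input layout, which is exactly the kind of hand-tuning that turns into #RegExHell.

```python
import re

# Hypothetical lookup tables; a real pipeline would need far larger ones.
STATE_ABBREVIATIONS = {"texas": "TX", "california": "CA"}
STREET_SUFFIXES = {"avenue": "Ave", "street": "St", "boulevard": "Blvd"}

# Assumed layout: "<number> <street> <suffix>[, <unit>], <city>, <state> <zip>"
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<premise>\d+)\s+"
    r"(?P<street>.+?)\s+"
    r"(?P<suffix>[A-Za-z]+)\.?\s*,\s*"
    r"(?:(?P<unit>(?:suite|ste|apt|unit)\.?\s*\w+)\s*,\s*)?"
    r"(?P<city>[^,]+?)\s*,\s*"
    r"(?P<state>[A-Za-z ]+?)\s+"
    r"(?P<zip>\d{5})(?:-\d{4})?\s*$",
    re.IGNORECASE,
)


def parse_address(raw):
    """Split one address string into standardized fields.

    Returns None when the string does not fit the assumed layout.
    """
    match = ADDRESS_PATTERN.match(raw)
    if match is None:
        return None
    fields = match.groupdict()
    state = fields["state"].strip()
    suffix = fields["suffix"]
    return {
        "premise_number": fields["premise"],
        "street_name": fields["street"].title(),
        # Map long forms to the standard abbreviation; otherwise just fix case.
        "street_suffix": STREET_SUFFIXES.get(suffix.lower(), suffix.title()),
        "unit": fields["unit"],
        "city": fields["city"].title(),
        "state": STATE_ABBREVIATIONS.get(state.lower(), state.upper()),
        "zip": fields["zip"],
    }


print(parse_address("123 Main Avenue, Suite 4, Austin, Texas 78701"))
print(parse_address("PO Box 12, Austin TX"))  # does not fit the layout
```

The second call returns None because the string does not start with a premise number. Every such layout needs its own pattern or special case before the record can be standardized at all.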

For conventional solutions, analytics-ready data is critical to creating consistent matchkeys and building MatchCodes that return good results. A DBA or data analyst must create as much consistency and uniformity in each column as humanly possible. It’s a time-consuming process, and the process looks like this...






There is no question that a MatchCode with properly extracted, transformed and standardized data will find many correct matches.

But the fact is, it’s not reasonable to rely on extracted, transformed and standardized data. No amount of preparation can overcome all of the errors and variations in real data, so a single matchkey will fail whenever an error lands in one of its fields, and it will miss a lot of good matches.
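A small sketch shows both sides of this. The matchkey recipe below (first four letters of the last name, first initial, premise number, first four letters of the street name, zip) is a made-up example, not the formula of any specific product, and the records are invented. It assumes the fields have already been extracted and standardized.

```python
def build_matchkey(record):
    """Build a hypothetical matchkey from already-standardized fields."""
    return "|".join([
        record["last_name"].upper()[:4],
        record["first_name"].upper()[:1],
        record["premise_number"],
        record["street_name"].upper()[:4],
        record["zip"],
    ])


# Invented sample records describing the same customer.
clean = {"first_name": "Katherine", "last_name": "Johnson",
         "premise_number": "123", "street_name": "Main", "zip": "78701"}
nickname = dict(clean, first_name="Kate")    # variation the key tolerates
typo = dict(clean, last_name="Jhonson")      # transposed letters
bad_zip = dict(clean, zip="78710")           # transposed digits

key_clean = build_matchkey(clean)        # JOHN|K|123|MAIN|78701
key_nickname = build_matchkey(nickname)  # JOHN|K|123|MAIN|78701
key_typo = build_matchkey(typo)          # JHON|K|123|MAIN|78701
key_bad_zip = build_matchkey(bad_zip)    # JOHN|K|123|MAIN|78710

print(key_clean == key_nickname)  # True: the match is found
print(key_clean == key_typo)      # False: a good match is missed
print(key_clean == key_bad_zip)   # False: a good match is missed
```

The key survives the variation it was designed around (the nickname), but a single transposed character in any field it uses produces a different key, and an exact comparison on that key silently drops the match.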

The reality is, it’s labor-intensive #DataGymnastics and #RegExHell. This fact was recently called out in a Forbes article titled “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task”.

Data scientists spend around 80% of their time on preparing and managing data for analysis, and 76% of data scientists view data preparation as the least enjoyable part of their work.

The conventional approach doesn’t reflect the realities of data, it isn’t efficient, and it doesn’t work at scale.