- The client augments its own sales data with multiple third-party sources to get a comprehensive view of sales and the associated drivers. Joining so many sources, however, creates abundant data quality and integration issues. Each month, staff spend several days finding and correcting them.
- The client had developed a heuristic to guide data examiners to possible issues needing correction. There are four primary types of data issues, with associated remedial actions.
- We built a machine learning classifier to emulate the decisions a human would need to make, assigning each data anomaly to one of the four issue types and using the anomaly's attributes as features.
- In a validation study, the heuristic correctly identified and classified the issue type 63% of the time. The machine learning algorithm did so 86% of the time.
- Running the heuristic first, followed by the machine learning algorithm “batting cleanup,” dramatically reduces the time staff spend cleaning up the data.
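
The two-stage arrangement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the client's actual pipeline: the issue type names, the two numeric features, the rule thresholds in `heuristic`, and the toy training rows are all hypothetical, and the small nearest-centroid model is only a stand-in for whatever classifier was really used. It also assumes one reading of “batting cleanup”: the classifier handles the anomalies for which no heuristic rule fires.

```python
from collections import defaultdict
from math import dist

# Hypothetical labels: the write-up says there are four issue types but does not name them.
ISSUE_TYPES = ("type_a", "type_b", "type_c", "type_d")


def heuristic(features):
    """Hypothetical rule-based check. Returns an issue type, or None when no rule fires."""
    gap, ratio = features
    if gap > 10:        # assumed threshold, for illustration only
        return "type_a"
    if ratio < 0.1:     # assumed threshold, for illustration only
        return "type_b"
    return None


class NearestCentroid:
    """Tiny stand-in classifier: predicts the label whose mean feature vector is closest."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        # Column-wise mean of the rows belonging to each label.
        self.centroids = {
            label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, row):
        return min(self.centroids, key=lambda label: dist(row, self.centroids[label]))


def triage(features, model):
    """Run the heuristic first; fall back to the classifier when no rule fires."""
    label = heuristic(features)
    if label is not None:
        return label, "heuristic"
    return model.predict_one(features), "classifier"


# Toy, made-up training rows: (gap, ratio) -> issue type.
train_X = [(12.0, 0.5), (15.0, 0.6), (1.0, 0.05), (2.0, 0.02),
           (5.0, 0.9), (6.0, 0.8), (3.0, 0.4), (2.5, 0.5)]
train_y = ["type_a", "type_a", "type_b", "type_b",
           "type_c", "type_c", "type_d", "type_d"]

model = NearestCentroid().fit(train_X, train_y)

for anomaly in [(20.0, 0.5), (5.2, 0.88), (2.8, 0.42)]:
    issue, source = triage(anomaly, model)
    print(anomaly, issue, source)
```

Returning the source of each decision alongside the label is a design choice in this sketch: it would let staff see which stage flagged an anomaly and prioritize review accordingly.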