Dirty data is the result of software packages that never prevented erroneous data from being entered. In many businesses, most database records contain some erroneous data. Clinical data analysis revealed error rates as high as 27%* in some data fields.
Such errors can come from incomplete or incorrect data entry, or from inaccurate source data. Dirty data exists in every database and is the bane of every data scientist's life. Data scientists spend the majority of their time cleaning up erroneous data.
How much corrupted data do you have? That is the key question.
What constitutes dirty data:
Duplicate data
Incomplete data
Incorrect data
Incorrectly associated data
Poorly formatted data
Low-quality data
Invalid data
Inconsistent data
General data clutter
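Two of the categories above, duplicate data and poorly formatted data, can be flagged with a few lines of Python. This is only a rough sketch: the field names "name" and "email", the sample rows, and the deliberately loose email pattern are all assumptions made for illustration, not part of any particular product.

```python
import re

# Deliberately simple pattern (an assumption): something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def find_dirty(records, key_fields, email_field):
    """Return (duplicate_indexes, invalid_email_indexes) for a list of dicts."""
    seen = set()
    duplicates, invalid = [], []
    for index, record in enumerate(records):
        # Compare records on trimmed, lower-cased key fields so that
        # " ann lee " and "Ann Lee" count as the same customer.
        key = tuple(str(record.get(f) or "").strip().lower() for f in key_fields)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
        # Only non-empty emails are format-checked; empty ones are "incomplete".
        email = str(record.get(email_field) or "").strip()
        if email and not EMAIL_PATTERN.match(email):
            invalid.append(index)
    return duplicates, invalid

# Hypothetical sample rows
rows = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": " ann lee ", "email": "ANN@example.com"},
    {"name": "Raj Patel", "email": "raj(at)example"},
]

duplicates, invalid = find_dirty(rows, ["name"], "email")
print("Duplicate rows:", duplicates)
print("Badly formatted emails:", invalid)
```

Matching on a normalized key is a design choice: it catches near-identical entries that differ only in spacing or capitalization, but it will not catch misspellings.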
Audit Your Customer Data for Dirty Data
Only a high-level summary report can highlight the scope of the problem.
Field | Number of Instances | Number of Dirty Data Records |
First Name | 5790 | 137 (5790 - 5657) fields with missing data |
Surname | 5790 | 37 (5790 - 5657) fields with missing data |
 | 5790 | 5647 (5790 - 142) fields with missing data |
Companies | 5790 | 220 (5790 - 5570) fields with missing data |
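A summary like the one above could be produced along the following lines. This is a minimal sketch, assuming the customer records are already available as Python dictionaries; the function name audit_missing and the sample records are hypothetical, and "missing" is assumed to mean None, an empty string, or whitespace only.

```python
from collections import Counter

def audit_missing(records, fields):
    """Return (field, total_records, missing_count) for each field."""
    total = len(records)
    missing = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            # Treat None, "" and whitespace-only values as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return [(field, total, missing[field]) for field in fields]

# Hypothetical sample data
customers = [
    {"First Name": "Ann", "Surname": "Lee", "Companies": "Acme"},
    {"First Name": "", "Surname": "Ng", "Companies": None},
    {"First Name": "Raj", "Surname": "  ", "Companies": "Initech"},
]

report = audit_missing(customers, ["First Name", "Surname", "Companies"])
for field, total, dirty in report:
    print(f"{field} | {total} | {dirty} fields with missing data")
```

The same loop could be pointed at rows read from a CSV export or a database query; only the way the records are loaded would change.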
Various Methods for Data Cleaning / Data Scrubbing
There are four routes that companies typically take for their data cleanup projects. Each has its pros and cons.
Manually
Anybody in a company can carry out this task with guidance. However, it is a tedious exercise and prone to errors.
Using Excel
Excel skills are required to use the “plugin”. It is more efficient in time and cost than doing it manually.
Using Data Scientists
Data scientists are a very expensive resource, and cleaning data is the bane of their lives!
Extract, Transform, and Load (ETL) software
This is dedicated software that transforms and cleans data before it is loaded into a database or a social media app.
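To make the three ETL stages concrete, here is a small sketch using only the Python standard library. It is not how any specific ETL product works: the CSV text, the column names, the cleaning rules, and the customers table are all assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

FIELDS = ("first_name", "surname", "company")

# Hypothetical raw export containing a duplicate and an incomplete row
RAW_CSV = """first_name,surname,company
 ann ,lee,Acme
ANN,LEE,Acme
raj,,Initech
"""

def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim and title-case values, drop incomplete and duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        # Assumed rule: title-casing is acceptable for these names
        # (it would mangle names such as "McDonald").
        record = tuple((row[f] or "").strip().title() for f in FIELDS)
        if not all(record):   # incomplete data
            continue
        if record in seen:    # duplicate data
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned

def load(rows, connection):
    """Load: insert the cleaned rows into the target database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(first_name TEXT, surname TEXT, company TEXT)"
    )
    connection.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute("SELECT * FROM customers").fetchall()
print(loaded)
```

Only one of the three raw rows reaches the database here: the second is a duplicate once normalized, and the third has no surname. A real pipeline would usually log the rejected rows rather than silently dropping them.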