Dirty Data

Dirty data is the result of software that never prevented erroneous data from being entered. Most businesses' database records contain some erroneous data. Clinical data analysis revealed error rates as high as 27%* in some data fields.

This could indicate incomplete or incorrect data entry or inaccurate data. Dirty data exists in every database and is the bane of every Data Scientist's life. Data Scientists spend the majority of their time cleaning up erroneous data.

How much corrupted data do you have? That is the key question.

What constitutes dirty data:

Duplicate data

Incomplete data

Incorrect data

Incorrectly associated data

Poorly formatted data

Low-quality data

Invalid data

Inconsistent data

General data clutter
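Several of the categories above can be checked mechanically. The following is a minimal sketch, assuming records are held as Python dictionaries; the field names, the sample records, and the deliberately loose email pattern are all hypothetical and chosen only for illustration. It covers three of the categories (duplicate, incomplete, and invalid data) and is not a complete data quality check.

```python
import re
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
RECORDS = [
    {"first_name": "Ann", "surname": "Lee", "email": "ann.lee@example.com"},
    {"first_name": "Ann", "surname": "Lee", "email": "ann.lee@example.com"},  # duplicate
    {"first_name": "Bob", "surname": "", "email": "bob@example.com"},  # incomplete
    {"first_name": "Cy", "surname": "Ng", "email": "cy.ng(at)example"},  # invalid email
]

# Deliberately loose pattern (an assumption): something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_duplicates(records):
    """Return each record that appears more than once (exact match on all fields)."""
    counts = Counter(tuple(sorted(r.items())) for r in records)
    return [dict(key) for key, n in counts.items() if n > 1]


def find_incomplete(records):
    """Return records with at least one empty or whitespace-only field."""
    return [r for r in records if any(not (v or "").strip() for v in r.values())]


def find_invalid_emails(records):
    """Return records whose non-empty email does not match EMAIL_RE."""
    return [r for r in records if r.get("email") and not EMAIL_RE.match(r["email"])]


if __name__ == "__main__":
    print("duplicates:", find_duplicates(RECORDS))
    print("incomplete:", find_incomplete(RECORDS))
    print("invalid emails:", find_invalid_emails(RECORDS))
```

Exact-match duplicate detection is the simplest possible choice here; real customer data usually also needs fuzzy matching (for example, differing capitalization or spacing), which this sketch does not attempt.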

Audit Your Customer Data for Dirty Data

Only a high-level summary report can highlight the scope of the problem.

Printout reflecting missing fields in LinkedIn (*report generated from LinkedIn*)


Field	Number of Instances	Number of Dirty Data records
First Name	5790	137 (5790 - 5657) fields with missing data
Surname	5790	37 (5790 - 5657) fields with missing data
Email	5790	5647 (5790 - 142) fields with missing data
Companies	5790	220 (5790 - 5570) fields with missing data
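A summary like the one above can be produced with a short script. The following is a minimal sketch, assuming the contact data is available as a CSV export with a header row; the sample data and the `audit_missing` helper are hypothetical, and the column names simply mirror the fields in the report.

```python
import csv
import io

# Hypothetical export (an assumption); a real audit would read the file from disk.
SAMPLE_CSV = """First Name,Surname,Email,Companies
Ann,Lee,ann.lee@example.com,Acme
Bob,,,Globex
,Ng,,
"""


def audit_missing(csv_text):
    """Return (total_rows, {field: missing_count}) for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in missing:
            # Treat empty and whitespace-only values as missing.
            if not (row.get(name) or "").strip():
                missing[name] += 1
    return total, missing


if __name__ == "__main__":
    total, missing = audit_missing(SAMPLE_CSV)
    print("Field\tNumber of Instances\tMissing")
    for field, count in missing.items():
        print(f"{field}\t{total}\t{count}")
```

Counting only empty values keeps the audit cheap to run; it says nothing about values that are present but wrong, which need separate validation rules.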

Various methods for Data Cleaning / Data Scrubbing

Companies can typically take one of four routes for a data cleanup project. Each has its pros and cons.

Manually

Anybody in a company can carry out this task with guidance. However, it is a tedious exercise and prone to errors.

Using Excel

*Today, the Power Query Editor in Excel has built-in, dedicated functions to clean data.*

Excel skills are required to use the “plugin”. It is more efficient in time and cost than doing it manually.

Using Data Scientists

Data Scientists are a very expensive resource, and data cleaning is the bane of their lives!

Extract, Transform, and Load (ETL) software

This is a dedicated software solution that transforms and cleans data before it is loaded into a database or social media apps.
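To make the extract, transform, and load steps concrete, here is a minimal sketch using only the Python standard library. It is not how any particular ETL product works; the CSV layout, the `contacts` table, and the cleaning rules (trim whitespace, lowercase emails, drop exact duplicates) are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export (an assumption); note the stray spaces and mixed case.
RAW_CSV = """first_name,surname,email
 Ann ,Lee,Ann.Lee@Example.com
Ann,Lee,ann.lee@example.com
Bob,Ray,bob@example.com
"""


def extract(csv_text):
    """Extract: read the raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: trim whitespace, lowercase emails, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        record = (
            row["first_name"].strip(),
            row["surname"].strip(),
            row["email"].strip().lower(),
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def load(records, conn):
    """Load: insert the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts (first_name TEXT, surname TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", records)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(transform(extract(RAW_CSV)), conn)
    for row in conn.execute("SELECT * FROM contacts"):
        print(row)
```

Cleaning before the load step means the first two rows, which differ only in spacing and capitalization, collapse into a single contact instead of entering the database as duplicates.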


References

* blog.insycle.com/