Dirty data is the result of software packages that never prevented erroneous data from being entered. In many businesses, most database records contain some erroneous data. Clinical data analysis has revealed error rates as high as 27%* in some data fields.

Such errors can stem from incomplete or incorrect data entry, or from inaccurate source data. Dirty data exists in every database and is the bane of every data scientist's life. Data scientists spend the majority of their time cleaning up erroneous data.

How much corrupted data do you have? That is the key question.

What constitutes dirty data:

Duplicate data

Incomplete data

Incorrect data

Incorrectly associated data

Poorly formatted data

Low-quality data

Invalid data

Inconsistent data

General data clutter
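A few of the categories above can be checked mechanically. The following is a minimal sketch in Python, not a definitive implementation: the sample records, the field names, and the deliberately simple email pattern are all assumptions made for illustration.

```python
import re

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"first_name": "Ann", "surname": "Lee", "email": "ann.lee@example.com"},
    {"first_name": "Ann", "surname": "Lee", "email": "ann.lee@example.com"},
    {"first_name": "Bob", "surname": "", "email": "bob@example.com"},
    {"first_name": "Cy", "surname": "Ng", "email": "cy.ng(at)example"},
]

# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_issues(rows):
    """Return (row_index, issue) pairs for three kinds of dirty data."""
    issues = []
    seen = set()
    for index, row in enumerate(rows):
        # Duplicate data: an identical record has already been seen.
        key = tuple(sorted(row.items()))
        if key in seen:
            issues.append((index, "duplicate"))
        seen.add(key)
        # Incomplete data: at least one field is empty or whitespace only.
        if any(not str(value).strip() for value in row.values()):
            issues.append((index, "incomplete"))
        # Invalid data: the email is present but does not look like an address.
        email = row.get("email", "").strip()
        if email and not EMAIL_PATTERN.match(email):
            issues.append((index, "invalid email"))
    return issues


issues = find_issues(records)
for index, issue in issues:
    print(f"record {index}: {issue}")
```

Categories such as incorrectly associated or low-quality data usually need business rules or human review, so they are left out of this sketch.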

Audit Your Customer Data for Dirty Data

Only a high-level summary report can highlight the scope of the problem.

Report generated from LinkedIn, reflecting missing fields

Field        Number of Instances   Number of Dirty Data Records
First Name   5790                  137 (5790 - 5657) fields with missing data
Surname      5790                  37 (5790 - 5657) fields with missing data
Email        5790                  5647 (5790 - 142) fields with missing data
Companies    5790                  220 (5790 - 5570) fields with missing data
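A summary like the one above can be produced by counting empty values per field. This is a minimal sketch, assuming the contacts have already been exported into a list of dictionaries; the field names and sample rows are hypothetical and are not taken from the LinkedIn report.

```python
from collections import Counter

# Hypothetical field names and sample rows, for illustration only.
FIELDS = ["first_name", "surname", "email", "company"]

contacts = [
    {"first_name": "Ann", "surname": "Lee", "email": "", "company": "Acme"},
    {"first_name": "", "surname": "Ng", "email": "", "company": "Acme"},
    {"first_name": "Bob", "surname": "Ray", "email": "bob@example.com", "company": None},
]


def audit_missing(rows, fields):
    """Count, per field, how many rows have no usable value."""
    missing = Counter()
    for row in rows:
        for field in fields:
            value = row.get(field)
            # Treat None, empty strings, and whitespace-only strings as missing.
            if value is None or not str(value).strip():
                missing[field] += 1
    return {field: missing[field] for field in fields}


report = audit_missing(contacts, FIELDS)
total = len(contacts)

# Print a plain-text summary: field, number of records, number missing.
print(f"{'Field':<12}{'Records':>8}{'Missing':>8}")
for field, count in report.items():
    print(f"{field:<12}{total:>8}{count:>8}")
```

The same counting approach extends to other checks, such as duplicates or badly formatted values, by swapping the condition inside the loop.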

                                                                                                                                   

Various methods for Data Cleaning / Data Scrubbing

There are four routes that companies can typically take for their data cleanup projects. Each has its pros and cons.

Manually

Anybody in a company can carry out this task with guidance. However, it is a tedious exercise and prone to errors.

Using Excel

Today, the Power Query Editor in Excel provides dedicated built-in functions for cleaning data.

Excel skills are required to use the “plugin”, but it is more efficient in time and cost than cleaning data manually.

Using Data Scientists

Data scientists are a very expensive resource, and cleaning data is the bane of their lives!

Extract, Transform, and Load (ETL) software

This is a dedicated software solution that transforms and cleans data before it is loaded into a database or a social media app.
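To make the three ETL stages concrete, here is a minimal sketch using only the Python standard library. It is not how any particular ETL product works: the CSV content, the column names, the cleaning rules (trim whitespace, lowercase emails, drop rows without an email, drop duplicate emails), and the in-memory SQLite target are all assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the column names are assumptions for illustration.
RAW_CSV = """first_name,surname,email
 Ann ,Lee,Ann.Lee@Example.com
Ann,Lee,ann.lee@example.com
Bob,Ray,
"""


def extract(text):
    """Extract: read the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalise emails, drop incomplete and duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        record = {key: (value or "").strip() for key, value in row.items()}
        record["email"] = record["email"].lower()
        if not record["email"]:
            continue  # incomplete: no email address
        if record["email"] in seen:
            continue  # duplicate: email already loaded
        seen.add(record["email"])
        cleaned.append(record)
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS contacts "
        "(first_name TEXT, surname TEXT, email TEXT PRIMARY KEY)"
    )
    connection.executemany(
        "INSERT INTO contacts VALUES (:first_name, :surname, :email)", rows
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT first_name, surname, email FROM contacts"
).fetchall()
print(loaded)
```

Keeping each stage in its own function means the cleaning rules can be changed or tested without touching how the data is read or stored.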


References

* blog.insycle.com/

Real Analytics 101