The Importance of Data Cleaning

Data cleansing or data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. [Source: Wikipedia]
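To make that definition concrete, here is a minimal sketch of the detect, then replace, modify, or delete loop in plain Python. The records, the field names, and the 0 to 120 age range are all assumptions made up for illustration; real rules depend on what the data actually means.

```python
# Hypothetical records; field names and validity rules are assumptions.
records = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": "-5"},      # inaccurate: negative age
    {"name": "", "age": "41"},         # incomplete: missing name
    {"name": "Dana", "age": "n/a"},    # incorrect: non-numeric age
]

def clean(rows):
    cleaned = []
    for row in rows:
        name = row["name"].strip()      # modify: trim stray whitespace
        if not name:
            continue                    # delete: row has no usable name
        try:
            age = int(row["age"])
        except ValueError:
            age = None                  # replace: unparseable age becomes missing
        if age is not None and not 0 <= age <= 120:
            age = None                  # replace: out-of-range age becomes missing
        cleaned.append({"name": name, "age": age})
    return cleaned

result = clean(records)
```

Note that every rule in the sketch encodes a decision about what counts as "dirty", and that decision can only be made by someone who understands the data.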

We should re-title “data cleaning” as “understanding the data”. Why?

1. It’s not a bad thing to spend 80 percent or more of our time deeply understanding the data. This step gives us a clear-cut idea of the data on which we will be running several studies or from which we will be drawing multiple inferences.

2. Cleaning the data well requires understanding its nuances. Cleaning is just a small part.

3. Data cleaning brings up the wrong image. We aren’t trying to make the data perfect; we’re trying to actively prep it for analysis. This is where many go wrong: they take the word literally and spend more time than is necessary. Making data perfect is very difficult, but prepping it for analysis is far easier.

4. Understanding the data is hard. It’s why it takes so much time. Cleaning feels too easy.

5. It’s a science in and of itself. We should treat it like that rather than considering it a mundane task or delegating it to others. Delegating it often causes unwanted problems, which in turn reinforce the bias that data cleaning is troublesome.
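One way to act on the points above is to profile the data before changing anything. The sketch below is only an illustration under stated assumptions: the rows, the column names, and the choice of summary statistics are hypothetical, and an empty string is assumed to mean a missing value.

```python
from collections import Counter

# Hypothetical rows; column names and values are assumptions.
rows = [
    {"city": "Paris", "temp_c": "21"},
    {"city": "paris", "temp_c": ""},
    {"city": "Lyon", "temp_c": "19"},
    {"city": "Paris", "temp_c": "210"},
]

def profile(rows):
    """Summarize each column so oddities show up before any cleaning."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        summary[column] = {
            "missing": sum(1 for v in values if v == ""),  # empty string = missing (assumed)
            "distinct": len(set(values)),
            "most_common": Counter(values).most_common(1)[0],
        }
    return summary

report = profile(rows)
```

Here the profile would show three distinct city values where a reader might expect two, which hints at inconsistent capitalization, and one missing temperature. Whether "210" is a typo or a different unit is something the summary cannot tell us; that takes understanding the data.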

Why not just accept that data cleaning is a part of the process of analyzing data and move on?

To use a cooking metaphor, why should a chef be surprised if the carrots arrive not yet cut up or skinned? You don’t HAVE to prep carrots before you cook them for every meal but, at least most of the time, you need to cut them up a bit. What chef does not keep their refrigerator organized so they can quickly pull ingredients during the RUSH? A chef who likes to be stressed out. What chef would complain that carrots need cutting before cooking? Most carrots do not come out of the ground ready to be served. It’s a part of the work.

At the same time, most executive chefs have staff that work for them. They source, cut, peel, organize, and probably even cook the carrots. Imagine saying to a customer at your restaurant, “Yes, we have carrots, but first we need to clean them, skin them, cut them, organize them and then cook them for you.” As a consumer, don’t be surprised that I’m not really interested in what you need to do to make the dishes. I’m only here to eat them and get back to my life. So, you have to understand the carrots first? I may question how many carrots you’ve cooked. At the end of the day, what matters is how tasty the dish prepared with the carrots was, not the carrots themselves.

Data Cleaning Cycle
