Data Validation Series: Methods to Clean Up Datasets
Updated: Jun 2, 2021
Welcome to the third post of Bear Cognition’s Data Validation series. In the first piece, we covered the importance of validation. The second piece continued to build our understanding of validation by highlighting some of the most common errors in datasets. Now that we understand why data should be validated and what warning signs to look for, we are finally going to get into the details of how analysts clean data before the visualization process.
Proper validation of large datasets can seem like a daunting task; not only do you have to spend the entire time looking at raw data rather than creating visualizations, but it also can take quite a bit of time and precision to correctly clean the data. Luckily, there are many tools that can vastly improve accuracy and tremendously speed up the process overall. Spreadsheet applications such as Excel and Google Sheets supply the user with various tools to find errors or duplicate values. More advanced software such as Tableau’s Prep Builder as well as other ones that are built to help the user seek out numerical errors in large datasets can provide these tools. These certainly are only a few examples of the wide variety of data verification tools that exist to help analysts in their data verification processes.
No matter which methods or software an analyst chooses to verify data, the most simple of errors to clean up will always be the ones that involve non-numeric data. This non-numeric data includes the presence of typos, the consistency of categorical data labels, and the correct identification of variable types. Many data software services will automatically detect a variable type for each column; while these are usually accurate, analysts should still use their own understanding of the data to verify that the variables have been categorized accurately. For example, if a category in a dataset is determined by a few numbers such as 1-3, it could be incorrectly labelled as a numeric column. While some data analysis software allows for variable types to be changed in the middle of a project, it can be easier for analysts to label these columns correctly during verification. To fix typos and categorical labels (such as “United States” vs. the abbreviated “USA”), analysts can use data cleaning tools to view every unique value in alphabetical order which helps seek out errors much faster than parsing through entire columns.
While an analyst’s understanding of the data is important for cleaning the categorical side of data, it is much more important for cleaning up errors among the numerical fields. It does not take many incorrect values to have serious consequences on the overall outcome of an analysis; something as small as an accidental extra digit can completely damage the accuracy of data. There are a few key aspects of every numeric field in a dataset that analysts should check:
- Range: Knowing the possible values for a particular variable is important to finding errors. Common examples of where impossible values can appear are in physically measurable fields. If a dataset includes the age of customers, then finding a value of -26 is a clear indication of an incorrect value. The same applies to measures such as weight, height, or distance as well as values that have a strict range such as a proportion between 0-1.
- Distributions & Suspicious Responses: One of the more advanced capabilities of data cleaning software is the ability to provide analysts with visuals of the distribution of values across every given variable. Visually seeing these allows for suspicious responses, also known as “outliers”, to be identified easily. Upon finding an outlying value, analysts should use their understanding of the variable to compare it to the other values and determine if the number could be genuinely accurate or if it appears to be an error. A common mistake that may result in an outlying error is a numeric typo of an extra digit or two at the end of an otherwise correct number. Below are two useful visualizations to use in data verification to identify outliers: histograms (left) and box plots (right); the outlying value is highlighted in yellow.