Parker Smith

Data Validation Series: Methods to Clean Up Datasets

Updated: Apr 16

Welcome to the third post of Bear Cognition’s Data Validation series. In the first piece, we covered the importance of validation. The second piece continued to build our understanding of validation by highlighting some of the most common errors in datasets. Now that we understand why data should be validated and what warning signs to look for, we are finally going to get into the details of how analysts clean data before the visualization process.


Proper validation of large datasets can seem like a daunting task; not only does it mean spending time on raw data rather than on visualizations, but it also takes precision to clean the data correctly. Luckily, there are many tools that can vastly improve accuracy and speed up the process. Spreadsheet applications such as Excel and Google Sheets give users built-in features for finding errors and duplicate values, while more advanced software such as Tableau's Prep Builder is designed specifically to help users seek out errors in large datasets. These are only a few examples of the wide variety of tools available to support analysts during verification.


Categorical Data


No matter which methods or software an analyst chooses to verify data, the simplest errors to clean up will always be the ones involving non-numeric data: typos, inconsistent categorical labels, and misidentified variable types. Many data tools automatically detect a variable type for each column; while these detections are usually accurate, analysts should still use their own understanding of the data to confirm that each variable has been categorized correctly. For example, if a category is coded with a few numbers such as 1-3, the column could be incorrectly labelled as numeric. While some analysis software allows variable types to be changed in the middle of a project, it is usually easier to label these columns correctly during verification. To fix typos and inconsistent labels (such as "United States" vs. the abbreviated "USA"), analysts can use data cleaning tools to view every unique value in alphabetical order, which makes spotting errors far faster than reading through entire columns.
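For analysts who clean data in code rather than a spreadsheet, here is a minimal sketch of those two checks, assuming the data sits in a pandas DataFrame; the file name ("customers.csv") and column names ("region_code", "country") are hypothetical placeholders rather than anything from a specific tool.

```python
import pandas as pd

# Hypothetical file and column names -- substitute your own dataset.
df = pd.read_csv("customers.csv")

# Check the variable type inferred for each column.
print(df.dtypes)

# A coded category (e.g., regions labelled 1-3) is often read in as numeric;
# cast it to a categorical type so it is not treated as a measurement.
df["region_code"] = df["region_code"].astype("category")

# List every unique label in alphabetical order to spot typos and
# inconsistent labels such as "United States" vs. "USA".
print(sorted(df["country"].dropna().unique()))
```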


Numerical Data


While an analyst's understanding of the data is important for cleaning the categorical side, it is even more important for cleaning up errors among the numerical fields. It does not take many incorrect values to have serious consequences for the outcome of an analysis; something as small as an accidental extra digit can badly distort the results. There are a few key aspects of every numeric field that analysts should check:


- Range: Knowing the possible values for a particular variable is important for finding errors. Impossible values commonly appear in physically measurable fields: if a dataset includes the age of customers, a value of -26 is a clear indication of an error. The same applies to measures such as weight, height, or distance, as well as values with a strict range such as a proportion between 0 and 1. (A short code sketch after this list illustrates this check alongside the outlier check below.)


- Distributions & Suspicious Responses: One of the more advanced capabilities of data cleaning software is the ability to show analysts the distribution of values for every variable. Seeing these distributions makes it easy to identify suspicious responses, also known as "outliers". Upon finding an outlying value, analysts should use their understanding of the variable to compare it with the other values and judge whether the number could genuinely be accurate or appears to be an error. A common mistake that produces an outlier is a numeric typo, such as an extra digit or two at the end of an otherwise correct number. Below are two useful visualizations for identifying outliers during verification: histograms (left) and box plots (right); the outlying value is highlighted in yellow.
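For readers working in code, here is a minimal sketch of both checks, assuming pandas; the columns ("age", "completion_rate") and the bounds used are hypothetical examples, and the 1.5 x IQR rule is simply one common way to flag the same points a box plot would highlight.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file and column names

# Range check: flag values outside physically possible bounds.
impossible_ages = df[(df["age"] < 0) | (df["age"] > 120)]
bad_proportions = df[(df["completion_rate"] < 0) | (df["completion_rate"] > 1)]
print(len(impossible_ages), "impossible ages;", len(bad_proportions), "out-of-range proportions")

# Outlier check: the 1.5 * IQR rule mirrors what a box plot highlights.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers[["age"]])

# The same distributions can also be inspected visually:
# df["age"].plot.hist()      # histogram
# df.boxplot(column="age")   # box plot
```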


Null Values


In the previous verification post, we discussed how null values are arguably the most frequent errors in datasets. "Null" values are a literal absence of data, denoted by blank cells in a data source or by a null identifier such as "NULL" or "N/A". Much like the other errors we have discussed, null values can be found with ease using data verification tools.
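In code, locating nulls is a one-liner; here is a quick sketch assuming pandas, with a hypothetical file and the common text placeholders treated as missing values.

```python
import pandas as pd

# Blank cells arrive as missing values automatically; na_values also catches
# text placeholders such as "NULL" or "N/A" if they appear as literal strings.
df = pd.read_csv("customers.csv", na_values=["NULL", "N/A"])

# Count the missing entries in each column.
print(df.isna().sum())

# Pull the rows that contain at least one missing value for closer review.
print(df[df.isna().any(axis=1)])
```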


As we have discussed what data verification software can do, there has been a running theme: the identification of errors. These tools save analysts a great deal of time by bringing immediate attention to potential mistakes without the workload of manually inspecting every single row of data. Finding these mistakes is the easy part, however; the harder part is deciding exactly what to do with them.


Actions


Several options exist for dealing with errors in datasets. Each has its own pros and cons, but ultimately analysts should use their understanding of the data to determine which choice is best for each error found during verification.


- Row Deletion: This option is generally the one analysts prefer least, since it deletes information. It typically is not done because a single field has an error, but because the vast majority of a row (70% or more) contains errors.

- Replacement using a statistic: If a numeric error needs to be replaced, a quick and common fix is to substitute the mean or median of the correct values already in the field. This takes little of an analyst's time and guarantees that the replacement falls within the variable's typical range. However, it naturally adds bias to the data by concentrating more values around the chosen statistic. While this is the fastest replacement method, the options below are more accurate (a code sketch after this list illustrates several of them).

- Replacement using prediction: If analysts want more realistic replacement values, they can build a regression model to predict a value wherever an error arises. Building these models adds time to the verification process but yields more accurate replacements.

- Replacement using algorithms: By far the most time-consuming option is to use a machine learning algorithm, such as K Nearest Neighbors or Random Forests, to impute the missing values. More flexible than regression, these algorithms adapt well to non-linear data and can replace errors in categorical fields as well as numeric ones.

- Replacement using "unknown": If categorical data contains null values, it can be difficult to determine good replacement categories without one of the algorithms above. For a quicker fix, analysts can replace these nulls with an "other" or "unknown" category to signify that the true data for those observations was not available. This is the common go-to fix for categorical data when algorithms are not used.
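To make a few of these options concrete, here is a minimal sketch assuming pandas and scikit-learn's KNNImputer; the file, column names, and thresholds are hypothetical, and a regression-based replacement would follow the same pattern with a fitted model in place of the imputer.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("customers.csv")  # hypothetical file and column names

# Row deletion: drop only rows where the vast majority of fields are missing
# (here, keep a row only if roughly 30% or more of its fields are present).
df = df.dropna(thresh=int(0.3 * df.shape[1]))

# Replacement using a statistic: fill numeric gaps with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Replacement using an algorithm: K Nearest Neighbors imputes each missing
# value from the most similar complete rows (numeric columns only here).
numeric_cols = ["age", "tenure_months", "order_total"]
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])

# Replacement using "unknown": label missing categories explicitly.
df["industry"] = df["industry"].fillna("Unknown")
```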

Replacement with Consideration to Error Values


For null values, analysts should always turn to the options above. For actual values that have been identified as errors, however, analysts should also take the incorrect data itself into account when choosing a replacement.


- Categorical: If a category value is clearly a typo of another category in its field, analysts are generally safe to replace it with the correct spelling; for example, finding "New Yokr" in a field of US states and manually changing it to "New York".

- Numeric: For many impossible or extremely outlying values, a typo can also be the root cause; an age value of -26, for example, was likely meant to be 26. In such cases, analysts should check whether the corrected value would make sense for that row, for instance by comparing it with a replacement generated by one of the methods above. If the generated replacement is reasonably close to 26, it is fair to go ahead with 26 and assume a typo was at play.
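As a brief illustration, the same two corrections might look like the following in pandas; the column names, the specific typo mapping, and the plausibility rule (requiring the sign-flipped age to fall within the 1st to 99th percentile of the valid ages) are hypothetical choices rather than fixed standards.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file and column names

# Categorical: map an obvious typo back to the correct label.
df["state"] = df["state"].replace({"New Yokr": "New York"})

# Numeric: before assuming an age of -26 was meant to be 26, check that the
# sign-flipped value sits comfortably within the rest of the column.
valid_ages = df.loc[df["age"] >= 0, "age"]
low, high = valid_ages.quantile([0.01, 0.99])
suspect = df["age"] < 0
fixable = suspect & df["age"].abs().between(low, high)
df.loc[fixable, "age"] = df.loc[fixable, "age"].abs()
```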


Seeking out issues in raw data is difficult, but luckily there are plenty of tools that make finding these errors quicker and easier than ever. Once identified, errors can be handled in many ways; to determine which method will produce the most accurate replacement in each case, analysts should rely on their understanding of the data fields as well as their knowledge of these different remedies. At Bear Cognition, every dataset is carefully verified using these methods to ensure that the data behind our visualizations remains accurate.


If your business is looking for actionable intelligence with accurate data, please contact us today!
