Data Validation Series: Common Errors in Datasets
Updated: Jun 2, 2021
In another blog post, we discussed the importance of validating data by highlighting the serious risks of performing analysis without proper validation. In this post, we will focus on the reason data validation exists in the first place: datasets often contain errors. Mistakes in datasets are much more common than one might expect: a 2017 Harvard Business Review study found that critical errors exist in up to 47% of newly created data records.
In a data-driven business world, it is vital that analysts conduct data verification to ensure maximum accuracy in their analyses. Previously, we discussed why data verification is necessary; now it is time to dive deeper into what needs verification and what indicators to look out for.
1. Null Values
The quickest and easiest indicator that verification is needed is null values. No matter what kind of file houses the data, these usually appear in two forms: either as a value such as “NULL” or “N/A”, or simply as an empty space where data should be. Null values typically cause issues when they appear in numeric fields. There are several different ways to approach this, but we will save that conversation for our next validation blog post.
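As a minimal sketch of spotting both forms of null, the snippet below scans raw rows for common null tokens and empty cells (the `NULL_TOKENS` list is an assumption; real datasets may use other markers):

```python
# Tokens that commonly signal a missing value in raw files (assumed list).
NULL_TOKENS = {"", "null", "n/a", "na", "none"}

def find_null_cells(rows, header):
    """Return (row_index, column_name) pairs for cells that look null."""
    flagged = []
    for i, row in enumerate(rows):
        for col, value in zip(header, row):
            if str(value).strip().lower() in NULL_TOKENS:
                flagged.append((i, col))
    return flagged

header = ["city", "population"]
rows = [["Boston", "675647"], ["Austin", "N/A"], ["", "121564"]]
print(find_null_cells(rows, header))  # [(1, 'population'), (2, 'city')]
```

Note that both the literal "N/A" value and the empty city cell are flagged, matching the two forms described above.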
2. Duplicated Rows
Repeating rows are another common issue found in all kinds of datasets. Unlike null values, however, duplicated rows are difficult to spot just by eyeing the data. It is important to specify that a duplicated row is one in which the exact same values appear in every column of the set. Even if two rows differ in only a single column, they should not be treated as duplicates.
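Because the definition requires an exact match across every column, duplicates can be detected mechanically. A small sketch of that check:

```python
def find_duplicate_rows(rows):
    """Return indices of rows that exactly repeat an earlier row in every column."""
    seen = set()
    dupes = []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes

rows = [
    ["Boston", "MA"],
    ["Austin", "TX"],
    ["Boston", "MA"],  # exact duplicate of row 0 -> flagged
    ["Boston", "NY"],  # one column differs -> not a duplicate
]
print(find_duplicate_rows(rows))  # [2]
```

The row that differs in only one column is deliberately not flagged, per the rule above.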
3. Typos
Typos are a nuisance that can be found in any kind of digitally created content, and datasets are no exception. The effect and visibility of these mistakes can vary tremendously in data. For example, if America and Amercia are both in the dataset they could be treated as two different categorical names when they’re meant to be the same.
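One way to surface likely typos like this is fuzzy string matching. The sketch below uses Python's standard `difflib` to pair rare labels with a near-identical, more frequent label (the frequency heuristic and the 0.85 cutoff are assumptions to tune per dataset):

```python
import difflib

def flag_likely_typos(labels, cutoff=0.85):
    """Pair rare labels with a near-identical, more frequent label."""
    counts = {}
    for v in labels:
        counts[v] = counts.get(v, 0) + 1
    # Treat more frequent spellings as more likely to be correct (a heuristic).
    ordered = sorted(counts, key=counts.get, reverse=True)
    pairs = []
    for i, rare in enumerate(ordered):
        match = difflib.get_close_matches(rare, ordered[:i], n=1, cutoff=cutoff)
        if match:
            pairs.append((rare, match[0]))
    return pairs

labels = ["America", "America", "America", "Amercia", "Canada"]
print(flag_likely_typos(labels))  # [('Amercia', 'America')]
```

Flagged pairs still need human review; similar but legitimately distinct labels (say, "Austria" and "Australia") can trip a purely automatic fix.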
4. Inconsistent Expressions
If a dataset involves information about dates or locations, a typical mistake is inconsistency of expression. In a column containing dates, the format those dates are reported in should ideally be the same throughout the entire dataset. For example, some dates may be reported as MM-DD-YYYY while others are reported in the format of “Month Day, Year”. Other examples of inconsistent expressions include countries (“United States” vs. its abbreviation “U.S.”) and addresses (“Drive” vs. “Dr.” when reporting the name of a road).
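Date inconsistencies like the two formats above can be normalized by trying each known format in turn. A minimal sketch, assuming only those two formats appear in the raw data:

```python
from datetime import datetime

# Formats assumed present in the raw data; extend as new variants appear.
KNOWN_FORMATS = ["%m-%d-%Y", "%B %d, %Y"]

def normalize_date(text):
    """Parse a date written in any known format and return ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print(normalize_date("06-02-2021"))    # 2021-06-02
print(normalize_date("June 2, 2021"))  # 2021-06-02
```

Converting every variant to a single canonical format, rather than picking a winner among the inconsistent ones, makes later comparisons and sorting reliable.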
5. Inconsistent Data Types
Data sources define each column as a certain type of data. This is usually done automatically by the system, so some users may not even be aware of these definitions. These “types” categorize every variable under a variety of options such as integer, float, or string. Integers represent whole numbers, floats represent numbers with decimals, and strings represent any combination of alphabetic, numeric, or special characters. While a couple of other types exist, these three are by far the most common. If a string value is accidentally entered in a column that is meant to be numeric, the dataset will either report it as a null value or convert the entire column type to string.
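A simple way to catch this before a system silently converts the column is to classify each raw value yourself and check that a column yields exactly one type. A sketch of that check:

```python
def classify(value):
    """Classify a raw string as 'integer', 'float', or 'string'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"

def column_types(values):
    """Return the set of types observed in one column; more than one signals a problem."""
    return {classify(v) for v in values}

print(column_types(["10", "25", "40"]))                # {'integer'}
print(sorted(column_types(["10", "25.5", "forty"])))   # ['float', 'integer', 'string']
```

A consistent numeric column produces a single type, while a stray string like "forty" shows up immediately in the mixed result.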
Below is a small sample dataset with examples of each of the 5 mistakes listed above. Please note that the information here has been randomly generated.