Data Validation Series: Common Errors in Datasets

Updated: Apr 26

In another blog post, we discussed the importance of validating data by highlighting the serious risks posed by performing analysis without proper validation. In this post, however, we will focus on the reason that data validation exists; datasets often have errors within them. Mistakes in datasets are much more common than one might expect: In 2017 Harvard Business Review conducted a study which found that critical errors exist in up to 47% of new data records.

In a business world that is data-driven, it is vital that analysts conduct data verification to ensure maximum accuracy in their analyses. Previously, we discussed why data verification is necessary- now it is time to dive deeper into understanding what needs verification and what indicators to lookout for.

1. Null Values

The quickest and easiest indicator that verification is needed is null values. No matter what kind of file houses the data, these usually appear in two forms: either as a value such as “NULL” or “N/A”, or simply as an empty space where data should be. Null values typically cause issues when they appear in numeric fields. There are several different ways to approach this, but we will save that conversation for our next validation blog post.

2. Duplicated Rows

Repeating rows are another common issue found in all kinds of datasets. Unlike null values, however, duplicated rows are difficult to spot just by eyeing the data. It is important to specify that duplicated rows are noted by the exact same value appearing across every column in the entire set. Even if two rows only have one column with different values, they should not be treated as duplicates.

3. Typos

Typos are a nuisance that can be found in any kind of digitally created content, and datasets are no exception. The effect and visibility of these mistakes can vary tremendously in data. For example, if America and Amercia are both in the dataset they could be treated as two different categorical names when they’re meant to be the same.

4. Inconsistent Expressions

If a dataset involves information about dates or locations, a typical mistake can be the inconsistency of expressions. In a column that specifies certain days, the format that those dates are reported in should ideally be the same throughout the entire dataset. For example, some dates may be reported in MM-DD-YYYY while others are reported in the format of “Month Day, Year”. Other examples of inconsistent expressions include countries (“United States” vs. its abbreviation “US.”) and addresses (“Drive” vs. “Dr.” when reporting the name of a road).

5. Inconsistent Data Types

Data sources define each column as a certain type of data. This is usually done automatically by the system, so some users may not even be aware of these definitions. These “types” categorize every variable under a variety of options such as an integer, float, or string. Integers represent whole numbers, floats represent numbers with decimals, and strings represent any combination of alphabetic fields, numeric characters, or special characters. While a couple of other types exist, these three are by far the most common. If a string type is accidentally entered in a column that is meant to be numeric, the dataset will either report it as a null value or convert the entire column type to string.

Below is a small sample dataset with examples of each of the 5 mistakes listed above. Please note that the information here has been randomly generated and does not reflect any real businesses or locations.

6. Incorrect Values

By far the most difficult data error to deal with is inaccurate numeric values. Each of the previously listed issues can be identified quickly through the appropriate tools. However, if a literal number appearing in the dataset is wrong, how could one even tell? Believe it or not, there actually are reliable methods of sniffing out numerical errors in data. These steps, along with solutions for each of the sections above, will be outlined in our next Data Verification blog post.

Sources of Error

As you can tell from the list above, there’s multiple ways for there to be mistakes inside of datasets. How exactly do these issues arise?

In data sources that rely on manual input, human mistakes can account for a large amount of errors (most commonly typos and incorrect values). While this explains a large part of why flawed data exists, other factors are also to blame such as:

  • Errors in the creation of automated data collection models

  • Outdated datasets that were never fixed due to the cost and/or effort of updating

  • Errors from combining multiple data sources improperly

  • Potentially malicious behavior targeted at datasets by hackers

Bear Cognition Offerings

No matter what kind of data source you are looking at, there is a possibility that it has errors lingering somewhere throughout it. In our previous post we detailed the risk that these errors may pose. Searching for the various types of mistakes in data is not the most exciting task for an analyst, but it is arguably the most important one. At Bear Cognition, we assure complete verification of our customer’s data to ensure accurate results. If your company is ready to take the step into data driven decision-making with reliable analysis, contact us today!

13 views0 comments

Recent Posts

See All