No Visualization without Validation

Updated: Jun 2, 2021

When you hear the words “data analytics”, what is the first image that forms in your mind? Chances are it is a visualization of some sort. Whether it’s a simple bar chart or a more complex time series graph, more likely than not it is some version of a colorful graphic in your mind's eye. Visualization is the face of data analysis for a reason: it provides insight where mundane rows of numbers simply cannot. Visualizations also add a strong degree of creativity to the field of analytics. If you browse any popular data visualization community, you will see that the visual elements of projects are becoming more imaginative and visually impressive over time. While the artistic side of data analytics can be exciting and innovative, the ultimate purpose of any project will always be to derive insights from a given dataset. Visualization is all about uncovering truths. After all, “the numbers don’t lie”, right?


Believe it or not, this is quite often not the case. Sure, numbers themselves technically cannot lie. However, issues such as typos, empty cells, and duplicated values frequently exist in base datasets. When a source of data has such mistakes, the visualizations used in the analysis will show incorrect results, and in that case the numbers are lying. None of the visualizations will matter if the core data used to build them is flawed. Accuracy will always be the most important aspect of a project, so it is vital to follow the correct steps to ensure that datasets are as correct as they can possibly be. This process is generally referred to as data validation, though other terms such as cleaning, cleansing, and verification are also used. While visualizing data is the exciting and creative side of analysis, data validation is the bland and tedious side. Luckily, there are a number of standard approaches and preparation tools that make this easier than ever. These will be explored in an upcoming post, but first, let us start by discussing the specifics of data validation and what makes it such an important part of a project.


Data validation is the process of exploring and editing raw datasets to ensure that the most accurate results possible appear in the analysis. “Most” is a key word here: obviously, one cannot verify that every single piece of information entered is 100% correct. However, there are common flaws in datasets that are easy to identify, such as missing values, extreme outliers, or typos in alphabetic fields. The time it takes to validate data varies with the structure and size of the data source.
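To make those common flaws concrete, here is a minimal sketch of how they might be detected in code. It is only an illustration under stated assumptions, not a prescribed method: the records, the field names (`id`, `region`, `sales`), the list of allowed regions, and all four helper functions are hypothetical. The outlier check uses a modified z-score based on the median, which is one of several reasonable choices.

```python
import difflib
import statistics

# Hypothetical raw records; every name and value here is invented for illustration.
ROWS = [
    {"id": 1, "region": "North", "sales": 120.0},
    {"id": 2, "region": "South", "sales": 135.0},
    {"id": 3, "region": "Nroth", "sales": 128.0},  # typo in an alphabetic field
    {"id": 4, "region": "East", "sales": None},    # empty cell
    {"id": 5, "region": "West", "sales": 9900.0},  # extreme outlier
    {"id": 2, "region": "South", "sales": 135.0},  # exact duplicate of an earlier row
    {"id": 6, "region": "East", "sales": 131.0},
]

# Assumed list of allowed category values.
VALID_REGIONS = ["North", "South", "East", "West"]


def find_missing(rows):
    """Return ids of rows in which any field is None or an empty string."""
    return [r["id"] for r in rows if any(v is None or v == "" for v in r.values())]


def find_duplicates(rows):
    """Return ids of rows that exactly repeat an earlier row."""
    seen, dupes = set(), []
    for r in rows:
        key = (r["id"], r["region"], r["sales"])
        if key in seen:
            dupes.append(r["id"])
        seen.add(key)
    return dupes


def find_outliers(rows, threshold=3.5):
    """Return ids whose sales value has a modified z-score above the threshold.

    The score uses the median and the median absolute deviation (MAD), which
    are less distorted by the outliers themselves than the mean and stdev.
    """
    values = [r["sales"] for r in rows if r["sales"] is not None]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [
        r["id"]
        for r in rows
        if r["sales"] is not None
        and 0.6745 * abs(r["sales"] - median) / mad > threshold
    ]


def find_typos(rows, valid=VALID_REGIONS):
    """Return (id, value, suggestion) for region values outside the allowed list."""
    issues = []
    for r in rows:
        value = r["region"]
        if value not in valid:
            match = difflib.get_close_matches(value, valid, n=1)
            issues.append((r["id"], value, match[0] if match else None))
    return issues


report = {
    "missing": find_missing(ROWS),
    "duplicates": find_duplicates(ROWS),
    "outliers": find_outliers(ROWS),
    "typos": find_typos(ROWS),
}
print(report)
```

Checks like these only flag suspicious rows. Deciding whether to correct, drop, or keep each one still takes judgment about the dataset.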


Take the two dashboards below, for example. They illustrate just how drastically errors in a dataset can change the outcome of an analysis. Both are built from the same base dataset, but one is cleaned and verified while the other contains common errors such as typos and empty cells. There is no surefire way of identifying incorrect data from the analysis alone. As you can see, data validation can drastically change an important analysis project.

 

One might wonder why this tedious and potentially long process is so important to data analysis. After all, if there is some sort of data available, visualizations can be created. Just looking at charts and graphs, there is no obvious way to see inaccuracies in the data. Why should we take the time to maximize the accuracy of data?


The answer to this question is probably better phrased as a warning: deciding to skip the validation step poses serious risks to you as an analyst. While the negative impact may not be immediate, it will certainly come back to hurt you eventually.


Risks of not Validating Data