As a data analyst, I spend my days at work turning data into actionable insights for clients. Read: translating numbers into real-world decisions. The ability of numbers and mathematics to quantify and describe the world we live in has always fascinated me, leading me to a degree in engineering and then on to a career in data. But what if the data we meticulously capture doesn’t actually tell the whole story? Even if we’ve double- and triple-checked the math, ensured our data pipelines are operating correctly, and verified the veracity of the data we’re collecting, there could still be a problem.
Recently, I read an article in The New York Times that completely changed how I think about the context of a dataset. The article pointed out that our data collection methods can fail to properly describe the real world, not because of improper analysis, but because of what we choose to measure.
The article prompted me to consider what we unintentionally leave out of datasets when we choose what to measure without considering the context. What or who is left out of the equation? Here, we come to the concept of biased data.
What is Bias in Data?
The idea of bias in data was not new to me when I started considering how context affects the data I work with every day. What I didn’t fully understand was how biases present themselves in data. Where do they come from? How do they quietly insert themselves into rows of information? After all, aren’t numbers supposed to be a straightforward, unfeeling source of truth?
While we’ll get into the ways bias creeps into datasets a bit further down, we first need to explore how bias presents itself.
Bias in AI and ML Models
The reason most people are familiar with bias in the world of data science and analytics is artificial intelligence and machine learning models.
One prominent example is the controversy surrounding the Apple Card. The card began attracting attention when people noticed that men were seemingly approved for much higher credit limits than women, even when they had similar incomes and credit scores. Was someone at Apple guilty of discrimination in the approval process? Not exactly. It’s more likely that the algorithm used to assess the risk of lending to certain borrowers was improperly trained, creating a bias against women.
Let’s take a step back. What is an algorithm? Machine learning? Time for a brief overview.
Machine learning models can be thought of as a black box. The training process starts when a data scientist provides a dataset that can be split into two parts: the input (the prompt) and the desired output. Once the model has learned from this information, new data (just the input) is run through the model to see how accurately it can predict the desired output. This process is usually repeated until a desired level of accuracy is reached.
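The train-then-evaluate loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: a one-feature threshold classifier stands in for the black box, and all the data is made up.

```python
# Minimal sketch of the train/evaluate loop described above.
# A one-feature threshold classifier stands in for the "black box";
# the (credit_score, approved) pairs are invented for illustration.

def train(examples):
    """Pick the score threshold that best separates approved (1) from denied (0)."""
    best_threshold, best_correct = 0.0, -1
    for candidate, _ in examples:
        correct = sum(1 for score, label in examples
                      if (1 if score >= candidate else 0) == label)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

def predict(threshold, score):
    return 1 if score >= threshold else 0

# The "prompt" (credit score) and the desired output (approved or not).
training_data = [(580, 0), (610, 0), (650, 1), (700, 1), (720, 1)]
holdout_data = [(600, 0), (680, 1)]

model = train(training_data)
accuracy = sum(predict(model, score) == label
               for score, label in holdout_data) / len(holdout_data)
print(model, accuracy)  # 650 1.0
```

On this toy data the model "teaches itself" a cutoff of 650 and scores perfectly on the held-out pair; real training loops are far more complex, but the shape of the process is the same.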
In short, the model teaches itself how to make decisions and most of the time is pretty much left alone after the resulting output is deemed satisfactory.
Back to Apple’s problem – it is probable that the model for approving credit card applicants had been trained on a dataset in which women were typically approved for smaller credit limits. This led the model to treat gender (or features correlated with it) as a defining risk factor over things like credit score or income level, thus creating a discrimination problem.
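A toy calculation makes the mechanism concrete. If a model does nothing smarter than reproduce group averages from biased historical approvals, two applicants with identical incomes still get very different answers. Every figure here is invented for illustration; this is not Apple’s actual model.

```python
# Hypothetical illustration of how a biased training set shapes a model.
# Historical approvals (invented) gave women lower limits at the same income,
# so a model that learns group averages simply reproduces that gap.

history = [
    # (gender, income, approved_limit)
    ("M", 80_000, 20_000), ("M", 80_000, 18_000),
    ("F", 80_000, 9_000),  ("F", 80_000, 11_000),
]

def average_limit(gender):
    limits = [limit for g, _, limit in history if g == gender]
    return sum(limits) / len(limits)

# Identical incomes, very different "predictions" – the bias in the
# training data becomes the bias in the model.
print(average_limit("M"))  # 19000.0
print(average_limit("F"))  # 10000.0
```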
Bias in Data Collection
Moving on from the most familiar form of data bias, I want to focus on biases introduced by the data collection process itself.
As a very simple example, if survey data on which food brands people buy is collected outside grocery stores in a single neighborhood, the data could end up being biased in several ways.
Collecting data in only one neighborhood can exclude diverse perspectives that would have been found in other neighborhoods. For instance, the chosen neighborhood may contain a majority of people from one socioeconomic class. People of a higher socioeconomic class may tend to buy more brand-name, luxury items, while people of a lower socioeconomic class gravitate toward cheaper necessities. Either way, the data gets skewed.
Other neighborhoods also may contain a different ethnic demographic. As an example, if the survey is conducted in a neighborhood with a higher Hispanic population than the surrounding neighborhoods, any data collected may show a higher rate of purchases of Hispanic food items such as Mexican rice, cumin, and tomatillos that may not be representative of the entire geographic area or city’s common grocery purchases.
The survey could also be affected by the type of person who shops at grocery stores. If mothers or other female family members do the grocery shopping for their families more often than male family members, this could skew the dataset to overrepresent women’s buying habits.
These biases can be a completely unintentional side effect of poor data collection methods. Not to say that biased data collection is never intentional, but the questions we ask and how we decide to answer them can have a huge impact on the integrity of the data we collect if we don’t take the proper care.
An improvement to the example above would be for surveyors to collect responses at multiple types of grocery stores in multiple neighborhoods. Another good practice would be to obtain responses from people of varying ages, genders, and backgrounds, striving to have a survey population that reflects the demographics of the entire geographical area in question.
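One way to put that advice into practice is proportional quota sampling: allocate survey responses across neighborhoods in proportion to each one’s share of the area’s population. A minimal sketch, with invented neighborhood names and populations:

```python
# Proportional quota allocation for a survey.
# Neighborhood names and populations are invented for illustration.

population = {"Northside": 12_000, "Downtown": 30_000, "Eastview": 18_000}
total_responses = 500

total_pop = sum(population.values())
quotas = {name: round(total_responses * pop / total_pop)
          for name, pop in population.items()}
print(quotas)  # {'Northside': 100, 'Downtown': 250, 'Eastview': 150}
```

In general, rounding can leave the quotas a response or two off the target total, so real survey designs adjust the remainders; the point is simply that the sample is sized to mirror the population rather than whichever neighborhood was convenient.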
A recent example of a dataset that may not tell the whole story is COVID-19 case count data. The running count of coronavirus cases around the world measures the number of positive coronavirus tests – not the number of actual infections. This is an important distinction. How many infections were never added to the count because the person was asymptomatic or simply chose not to get tested? Can we still trust the data if people who lack access to testing sites, or who typically choose not to get tested, fall overwhelmingly into particular groups, whether by geographic location or cultural background?

There is a chance these groups are badly underrepresented in COVID data, not because of any prejudice on the part of the person collecting the information but because of the chosen data collection method. There is no way to measure whether a person is currently infected with coronavirus without a positive test; consequently, coronavirus cases must be measured by the closest proxy – positive tests – instead of the real thing. Because of this, the data can never exactly depict the real-world number of people affected. It can only give us a good estimate.
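The gap between the proxy and the real thing can be made concrete with a back-of-the-envelope adjustment: if only some fraction of infections is ever captured by a test (the ascertainment rate), the true count is roughly the proxy divided by that fraction. The rates below are assumptions chosen purely for illustration, not real epidemiological values – which is exactly why the result is an estimate, not the real number.

```python
# Back-of-the-envelope adjustment from the proxy (positive tests) to an
# estimated infection count. The ascertainment rates are assumptions
# chosen for illustration, not real epidemiological values.

positive_tests = 10_000

def estimated_infections(tests, ascertainment_rate):
    """If only `ascertainment_rate` of infections are ever tested,
    scale the proxy up to estimate the true count."""
    return tests / ascertainment_rate

# A range of assumed rates gives a range of estimates, not a single truth.
low = estimated_infections(positive_tests, 0.50)   # assume half of infections tested
high = estimated_infections(positive_tests, 0.25)  # assume a quarter tested
print(low, high)  # 20000.0 40000.0
```

The spread between the two estimates is the honest answer: the proxy bounds the real number, it doesn’t pin it down.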
What can we do about this bias?
There is no single answer to how society can correct every data fallacy caused by bias, intentional or not. However, we can control the datasets we personally create and consume. When creating data solutions for societal problems, we can take special care to think critically about how our data collection methods can unintentionally skew the data. When consuming data reports, presentations, or infographics, we can research and verify the source of the data. If you find a reference to data that seems biased or unreliable, make the flaws known. Hold organizations and individuals accountable for the data they present and the context that surrounds it, especially when the data is regarded as truth.
How Bear Cognition can help
At Bear Cognition, data is our area of expertise. We’re experienced in verifying, standardizing, transforming, and presenting data so that your business can make informed decisions. We can help your business identify and rectify data collection practices that may be harming the integrity of your data pipeline. Contact a member of our team if you would like to learn more about how we enable small- and medium-sized businesses to transform their data practices and gain actionable insights.