Data Bias and the Questions We Ask

As a data analyst, I spend my days at work turning data into actionable insights for clients. Read: translating numbers into real-world decisions. The ability of numbers and mathematics to quantify and describe the world we live in has always fascinated me, leading me to a degree in engineering and then on to a career in data. But what if the data we meticulously capture doesn’t actually tell the whole story? Even if we’ve double- and triple-checked the math, ensured our data pipelines are operating correctly, and verified the veracity of the data we’re collecting, there could still be a problem.
Recently, I read an article in The New York Times that completely changed how I think about the context of a dataset. This article pointed out that our data collection methods can fail to describe the real world properly, not because of improper analysis, but because of what we choose to measure.
The article prompted me to consider what we unintentionally leave out of datasets when we choose what to measure without considering the context. What or who is left out of the equation? Here, we come to the concept of biased data.
What is Bias in Data?
The idea of bias in data was not new to me when I started considering the effect of context on the data I work with every day. What I discovered, though, was that I didn’t really understand how biases could present themselves in data. Where do they come from? How do they quietly insert themselves into rows of information? After all, aren’t numbers supposed to be a straightforward, unfeeling source of truth?
While we’ll get into the ways bias creeps into datasets a bit further down, we first need to explore how bias presents itself.
Bias in AI and ML Models
If most people are familiar with bias in the world of data science and analytics at all, it’s because of artificial intelligence and machine learning models.
One prominent example is the controversy surrounding the Apple credit card. The card began attracting attention when people noticed that men were seemingly approved for much higher credit limits than women, even when they had similar incomes and credit scores. Was someone at Apple guilty of discrimination in their approval process? Not exactly. It’s more likely that the algorithm used to determine the risk of lending to certain borrowers was improperly trained, creating a bias against women.
Wait, what?
Let’s take a step back. What is an algorithm? Machine learning? Time for a brief overview.
A machine learning model can be thought of as a black box. The training process starts when a data scientist provides a dataset that can be split into two parts: the prompt (the inputs) and the desired output. Once the model has learned from this information, new data (just the prompt) is run through the model to see how accurately it can predict the desired outcome. This process is usually repeated until a desired level of accuracy is reached.
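To make that loop a little more concrete, here’s a minimal sketch in Python using scikit-learn. Everything in it is an assumption for illustration: the file name and the column names (income, credit_score, approved) describe a hypothetical lending dataset, not any real lender’s data or Apple’s actual model.

```python
# A minimal sketch of the train-and-check loop described above.
# The dataset, file name, and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Historical examples: the "prompt" (inputs) and the desired output (label).
df = pd.read_csv("loan_history.csv")      # hypothetical file of past decisions
X = df[["income", "credit_score"]]        # the prompt
y = df["approved"]                        # the desired output

# Hold back some data the model never sees during training, so we can
# measure how well it predicts outcomes for new applicants.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)               # the training step

# Check accuracy on the held-out data; in practice this gets repeated,
# tweaking the model or the data, until the accuracy is acceptable.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```

Notice that nothing in this loop asks whether the historical decisions in the file were fair to begin with. If they already skew against one group, the model will happily learn that pattern, which is exactly how the kind of bias described above creeps in.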