Fixing Root Causes of AI Bias, Part 4

Today we take on the final battle in dealing with AI bias: how can we fix its root causes? Since today’s discussion builds on the earlier parts of this series, it helps to be familiar with the three installments we’ve already published on AI bias.

  1. Acknowledging the Bias
  2. Inherited Biases from Data
  3. Emergent Biases in Operational Models

As previously discussed, the most common AI biases are those inherited from training data. Inherited biases can also be introduced when we preprocess the data before training. In contrast, emergent bias can arise during training even when the training data is unbiased. It is often created as data scientists make subjective modeling choices (e.g., regularization, hyperparameter tuning) to finalize their model.

Contrary to conventional belief, data science work requires a great deal of discretion. Hence, many standard data science procedures can introduce bias if applied haphazardly. But they also serve as natural points where we can inject counteracting biases to neutralize the biases we want to eliminate. With meticulous data scientists, inadvertently injected bias (whether inherited or emergent) should be minimal. What’s left are the inherited biases that already exist in the training data.
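To make the idea of injecting a counteracting bias concrete, here is a minimal sketch of one common approach: reweighting records so that each group’s weighted share matches a target share. The function name `counteracting_weights`, the group labels and the 50/50 target are all hypothetical choices for illustration; this is one possible technique, not a prescribed method.

```python
from collections import Counter

def counteracting_weights(groups, target_shares):
    """Return one weight per record so weighted group shares match target_shares.

    groups: list of group labels, one per training record (hypothetical data).
    target_shares: dict mapping each group label to its desired share (sums to 1).
    """
    counts = Counter(groups)
    n = len(groups)
    # weight = desired share / observed share, so overrepresented
    # groups are down-weighted and underrepresented groups up-weighted.
    return [target_shares[g] / (counts[g] / n) for g in groups]

# Hypothetical sample: group "a" makes up 80% of records, but the target is 50/50.
groups = ["a"] * 8 + ["b"] * 2
weights = counteracting_weights(groups, {"a": 0.5, "b": 0.5})

# After weighting, group "a" accounts for half of the total weight.
weighted_share_a = sum(w for g, w in zip(groups, weights) if g == "a") / sum(weights)
```

Weights like these could then be passed to any training procedure that accepts per-record weights. Choosing the target shares is itself a judgment call, which is exactly the kind of discretion described above.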

As data scientists, we often treat the data given to us as the raw input. We seldom question where the data came from and how it was collected. So where did these preexisting inherited biases come from, and can they be eliminated? In most practical situations, inherited biases have two major sources: how the data is captured and how it is generated.

Eliminating Data Capture Biases

Since all training data must be captured and collected at some point, preexisting inherited biases may result from biased data collection processes. Every data collection scheme is the product of a series of design choices, and a central tenet of choice architecture is that no design is neutral. Hence, every data collection process carries some bias.

For example, survey data typically exhibit some degree of self-selection bias. Such data overrepresent consumers who are already inclined to respond and diligent enough to do so, and underrepresent those who are less motivated or wary of sharing information.

The data may still be biased even when it is collected automatically through passive behavior measurements via sensors, devices or WiFi networks. We may inadvertently be selecting for the population that happens to be near the sensors we installed. And we may systematically underrepresent those who don’t have access to reliable WiFi or a mobile device.

Most data science work gives us no control over how the data is collected. In that case, it’s crucial to study the collection process so we understand the biases inherent in it. We need to acknowledge these biases, as discussed in the first article of this series. In practice, this means continually monitoring the level of bias in the captured raw data so that changes in bias do not adversely affect the trained model.
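As a rough illustration of what such monitoring might look like, the sketch below compares each incoming batch of raw data against a reference population and flags groups whose share drifts too far. The function names, the "urban"/"rural" labels, the reference shares and the 0.10 tolerance are all assumptions made for the example, not values from any real system.

```python
from collections import Counter

def representation_gaps(groups, reference_shares):
    """Observed share minus reference share for each group in reference_shares."""
    n = len(groups)
    if n == 0:
        return {}  # nothing to measure in an empty batch
    counts = Counter(groups)
    return {g: counts.get(g, 0) / n - share for g, share in reference_shares.items()}

def bias_alerts(batches, reference_shares, tolerance=0.10):
    """Yield (batch_index, group, gap) whenever a gap exceeds the tolerance."""
    for i, batch in enumerate(batches):
        for group, gap in representation_gaps(batch, reference_shares).items():
            if abs(gap) > tolerance:
                yield i, group, gap

# Hypothetical reference population and two batches of captured records.
reference = {"urban": 0.6, "rural": 0.4}
batches = [
    ["urban"] * 6 + ["rural"] * 4,  # matches the reference population
    ["urban"] * 9 + ["rural"] * 1,  # rural records are underrepresented
]
alerts = list(bias_alerts(batches, reference))
```

Here only the second batch raises alerts, one for each group. In a real pipeline the reference shares would have to come from a source we trust more than the captured data itself, such as census figures.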

In the rarer situations where we can influence the data collection process, we should aim to capture more data, along with more complete metadata that provides the context for interpreting it. In the age of big data and ubiquitous sensors, the default strategy should be to collect everything possible. Debating what to collect often ends up being more costly in lost time and development velocity.

Having detailed demographic data can help reveal the biases inherent to the data collection process so we can better understand them. It will facilitate the effective monitoring of how data-collection biases change over time. Most importantly, it can also help us redesign the data-capturing process to reduce bias during data collection. As with any design, redesigning the data-capturing method is an iterative exercise. With proper bias monitoring and rapid iteration, data-collection biases can be progressively minimized over time.

Related Article: Dealing With AI Biases, Part 1: Acknowledging the Bias

Unconscious Human Biases in Data Generation

Once the data-capture mechanism has been iteratively refined, the only remaining place where inherited biases can be created is data generation. Most of the data used to train machine-learning (ML) models are generated by humans. These data are merely the result of past human decisions and behaviors. Therefore, the inherent biases in the training data originate from us humans. The bias in data is simply a reflection of our own bias.

In practice, however, AI bias often appears to be more extreme and therefore more noticeable than our own bias. This is because the slightest bias in our decisions can often be magnified dramatically through the ML training process. Our minute biases are accentuated when machines learn from large amounts of data very rapidly. This amplification is a result of both the high speed of learning and the consolidation of lots of data. Despite this, the AI bias problem is fundamentally a human bias problem.
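A toy example can show how a small human bias turns into a large model bias. The sketch below illustrates just one possible amplification mechanism, a model that learns the most common past decision for each group, and is not necessarily the mechanism at work in any given system. The historical approval rates (55% versus 45%) and the group labels are invented for illustration.

```python
from collections import Counter, defaultdict

def fit_majority_rule(records):
    """Learn, per group, the most common historical decision.

    When group is the only feature, this is the rule that makes the
    fewest errors on the training data.
    """
    by_group = defaultdict(Counter)
    for group, decision in records:
        by_group[group][decision] += 1
    return {g: c.most_common(1)[0][0] for g, c in by_group.items()}

def approval_rate(decisions):
    return sum(decisions) / len(decisions)

# Hypothetical history: humans approved 55% of group "a" and 45% of group "b".
history = ([("a", 1)] * 55 + [("a", 0)] * 45 +
           [("b", 1)] * 45 + [("b", 0)] * 55)
rule = fit_majority_rule(history)

# Gap in approval rates between the groups, before and after learning.
human_gap = (approval_rate([d for g, d in history if g == "a"])
             - approval_rate([d for g, d in history if g == "b"]))
model_gap = rule["a"] - rule["b"]
```

In this toy setup, a 10-point gap in human decisions becomes a 100-point gap in the learned rule: the model approves everyone in group "a" and no one in group "b". Real models have more features and softer decision boundaries, so the effect is usually less extreme, but the direction is the same.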

Related Article: Dealing With AI Biases, Part 2: Inherited Biases From Data

Solving for Bias in Our Data

Now that we have a good grasp of the root cause of the AI bias problem, how can we solve it?
