Would you trust AI with your life?

There’s a somewhat famous story in AI research circles about a neural network model that was trained to distinguish between wolves and huskies. The model seemed to identify them successfully, achieving high accuracy on images that weren’t used during training.

However, it soon became apparent that something was going wrong: some very clear images were being misclassified. When the researchers looked into why the network was making such gross mistakes, they discovered that the model had learned to classify an image based on whether there was snow in it. All the images of wolves used in training had snow in the background, while those of huskies did not. The model wasn’t recognizing animals at all; it was detecting snow. Unsurprisingly, it was failing.

Now, imagine we want to be able to help catch stray huskies in the wild, so we somehow fix the model, and teach it to correctly distinguish between wolves and huskies, regardless of the background color. We embed it in devices with cameras, which we then share among volunteers and friends. We trust our model not to say it’s a husky when it’s actually a wolf, but how confident are we that nothing else will break the model? What will happen if the model sees a coyote? Will it classify it as a wolf, based on the size? What about a fox? A bear? Do we risk telling our friends to approach, hoping they realize the stray is actually a bear before getting out of the car with a nice juicy steak?

Machine Learning what?

Machine Learning techniques, most notably Neural Networks, have achieved tremendous success with a multitude of problems, including notoriously difficult ones like translation and speech recognition. Their usefulness is undeniable, and as such they have become ubiquitous in a variety of applications.

Despite a series of breakthroughs in the past 12 years, the current practice in the AI research community is to do incremental research. Improvements to AI systems are achieved by using larger models and more data, as my colleague Catarina explained in a previous article. Gains in performance are fractional, and the existence of leaderboards has encouraged the practice.

These leaderboards offer public datasets for several Natural Language Processing (NLP) tasks, like Question Answering, Sentiment Analysis, and Semantic Similarity. This is actually a great initiative, as it encourages researchers to build comparable systems. However, it also pushes researchers to tailor their systems too closely to these datasets. Not that this didn’t happen before, but amid all the hype surrounding AI, it has gotten way out of hand.

As in the wolf vs. husky conundrum, the problem is that more and more models achieve higher performance by learning idiosyncrasies in the data. Neural models are black boxes, which makes it hard to confirm whether a model is solving the task or merely the dataset. Too few people seem to worry about this, so these models get prematurely applied to real-life use cases, and by the time someone notices that the snow is a factor, the damage is done.

There are two main causes for these over-optimization issues.

1. Optimizing for the wrong thing

Models are optimized for a metric that is easy and fast to compute, and which correlates, to some degree, to the desired goal (or “measure” of success). The problem of mapping a desired goal to an easily measurable quantity has been acknowledged for decades in several disciplines, most notably in 1975, when the economist Charles Goodhart published a paper on economic regulation that popularized what became known as Goodhart’s Law:

“When a measure becomes a target, it ceases to be a good measure.”

Less catchily: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” Regardless of the formulation, what the law implies is that, whenever our performance is measured in terms of some number, we optimize for that number. In other words, we game the metric.

[Image: Goodhart’s Law, Sketchplanations]

Neural network models end up doing the same thing. The metric they are optimized for is just a proxy for the real measure of performance, and there is no guarantee that optimizing the proxy will translate into the expected performance in the real world.

Neural Machine Translation models, for example, are optimized for BLEU, a metric that measures how much the model’s output overlaps, n-gram by n-gram, with a reference translation. In the real world, what matters is a fluent and accurate translation, even if it is phrased differently from the reference.
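To see how the proxy diverges from the goal, here is a minimal sketch of the clipped n-gram precision at the core of BLEU (simplified: one reference, no brevity penalty, no smoothing; the example sentences are made up). A perfectly fluent paraphrase scores far below an exact copy of the reference:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core ingredient of BLEU
    (simplified: single reference, no brevity penalty)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Each candidate n-gram is credited at most as often as it appears
    # in the reference ("clipping").
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

reference  = "the cat is on the mat"
literal    = "the cat is on the mat"    # exact copy of the reference
paraphrase = "a cat sits upon the mat"  # fluent and accurate, different words

print(ngram_precision(literal, reference, 1))     # 1.0
print(ngram_precision(paraphrase, reference, 1))  # 0.5: half the words "miss"
```

Both outputs are acceptable translations to a human, yet the metric heavily rewards the one that copies the reference wording, which is exactly the gap Goodhart’s Law predicts a model will exploit.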

2. Optimizing with unrepresentative data

As in the snow detection story, powerful models can achieve higher (metric) performance simply by learning idiosyncrasies in the training data. But real-world data can differ, lacking the same idiosyncrasies or the same distribution of terms, classes, backgrounds, and so on. When deployed in real-world scenarios, such models will inevitably be biased towards the representation they learned from the training data. A wolf in a green landscape will easily become a husky.

When unrepresentative data is used for training, sometimes with no consideration of how it was collected or where it came from, applying the model to situations different from the ones it knows becomes very problematic. The model is biased. And while this implicitly learned bias may not seem so problematic in this particular situation (unless, of course, someone gets mauled), when it happens with loan applications, housing tax credits, or even job interviews, the implications are scary to think about.

Last year, California’s legislature decided that there was too much human bias in the setting of cash bail amounts. With the argument of removing this bias, it passed a law mandating the use of an algorithm to assess the risk of a person failing to appear in court, on the assumption that the algorithm would provide an objective view. But where does the training data for this algorithm come from? Most likely from historical records, which contain the very same bias the algorithm is supposed to remove.

Into the wild

Neural networks are confident in their predictions even when those predictions make no sense at all.

Even after fixing the wolf vs. husky model, we still have a problem: what will it predict when fed an image of a coyote, a fox, or even a bear?

We know our wolf vs. husky model doesn’t know a bear when it sees one; it will try to classify it as either a wolf or a husky. But the deeper problem with neural models in general is that the probability they assign to an output does not reflect how confident they really are in that prediction. Probabilities cannot be taken as confidence estimates. Neural networks are confident in their predictions even when those predictions make no sense, and even when the input is substantially different from anything the model saw during training. Shown an image of a bear, the model can output anything from 100% wolf to 100% husky. Wouldn’t it be a relief if it output 50% / 50% instead? We could then take every precaution before getting any closer.
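The mechanism behind this is easy to see with the softmax function that turns a classifier’s raw scores into probabilities. The logits below are invented for illustration, not from a real model, but they show the structural issue: a two-class model has no “I don’t know” output, so even an out-of-distribution input gets mapped to a confident-looking distribution.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    exps = [math.exp(x - max(logits)) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from a two-class wolf-vs-husky model.
husky_logits = [4.2, -1.3]  # a clear husky: high confidence is fine here
bear_logits  = [3.9, -0.8]  # a bear: the model still must pick one of two

print(softmax(husky_logits))  # ~[0.996, 0.004]
print(softmax(bear_logits))   # still ~[0.991, 0.009]: confidently wrong
```

The softmax output always sums to 1 over the known classes, so “99% wolf” can mean either “this is clearly a wolf” or merely “this looks more wolf-like than husky-like”, which is precisely why these numbers cannot be read as confidence.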

What we would like is for our models to show high uncertainty when dealing with data in regions they have not seen before. “We want them to ‘fail gracefully’ when used in production,” as Anant Jain wrote in his post on Medium. That will allow us to trust our model’s predictions.

Unfortunately, the current practice is to trust a model based on the performance it achieved under a single metric over an unrepresentative dataset.

Is there hope?

None of these problems can be easily solved. They require effort and time from researchers, engineers, regulators, decision- and policy-makers. But there is hope.

To avoid overfitting to a single proxy metric that will not reflect the real, desired measure, we can train and evaluate models using complementary metrics. The best model should be the one that performs well on all of them. Additionally, we should put considerable effort into periodically measuring performance in the real world, even if only on a partial set of examples (since this usually requires manual human work).
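One simple way to act on complementary metrics is to select the model with the strongest worst-case score rather than the best score on any single metric. The sketch below is purely illustrative: the model names, metric names, and numbers are invented, and the scores are assumed to be comparable on a common 0-to-1 scale.

```python
# Hypothetical per-model scores on complementary metrics
# (all in [0, 1], higher is better; numbers are made up).
scores = {
    "model_a": {"bleu": 0.41, "chrf": 0.55, "human_eval": 0.60},
    "model_b": {"bleu": 0.45, "chrf": 0.38, "human_eval": 0.40},
    "model_c": {"bleu": 0.39, "chrf": 0.54, "human_eval": 0.66},
}

def best_on_single(metric):
    """Leaderboard-style selection: maximize one proxy metric."""
    return max(scores, key=lambda m: scores[m][metric])

def best_worst_case():
    """Prefer the model whose *weakest* metric is strongest."""
    return max(scores, key=lambda m: min(scores[m].values()))

print(best_on_single("bleu"))  # model_b tops the leaderboard metric...
print(best_worst_case())       # ...but model_a is the most balanced
```

The max-min rule is only one possible aggregation; the point is that a model gaming one proxy metric cannot game all of them at once.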

To reduce implicit bias as much as possible, more representative training data will obviously help. However, knowing which data is more representative is itself a challenge. What would really help is models that are explainable, or that can output an explanation for their predictions. That is exactly what would let us immediately pinpoint the wolf-snow bias.

Finally, being able to trust what models predict would allow for much safer applications of AI. Humans could intervene whenever a certain confidence threshold was not reached, letting models do what they do best: handle the data they are truly tailored to.
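In practice this amounts to a simple routing rule around the model. This is a minimal sketch, not any real production pipeline: the threshold, the toy model, and its outputs are all hypothetical, and (as discussed above) a raw probability is an imperfect stand-in for a properly calibrated confidence estimate.

```python
CONFIDENCE_THRESHOLD = 0.9  # hypothetical cut-off; tune per application

def handle(image, model):
    """Act on a prediction automatically only if the model's confidence
    clears the threshold; otherwise route the case to a human reviewer."""
    label, prob = model(image)
    if prob < CONFIDENCE_THRESHOLD:
        return ("human_review", label)
    return ("auto", label)

# A stand-in model that is unsure about an out-of-distribution input.
def toy_model(image):
    return ("wolf", 0.55) if image == "bear.jpg" else ("husky", 0.97)

print(handle("dog.jpg", toy_model))   # ('auto', 'husky')
print(handle("bear.jpg", toy_model))  # ('human_review', 'wolf')
```

The design choice is deliberately conservative: the model handles the easy, in-distribution cases, and anything it is unsure about falls through to a person, which is the “fail gracefully” behavior described above.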

At Unbabel, we’re constantly coming across huskies, wolves, and bears. But by having humans in the loop, fixing our models’ mistakes and evaluating the true quality of what we deliver, we are able to keep improving our models and also how we automatically evaluate them.

Paraphrasing our VP of Linguistic Technologies, Alon Lavie:

The most important practical [fact] for us is that experimental results we obtain do not generalize as we assume and are actually not representative of our translation scenario in practice. This happens all the time.

AI is here to stay, and we have already reaped a lot of benefits from it. But we are reaching a tipping point where neural networks are used so widely that we need to be more responsible in how we train them. We’re seeing more and more wolves, the snow is melting, and our friends are out there. Maybe we should focus on fixing what’s broken before it’s too late.