We’re no strangers to AI, and many of the technologies we use daily lean on it to provide us with good experiences — from basic recommendation systems that guide our purchases to more complex facial recognition in every picture we post online. In the past few years, we’ve seen the rise of deep learning and neural networks, enhancing performance at an astounding rate. Natural Language Processing was no exception, and roughly three years ago, researchers successfully applied these models to Machine Translation. Word soon spread when both Google and Microsoft claimed to have achieved human parity.

The catch? These models require data. Lots and lots of it. They also demand exceptionally large computational resources, provided by specialized units called GPUs, which consume far more energy than traditional CPUs. Data centers alone were estimated to account for 1% of global electricity demand in 2017, consuming around 195 TWh, according to a report by the International Energy Agency. And although the same report predicts that the growing demand for computation and the increase in data center workloads will be countered by efficiency improvements across a number of components of these units, we shouldn’t ignore the energy footprint of current deep learning techniques.

Can there be such a thing as responsible AI?

AI’s carbon footprint

On one hand, AI is driving itself to be more efficient than ever. Take DeepMind and Huawei, pioneering data center cooling technologies, or Google, which created the TPU, a custom chip that enables businesses to train their models faster and more efficiently.

But the industry is also part of the problem. In a comparative study, OpenAI pointed out that the amount of compute used in the largest training runs was doubling every 3.5 months (for a sense of scale, Moore’s Law had an 18-month doubling period). And these numbers are starting to raise some eyebrows. Just last August, at ACL 2019 in Florence, researcher Emma Strubell presented a paper called Energy and Policy Considerations for Deep Learning in NLP, which was received with a bit of controversy.
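To make that doubling time concrete, here is a small back-of-the-envelope calculation (my own illustrative arithmetic, not a figure from the OpenAI study): compounding a 3.5-month doubling over a single year versus Moore’s Law’s 18-month period.

```python
# Illustrative arithmetic: how much compute grows in one year under a
# 3.5-month doubling time, compared with Moore's Law's 18-month period.

def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplier after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_period_months)

one_year_ai = growth_factor(12, 3.5)    # roughly 10.8x in a single year
one_year_moore = growth_factor(12, 18)  # roughly 1.6x in the same year
print(f"AI trend: {one_year_ai:.1f}x, Moore's Law: {one_year_moore:.1f}x")
```

A year of the AI trend buys roughly an order of magnitude more compute than a year of Moore’s Law, which is why the curve attracted attention so quickly.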

In her study, she presented the energy and carbon costs of training different state-of-the-art models, and compared them to, say, the footprint of one passenger travelling from New York to San Francisco by plane, the average lifetime of a car, or even the average human life. In particular, Strubell points out the impact of massive hyperparameter tuning and architecture search, exploration techniques that, at their limit, amount to a brute-force approach to finding the best model for a specific task. These costs surpass all others by multiple orders of magnitude.
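The accounting behind such estimates is simple in spirit. The sketch below is my own simplified version of that kind of calculation; the 1.58 power usage effectiveness (data-center overhead) multiplier and the 0.954 lb CO2e per kWh conversion are the U.S. averages Strubell’s paper assumes, not universal constants.

```python
# Rough sketch of training-emissions accounting, under assumed averages:
# a PUE of 1.58 for data-center overhead and 0.954 lb CO2e per kWh of
# electricity. Both values are assumptions of this sketch.

PUE = 1.58           # data-center overhead multiplier (assumed average)
CO2_PER_KWH = 0.954  # lb CO2e emitted per kWh (assumed U.S. grid average)

def training_co2_lbs(avg_power_watts: float, hours: float, n_gpus: int = 1) -> float:
    """Estimate lb CO2e for a training run from average draw per GPU."""
    kwh = avg_power_watts * n_gpus * hours / 1000  # raw energy drawn
    return kwh * PUE * CO2_PER_KWH                 # add overhead, convert

# e.g. eight GPUs drawing ~250 W each for a month (720 hours):
print(round(training_co2_lbs(250, 720, 8)))
```

Even this toy scenario lands in the low thousands of pounds of CO2e, and it ignores the many discarded runs that hyperparameter search and architecture search pile on top.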

Even if we consider that we could just shift to hubs powered mostly or fully by renewable energies — which we know is not the case for now — these numbers are definitely an eye-opener.

How did we get here?

Most NLP tasks started to benefit from classic Recurrent Neural Networks over the last decade. The “recurrent” comes from the way these models work: they consume one word at a time, generating a state or an output required for the task, and feeding it back into the model to help generate the next one. This is an expensive mechanism which, compared with typical models used in other fields, can be slow to train, especially if we allow very long sequences.
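That recurrence can be sketched in a few lines. This is a toy illustration with invented dimensions and random weights, not any particular published architecture; the point is the strict word-by-word dependency.

```python
import numpy as np

# Toy recurrence: each word vector is combined with the previous hidden
# state to produce the next one. Step t cannot start before step t-1
# finishes, and this sequential dependency is what makes recurrent
# models slow to train on long sequences.

rng = np.random.default_rng(0)
dim = 4
W_in = rng.standard_normal((dim, dim)) * 0.1   # input weights (toy values)
W_rec = rng.standard_normal((dim, dim)) * 0.1  # recurrent weights

def rnn_step(state, word_vec):
    """One recurrent step: the new state depends on the previous state."""
    return np.tanh(W_in @ word_vec + W_rec @ state)

state = np.zeros(dim)
sentence = [rng.standard_normal(dim) for _ in range(6)]  # six "words"
for word in sentence:            # strictly one word at a time
    state = rnn_step(state, word)
print(state.shape)               # a single summary state for the sentence
```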

Then, in machine translation, a new mechanism came along — “attention.” This new method provided researchers with a tool to better understand the outputs, by letting them know which source words a model was looking at to generate each of the target words. In particular, attention did not need to consume the input sequentially, and so it rapidly grew into a number of methods and applications. It wasn’t long until the community decided it was all it needed, and so we saw the rise of Transformers, which, instead of relying on recurrence, build on top of this mechanism and combine it with a simpler non-recurrent neural network. These models, even though they were bigger, achieved better results on a number of tasks with a significantly reduced number of FLOPs (floating-point operations, a common measure of computational cost on GPUs), which, resource-wise, was actually positive.
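The core of attention is small enough to sketch directly. Again a toy with invented sizes, not the full multi-head machinery: one target-side query scores every source word at once, and the softmaxed scores are exactly the “which words is the model looking at” weights described above.

```python
import numpy as np

# Toy attention: a target-side query scores all source words at once.
# There is no sequential loop over the source, which is the property
# Transformers exploit to parallelize training.

def attention_weights(query, keys):
    """Softmax over dot-product scores between one query and all keys."""
    scores = keys @ query                # one score per source word
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(1)
keys = rng.standard_normal((5, 8))   # five source words, 8-dim vectors
query = rng.standard_normal(8)       # one target position
w = attention_weights(query, keys)
print(w.round(2), w.sum())           # weights over source words; sum to 1
```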

Finally, researchers turned to pretraining some of the basic building blocks of NLP models. They did this by gathering large amounts of written text which, instead of requiring labels or parallel sentences in other languages, could be used directly by unsupervised methods. By just looking at the text and the natural way sentences are built and words appear together, they were able to train better representations of words. Instead of solving one task directly and letting the model learn everything required, these representations could be plugged directly into other models and used for downstream tasks. This is what is called language model pretraining, and with whimsical names such as ELMo, BERT, ERNIE 2.0 and RoBERTa (and the less amusing GPT and XLNet), these models started to dominate language modeling and language generation tasks, requiring large amounts of data and, in some cases, large amounts of compute.
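The “plug it in” idea can be made concrete with a deliberately simplified sketch. Real pretrained models like ELMo and BERT produce context-dependent representations; the frozen word table below (with invented vectors) is closer to earlier static embeddings, but it shows the reuse pattern: the downstream model receives vectors it never had to learn.

```python
import numpy as np

# Simplified reuse pattern: a word table trained once on raw text is
# used as a frozen lookup for a downstream task, so the task model does
# not learn word representations from scratch. The tiny table and its
# vectors are invented for illustration.

pretrained = {
    "the": np.array([0.1, 0.3]),
    "cat": np.array([0.9, 0.2]),
    "sat": np.array([0.4, 0.8]),
}

def embed(sentence: str):
    """Map words to their pretrained vectors (frozen, not re-trained)."""
    return np.stack([pretrained[w] for w in sentence.split()])

features = embed("the cat sat")   # ready to feed any downstream model
print(features.shape)             # one vector per word
```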

With these new models, and the pressure to show improvements fast and claim the title of state-of-the-art, the number of papers at the last couple of conferences whose results are achieved with a massive amount of resources started to rise.

Looking at most papers (excluding the ones that don’t report the resources used), it’s becoming increasingly common to see training runs spanning dozens of GPUs across multiple days or even weeks. GPT, for example, required eight GPUs training for an entire month. GPT-2, its successor, has 10 times as many parameters and was trained on 10 times as much data. And one recent line of research ran several experiments to achieve a moderate improvement, with total training time amounting to more than three months on 512 GPUs.

Many researchers are debating the relevance of state-of-the-art results when they’re achieved solely through brute force, and are discussing the implications of leaderboards that only look at the one metric being optimized. It’s becoming less and less clear whether these improvements come from the methods or just from the sheer amount of computing power and resources. And if we can’t tell where the improvements are coming from, it’s fair to question the process through which these papers get picked for leading conferences.

A reproducibility crisis

Even setting aside the energy costs and footprint, these models present other problems. Massive resources are not only expensive from an energy point of view. They’re expensive, full stop. And typically, only big research groups or companies have the capital to run experiments of this kind.

There are other barriers besides the amount of resources, and researchers have criticized this reproducibility crisis, pointing out a series of troubling trends, among them the failure to distinguish between improvements coming from architecture as opposed to tuning. Some researchers have advocated for better reporting, proposing budget reporting and reproducibility checklists to increase transparency. NeurIPS, for example, has started asking researchers to submit a reproducibility checklist along with their papers.

What these groups claim is that these models are reusable. That, when open-sourced, as many companies now do, they could simply be plugged into downstream experiments or tasks and used as they are, and smaller companies wouldn’t have to reproduce them on their side. But things are never that simple. These models aren’t foolproof, and we’re all familiar with the shortcomings of AI, particularly when it comes to bias. As my colleague Christine recently wrote, we need to think about the data we’re feeding our models, which can reinforce our biases and “lead to discrimination in hiring processes, loan application, and even in the criminal justice system.” So it’s pretty bold to assume that these models will never need to be revisited.

Towards responsible AI

When we talk about AI, most people imagine either a utopia, or an apocalyptic scenario. Usually the latter. But given that actual Artificial Intelligence is still far from being cracked, we might have more pressing concerns. As AI researchers, we need to drive this discussion, and think about the impact of our work right now. We need to think about the carbon footprint of the models we’re training, especially in a time where millions of young people are striking and pressuring our governments to fight global warming.

To Strubell, we can become more responsible and improve equity in NLP research through a series of efforts: by prioritizing computationally efficient hardware and algorithms, including better hyperparameter tuning techniques, and by reporting the computational budget involved, an essential part of untangling these state-of-the-art claims.

But there are other things we could do. We could place greater focus on research directions where efficiency is naturally privileged, such as data selection and cleaning, or low-resource scenarios, among others. And maybe it’s time for major conferences to take the lead in enforcing these values, for example by weighing a model’s footprint in the leaderboards.

There is no quick fix, but many of these small changes might help. And just the simple fact that these topics are getting more and more attention is a positive indicator that we, as a community, want to move towards better solutions.