OpenAI’s GPT-3 has been getting a lot of buzz recently, for good reason. Pre-trained neural network language models like GPT-3, as well as Facebook’s XLM and XLM-R, are giant breakthroughs for language understanding in AI. However, these algorithms can’t be left unchecked. Human-in-the-loop AI is still crucial to improving language models for several important reasons:
- Helping machines understand context
- Mitigating bias and other ethical concerns
- Grounding them in reality
Let’s look behind the GPT-3 hype and see why we still need humans for language-based machine learning.
Humans Remain the Best Way to Understand Context
GPT-3 leverages a whopping 175 billion parameters (versus its predecessor GPT-2’s 1.5 billion) and around 500 billion tokens of (mostly web-crawled) training data, which allow it to learn highly complex representations of text and perform relatively well on a variety of language tasks. The model also relies on “few-shot learning”: given just a small number of examples of a specific task in its prompt, it can perform well on that task without manual fine-tuning or a large corpus of task-specific training data.
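To make few-shot learning concrete, here is a minimal Python sketch of how a few-shot prompt might be assembled. The function name `build_few_shot_prompt`, the `English:`/`French:` labels, and the example pairs are illustrative assumptions, not OpenAI’s format or API. No model is called; the resulting string is what would be sent to the model as its prompt.

```python
def build_few_shot_prompt(examples, query, source_label="English", target_label="French"):
    """Assemble a few-shot prompt: a handful of solved examples, then the unsolved query.

    All names and labels here are hypothetical; no model is called.
    """
    blocks = []
    for source, target in examples:
        blocks.append(f"{source_label}: {source}\n{target_label}: {target}")
    # The final block leaves the target empty so the model completes it.
    blocks.append(f"{source_label}: {query}\n{target_label}:")
    return "\n\n".join(blocks)


# Illustrative demonstration pairs (made up for this sketch).
examples = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
]
prompt = build_few_shot_prompt(examples, "Where is the train station?")
print(prompt)
```

The point of the sketch is that the “training” for the new task lives entirely in the prompt text: swapping the example pairs changes the task without touching the model’s weights.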
So far, with few-shot learning, beta testers have had striking results using GPT-3 for many tasks, such as converting natural language into code, writing essays, creating chatbots for historical figures, answering complex medical questions, and even machine translation.
Despite being trained on predominantly English data, the researchers behind GPT-3 found that the model can translate from French, German, and Romanian into English with surprising accuracy, and in some cases outperforms existing unsupervised machine translation systems (those whose training data is not composed of explicit pairs of corresponding sentences in both languages) by a large margin. Translation is a serendipitous side effect of training such a large, powerful model, and it would be convenient to use a single AI system like GPT-3 for several tasks at once, such as answering and translating a customer’s question simultaneously. However, there is still a long way to go before we can comfortably rely on a model like this to provide customer-facing responses.
OpenAI’s CEO Sam Altman said on Twitter that despite the hype, GPT-3 “still has serious weaknesses and sometimes makes very silly mistakes.” As The Verge also points out, GPT-3 experiments are still riddled with errors, some of them more egregious than others. Users don’t always get desirable answers on the first try, and therefore need to adjust their prompts to get correct answers. NLP systems, and machine learning algorithms in general, cannot be expected to be 100% accurate. Humans are still required to differentiate acceptable responses from the unacceptable.
Part of determining what is acceptable is making judgments related to pragmatics, which is something humans excel at. Pragmatics, or the study of how language is interpreted in context, tells us that if we ask a friend, “Do you like to cook?” and her response is “I like to eat,” she probably doesn’t enjoy cooking. Pragmatics is also the reason we would say, “Could you please provide your payment details?” to a customer rather than, “Give me your credit card number,” even though the two sentences have the same intent.
In settings where there’s little margin for error, such as real-time customer service chats, humans need to occasionally correct machines’ mistakes. Local dialects and phrases can easily be misinterpreted by machine translation. It’s also critical that a translation system adheres to localized cultural norms — for example, speaking formally in a business setting in Germany or Japan. So, for now, we still need humans to process the nuances of natural language.
GPT-3 Is Impressive, but Still Biased
Going beyond questions of pragmatics, humans also need to be involved in the development of these language models for ethical reasons. We know AI systems are often biased, and GPT-3 is no exception. In the GPT-3 paper, the authors conduct a preliminary analysis of the model’s shortcomings around fairness, bias, and representation, running experiments related to the model’s perception of gender, race, and religion.
After giving the model prompts such as “He was very”, “She was very”, “He would be described as”, and so on, the authors generated many samples of text and looked at the most common adjectives and adverbs present for each gender. They note that females are more often described with words related to their appearance (“beautiful,” “gorgeous,” “petite”), whereas males are described with more varied terms (“personable,” “large,” “lazy”). In examining the model’s “understanding” of race and religion, the authors conclude that “internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data.”
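The counting step of such a probe can be sketched in a few lines of Python. This is a simplified illustration, not the authors’ code: the completions below are invented placeholders standing in for model output, the `DESCRIPTORS` word list is a hand-picked assumption (a real analysis would need part-of-speech tagging to find adjectives and adverbs), and `count_descriptors` is a hypothetical helper.

```python
import re
from collections import Counter

# Hypothetical word list; a real analysis would tag parts of speech instead.
DESCRIPTORS = {"beautiful", "gorgeous", "petite", "personable", "large", "lazy"}


def count_descriptors(completions):
    """Count how often each descriptor word appears across generated completions."""
    counts = Counter()
    for text in completions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in DESCRIPTORS:
                counts[word] += 1
    return counts


# Invented placeholder completions, NOT real GPT-3 output.
samples = {
    "She was very": ["She was very beautiful and petite.", "She was very gorgeous."],
    "He was very": ["He was very large and lazy.", "He was very personable."],
}

for prompt, completions in samples.items():
    print(prompt, "->", count_descriptors(completions).most_common())
```

With hundreds of real samples per prompt, comparing the resulting counts across prompts is what would surface the kind of skew the authors describe.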
None of this is novel or surprising, but investigating, identifying, and measuring biases in AI systems as the GPT-3 authors did are the necessary first steps toward the elimination of these biases.
Making tangible progress in mitigating these biases and their impact is where we need humans, and it involves more than having them correct errors, augment datasets, and retrain models. Researchers from UMass Amherst and Microsoft analyzed nearly 150 papers related to “bias” in NLP and found that many have vague motivations and lack normative reasoning: they do not explicitly state how, why, and to whom the “biases” are harmful. To understand the real impact of biased AI systems, they argue, we must engage with literature from outside of NLP, such as sociolinguistics and sociology, that “explores the relationship between language and social hierarchies,” as well as engage with communities whose lives are affected by NLP systems. After all, language is a human phenomenon, and as practitioners of NLP, we should think not just about how to avoid machine-generated text sounding offensive, but question how our models interact with and impact the societies in which we live.
In addition to bias, major concerns continue to surface about the model’s potential for automated toxic language generation and fake news propagation, as well as the environmental impact of the raw computing power needed to build larger and larger machine learning models. Here the need for humans isn’t an issue of model performance, but of ethics. Who, if not humans, will ensure such technology is used responsibly?
GPT-3 Can’t Say, “I Don’t Know”
If the goal is to train AI to match human intelligence, or at least perfectly mimic human language, perhaps the largest missing piece is the fact that language models trained solely on text have no grounding in the real world (although this is an active research area). They don’t truly “know” what they’re saying, and their “knowledge” is limited to the text they are trained on. So, while GPT-3 can accurately tell you who the U.S. President was in 1955, it doesn’t know that a toaster is heavier than a pencil. It also thinks the correct answer to “How many rainbows does it take to jump from Hawaii to seventeen?” is two. Whether or not machines can infer meaning from pure text is up for debate, but these examples suggest that the answer is no — at least for now.
Future versions of GPT-3 will likely improve with the addition of more parameters, but it’s hard to know exactly how many we’ll need to crack the answer to Life, the Universe, and Everything (Geoffrey Hinton joked 4.398 trillion, or 2 to the power of 42).
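As a quick check of that arithmetic, the Python snippet below computes 2 to the power of 42 and compares it with GPT-3’s 175 billion parameters. The variable names are just for illustration.

```python
hinton_joke = 2 ** 42            # 2 to the power of 42
gpt3_parameters = 175 * 10 ** 9  # 175 billion, as stated earlier in the article

print(f"{hinton_joke:,}")                       # 4,398,046,511,104
print(round(hinton_joke / 1e12, 3))             # 4.398 (trillion)
print(round(hinton_joke / gpt3_parameters, 1))  # 25.1
```

In other words, the joke figure is roughly 25 times the size of GPT-3.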