There’s this saying about how if you give the same text to 10 different translators, they will render 10 different, equally valid translations. After all, language is highly subjective, so when it comes to translation, there’s not one universally accepted answer. And so, naturally, linguists have very strong opinions on which translation best expresses the original meaning of the message. 

Since we’re looking for the highest translation quality, this poses a big challenge to us. It turns out the same applies to the annotation of translation errors. Annotators don’t always agree, not because a translation error has been categorized wrongly, but because the same error can be categorized differently depending on the angle you look at it from. So how can we ever hope to train our models to be accurate when even we can’t agree on what’s wrong? And could this diversity of opinions be a good thing?

Supervised learning needs examples

First, we need to take a step back: why are we interested in what annotators have to say?

The reason is simple: currently, almost all the successful AI methods are supervised methods. This means they learn from examples. For image recognition, examples are images annotated with bounding boxes with labels (this part of the image is a cat, this part of the image is a dog, and so on), for speech recognition the examples are speech recordings with their text transcription, and for machine translation this means sentences with example translations.

Some tasks require the classification of words or entire sentences into fixed classes — the challenge with named entity recognition (NER) is to recognize parts of the sentence that indicate certain classes of interest like location, name, date.

An example of the type of data used and produced in NER: LOC is location, ORG is organization and NORP is nationalities or religious or political groups. This particular example is the prediction of spaCy’s large English model on a news article from Eater. Note that an entity can consist of multiple words, and that the last instance of Corona was mistakenly tagged as a location.

This labeled data is the bedrock of any machine learning application that is successful in the real world, because these examples don’t only train models — they also evaluate whether the models have really learned the task at hand. After all, we do not simply want them to copy the examples they were shown, we want them to generalize to the unseen cases. For this reason, we always keep a number of examples aside, used to test the models later on.

The important thing to remember is that these examples are provided by us, the humans! We carefully create the example translations, we decide on the categories for the images, we choose the taxonomy of classes that go into the NER system. We can call this effort, the process of creating examples with labels, annotation, and the person doing it an annotator.

At Unbabel, we use the Multidimensional Quality Metrics framework, or MQM, to assess the quality of our translations. Annotators are a big part of the process: they conduct error annotation, which involves, for each translation error encountered, highlighting the span of the error, classifying it from the list of issues, and finally assigning it a severity (minor, major, or critical). This is a bilingual effort, so the annotator has to be competent in both languages.
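To make the three parts of an error annotation concrete (span, issue type, severity), here is a minimal sketch of how one such annotation could be represented. The class and field names, the character-offset convention, and the example sentence are all assumptions made for illustration; they are not Unbabel’s actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """The three severity levels named in the text."""
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotated error (hypothetical structure, for illustration only)."""
    start: int          # character offset where the error span begins (inclusive)
    end: int            # character offset where the error span ends (exclusive)
    issue_type: str     # category picked from the issue list, e.g. "mistranslation"
    severity: Severity

    def __post_init__(self):
        # Reject empty or negative spans early.
        if not 0 <= self.start < self.end:
            raise ValueError("span must be non-empty and start at offset >= 0")

    def text(self, translation: str) -> str:
        """Return the highlighted part of the translation."""
        return translation[self.start:self.end]


# Invented example: the annotator highlights the word "sat".
translation = "The cat sat on the mat"
annotation = ErrorAnnotation(start=8, end=11,
                             issue_type="mistranslation",
                             severity=Severity.MINOR)
print(annotation.text(translation))  # sat
```

Two annotators can then disagree on any of the three fields independently: the span, the issue type, or the severity.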

Their job comes in different magnitudes: some of it is fine-grained error annotation, like when they’re evaluating if words are incorrectly translated, or overly literal. But sometimes, error annotation exists at a higher level, for example, when they’re judging whether this sentence is a better translation than this other sentence (ranking) or this sentence is a 9/10 but this other one a 3/10 (direct assessment). In some cases, especially when it comes to situations where they performed a direct assessment, it might be hard to understand what drove the judgement of the annotator. It’s one of the reasons why we are particularly fond of the MQM approach: we get a lot of insight into the perceived nature of the errors.

Because here’s the thing: annotators don’t always agree. When we on-board new annotators, it’s not uncommon to see disagreements where, in some instances, one annotator claims it’s a minor error, one claims it’s a major one, and one claims it’s critical! And these annotators are already highly qualified; it’s just not an easy task.

Disagreement happens for several reasons. First of all, the annotation task is an inherently subjective one. Annotators can simply have different preferences: some prefer translations that show greater grammatical fluency, while others place greater value on the preservation of meaning in the translation.

But there are other reasons. Despite the best efforts and constant tuning, instructions aren’t always crystal clear: we can’t predict all the cases in which a particular tag should be used, and again, language is ambiguous and poses challenges when you try to classify it.

Plus, humans make mistakes. A lot. They’re also famously riddled with biases, both at an individual level (e.g. they consistently prefer one reading/interpretation over the other) and at a group level, in the more socio-cultural sense of the term.

Lastly, even the quality of a competent annotator may vary — just try taking a language test in your own native language when you are tired or distracted.

But while disagreement is somewhat normal, it can certainly become a problem. If they don’t agree on the severity of an error, how do we know what it is?

Measuring (dis)agreement

For a start, we could use features of the annotation process to measure quality. But that can be problematic. Take, for example, the time an annotator takes to complete the task, a very simple quantity to obtain. We might assume that a fast annotator is hasty, and therefore prone to mistakes, while an annotator who takes a bit more time is just being thorough. But it might just as well be the case that the fast annotator is experienced and efficient, while the slow annotator is simply dragging.

It’s very hard to distinguish annotators by simple features alone. But when the metadata is more expressive of the task, like the keystroke behaviour of an editor, it can become very predictive of quality, as shown by Translator2Vec, a model developed at Unbabel.

Instead of looking at behavioural data, we can look at the predictions themselves. If we gather multiple judgements on the same item, we can do something more than characterize: we can compare! This is where inter-annotator agreement comes in. Inter-annotator agreement is typically measured with statistics that summarize, in a single number, the degree of agreement between different annotators. Take raw agreement, which is the share of items on which annotators agree in their judgement. This does present a problem: if people pick random labels often enough, they are bound to agree at some point, and we do not want to count that in. That’s precisely why Cohen’s kappa enjoys much greater popularity: it corrects for those chance agreements.
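The difference between raw agreement and Cohen’s kappa is easiest to see in a small calculation. The sketch below computes both for two annotators who label the same items. Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected if each annotator picked labels at random according to their own label frequencies. The severity labels in the toy data are invented.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    p_observed = raw_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented severity judgements from two annotators on six errors.
annotator_1 = ["minor", "minor", "major", "critical", "minor", "major"]
annotator_2 = ["minor", "major", "major", "critical", "minor", "minor"]

print(round(raw_agreement(annotator_1, annotator_2), 3))  # 0.667
print(round(cohens_kappa(annotator_1, annotator_2), 3))   # 0.455
```

The two annotators agree on four of six items, but because "minor" is frequent for both of them, a good part of that agreement could have happened by chance, and kappa ends up noticeably lower than the raw figure.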

This idea can be further extended to measure the consistency of an annotator, or in other words, the intra-annotator agreement. If there are multiple judgements by the same person on the same item — preferably with some time in between — then the same metrics as above can be used to measure the annotator against themselves.

| Text | Sadness rating | Agreement |
| --- | --- | --- |
| India’s Taj Mahal gets facelift | 8 | 0.7 |
| After Iraq trip, Clinton proposes war limits | 12.5 | -0.1 |

Illustration of annotator agreement (-1 to 1) on a clear example (first) and a questionable example (second) of sentiment rating (0 to 100), taken from Jamison and Gurevych (2015). The second example is one where coherence of the task and the labels breaks down because: “Is a war zone sad or just bad?”, while on the other hand: is a limit on war not a good thing? This objection is reflected in the agreement score which indicates that there was almost no correlation in the annotators’ judgments (0 means no correlation).

In the end, these metrics can help you get a grip on the quality of your data. They provide you with a number that can guide decision making: Do you need to demote certain annotators? Do you need to discard certain examples? But don’t be fooled: all metrics have flaws, and Cohen’s kappa is no exception.

Agree to disagree?

Should we always punish difference of judgment? Some data labelling tasks are inherently ambiguous, and in those, disagreement could be telling us something. Consider this example:

Unbabel example of MQM annotations on English-German from two different annotators. Yellow is a minor error, red a critical one. The example comes from an internally used test-batch used to train and evaluate annotators. (The visualization was created using an adaptation of Displacy.)

The source sentence is “Could you also give me the new email address you would like me to attach to your account.” It’s clear that the annotators have different approaches, with one clear point of agreement (the word neuen) and one big disagreement: the last part of the sentence. The MQM score resulting from the second annotation is 70, while that resulting from the first annotation is 40, which illustrates the big influence a critical error can have on the final score.

In this example, we prefer the second annotation. The first annotator claims that the last bit of the sentence is unintelligible, which, according to MQM guidelines, means that the exact nature of the error cannot be determined, but that it causes a major breakdown in fluency. This is an error you would apply to a garbled sequence of characters and numbers such as in “The brake from whe this કુતારો િસ S149235 part numbr,,.”, which is not necessarily what happens in the sentence above.

But we could argue that there is an interesting question here. If the last section of the translation contains so many mistakes that it almost becomes impossible to understand, doesn’t this constitute a “major breakdown in fluency”?

This example is taken from an experiment in which we compare and align annotators. Because both annotators are competent, and the source of disagreement can be understood, the step that follows the above observation is one of calibration: to make sure that all annotators are on the same page — with us and with each other.
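To see why a single critical error can move a score as much as in the example above, it helps to sketch how a severity-weighted score behaves. The text does not spell out the scoring formula, so the form used below is an assumption: each error adds a penalty by severity (1 for minor, 5 for major, 10 for critical), the total is normalised by the number of words, and the result is subtracted from 100. The error lists and the word count are invented and do not reproduce the annotations in the figure.

```python
# Assumed severity weights; the article does not state them.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_score(severities, word_count):
    """Severity-weighted quality score on a 100-point scale (assumed formula)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[s] for s in severities)
    return 100.0 - 100.0 * penalty / word_count


# Two invented annotations of the same 25-word translation.
lenient = ["minor", "minor", "minor"]
strict = ["minor", "critical"]

print(mqm_score(lenient, word_count=25))  # 88.0
print(mqm_score(strict, word_count=25))   # 56.0
```

Under these assumptions, the annotation with fewer errors gets the much lower score, simply because one of them is marked critical. That is why calibrating annotators on severity matters so much.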

Embracing the chaos

When dealing with this kind of disagreement, there are always a few things we can do to mitigate it. Sometimes, you can reduce disagreement by just providing more guidance. This is a matter of investing more human hours and understanding which labels and which tasks are causing the disagreement, and the solution can include rethinking labels, tools, incentives, and interfaces. This is a tried and trusted approach here at Unbabel.

Or you can ask other experts to repair your data. When this was recently done for a classic NER dataset that is still in use, researchers found label mistakes in more than 5 percent of the test sentences. That might not sound very significant, but it is a pretty large number for a dataset where state-of-the-art methods achieve performance of over 93 percent!

Example of corrections made by Wang et al. (2019) to the CoNLL03 NER dataset. (Adapted from Wang et al. using Displacy)

An interesting approach is to merge judgements: if you can get multiple annotations on the same data item, why not try to combine them into one?

We tend to rely on experts, because we believe they are more accurate, thorough, and ultimately, reliable. Since the annotations we use deal with a specialized taxonomy of errors and require a great level of language understanding in order to be used correctly, we rely on highly qualified annotators.

But here’s the fascinating thing: for some tasks that do not use a very specialized typology or assume a specialized type of knowledge, the aggregated judgement of several non-experts is as reliable as a single judgement from an expert. In other words: enough non-experts average into one expert. And the number of non-experts required for this can be surprisingly low. It’s this type of collective knowledge that built Wikipedia, for example.

Take the task of recognizing textual entailment (RTE). Textual entailment is a logical relation between two text fragments: the relation holds whenever the truth of one sentence follows from another. For example: “Crude oil prices slump” entails that “Oil prices drop”; it does not entail that “The government will raise oil prices” (adapted from Snow et al., 2008).

Aggregating the judgements of multiple non-experts into that of a single expert (green dashed line). Adapted from Snow et al. (2008)

Here, we see how aggregating the judgement of those non-experts can improve the accuracy of annotations (black line). And we can boost it even further by weighting each non-expert judgement with an automatically determined score that can be computed from their agreement with an expert, effectively correcting for their biases, as the blue line shows.
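The two aggregation strategies described above can be sketched in a few lines. The first function is a plain majority vote over yes/no entailment judgements. The second weights each non-expert by the log-odds of their (smoothed) accuracy on a handful of items that an expert has also labelled. This is a simplified illustration in the spirit of that idea, not a reimplementation of the method in Snow et al. (2008), and all names and data below are invented.

```python
import math


def majority_vote(votes):
    """votes: dict annotator -> bool. True only if a strict majority says True."""
    yes = sum(1 for v in votes.values() if v)
    return 2 * yes > len(votes)


def annotator_weights(gold, judgements, smoothing=1.0):
    """Log-odds of each annotator's smoothed accuracy on expert-labelled items."""
    weights = {}
    for annotator, answers in judgements.items():
        scored = [item for item in answers if item in gold]
        correct = sum(answers[item] == gold[item] for item in scored)
        accuracy = (correct + smoothing) / (len(scored) + 2 * smoothing)
        weights[annotator] = math.log(accuracy / (1.0 - accuracy))
    return weights


def weighted_vote(votes, weights):
    """Each vote counts +weight for True and -weight for False."""
    total = sum(weights.get(a, 0.0) * (1 if v else -1) for a, v in votes.items())
    return total > 0


# Expert labels on four calibration items (invented).
gold = {"g1": True, "g2": False, "g3": True, "g4": False}
judgements = {
    "ann_a": {"g1": True, "g2": False, "g3": True, "g4": False},  # 4 of 4 correct
    "ann_b": {"g1": True, "g2": False, "g3": True, "g4": True},   # 3 of 4 correct
    "ann_c": {"g1": True, "g2": True, "g3": False, "g4": False},  # 2 of 4 correct
}
weights = annotator_weights(gold, judgements)

# A new item where the most reliable annotator is outvoted.
new_votes = {"ann_a": True, "ann_b": False, "ann_c": False}
print(majority_vote(new_votes))           # False
print(weighted_vote(new_votes, weights))  # True
```

An annotator who is right only half the time gets a weight of zero, so their vote carries no information either way, while a consistently reliable annotator can outweigh two less reliable ones.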

Instead of weighting your annotators by confidence, you can also try to weight your examples by their difficulty, for example by assigning less importance to the easy examples or, more rigorously, by removing them entirely. The beauty of these two approaches is that the models themselves can be used to identify the candidates.
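As a rough illustration of both options, the sketch below assumes you already have, for every training example, the probability a model assigned to its gold label (for instance from held-out predictions). High confidence is taken as a sign that the example is easy. The function names, the floor, and the threshold are arbitrary choices for this example, not values from the cited work.

```python
def difficulty_weights(gold_confidences, floor=0.1):
    """Down-weight easy examples: weight = 1 - confidence, but never below `floor`."""
    return [max(floor, 1.0 - c) for c in gold_confidences]


def drop_easy(examples, gold_confidences, threshold=0.9):
    """Remove examples the model already predicts with confidence >= threshold."""
    return [ex for ex, c in zip(examples, gold_confidences) if c < threshold]


# Invented example ids and model confidences in the gold label.
examples = ["ex-1", "ex-2", "ex-3"]
gold_confidences = [0.98, 0.55, 0.20]

print(difficulty_weights(gold_confidences))
print(drop_easy(examples, gold_confidences))  # ['ex-2', 'ex-3']
```

The weights can then be passed to a training loop as per-example loss weights, while the filtered list simply replaces the original training set.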

All in all, it’s hard to remove all ambiguity. Take translation: for a single sentence, there are multiple (possibly very many) valid translations, perhaps each prioritizing a different aspect of translation quality. Just think about the multiple translations of a novel by different translators, or even over the decades. This is explicitly accounted for in the evaluation of translation systems, where it is considered best practice to always consider multiple valid reference translations when using an automatic metric. In the training of machine translation models, on the other hand, it remains an open question how to promote diversity, or in broader terms, how to deal with the fundamental uncertainty in the translation task.

It turns out that too much agreement isn’t good for your models either. When that happens, annotators can start to leave behind easy patterns, the so-called “annotation artifacts”, which are easily picked up by the models. The problem is caused by features in the input example that correlate strongly with the output label but do not capture anything essential about the task. For example, if all the pictures of wolves in the training data show snow and all the pictures of huskies do not, then this is very easy to pick up on, and equally easy to fool. The models fail, assuming that the lack of snow is what characterises a husky.

It turns out that language has its own version of snow, as was discovered for a dataset in natural language inference, a generalized version of RTE. The dataset is part of a very popular benchmark for training and evaluating language understanding systems that provides a “single-number metric that summarizes progress on a diverse set of such tasks”, and that has been an important driver of the trend for bigger, stronger, faster models.

| Relation | Sentence |
| --- | --- |
| Premise | A woman selling bamboo sticks talking to two men on a loading dock. |
| Entailment | There are at least three people on a loading dock. |
| Neutral | A woman is selling bamboo sticks to help provide for her family. |
| Contradiction | A woman is not taking money for any of her sticks. |

Natural language inference (NLI) example sentences created from a premise by following simple heuristics. (Taken from Gururangan et al. (2018).) The annotator is given the premise and constructs a sentence for each of the three logical relations (entailment, neutral, and contradiction). The generated sentence is called the hypothesis. The machine learning task is to predict the relation given the premise and the hypothesis.

The examples in this dataset are created by humans, who, it turns out, often rely on simple heuristics in the process. The result is a dataset where hypotheses that contradict the premise disproportionately contain not, nobody, no, never and nothing, while the entailed hypotheses are riddled with hypernyms like animal, instrument and outdoors to generalize over dog, guitar and beach, or approximate numbers like at least three instead of two. No wonder many examples can be accurately predicted from the hypothesis alone: all the model needs is to pick up on the presence of such words! And because different annotators resort to different tactics, it helps the model to know which annotator created the example, while it struggles to correctly predict examples from new annotators.
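A quick way to check a dataset for this kind of leak is to train a classifier that never sees the premise. The sketch below is a deliberately crude version of such a hypothesis-only baseline: it counts how often each word co-occurs with each label and lets the words of a new hypothesis vote. The training sentences are invented for illustration and are not taken from the dataset; a real check would use a proper classifier and held-out data.

```python
from collections import Counter, defaultdict


def train_hypothesis_only(examples):
    """examples: list of (hypothesis, label). The premise is deliberately ignored."""
    word_label_counts = defaultdict(Counter)
    for hypothesis, label in examples:
        for word in set(hypothesis.lower().split()):
            word_label_counts[word][label] += 1
    return word_label_counts


def predict(word_label_counts, hypothesis, default="neutral"):
    """Each known word votes for labels in proportion to how often it was seen with them."""
    scores = Counter()
    for word in set(hypothesis.lower().split()):
        counts = word_label_counts.get(word)
        if counts:
            total = sum(counts.values())
            for label, count in counts.items():
                scores[label] += count / total
    if not scores:
        return default
    return max(sorted(scores), key=scores.get)


# Invented hypotheses that mimic the heuristics described above.
train = [
    ("Nobody is outside", "contradiction"),
    ("The man is not sleeping", "contradiction"),
    ("A woman never smiles", "contradiction"),
    ("An animal is outdoors", "entailment"),
    ("A person plays an instrument", "entailment"),
    ("There are at least three people", "entailment"),
    ("The woman is selling sticks for her family", "neutral"),
    ("The man is waiting for his friend", "neutral"),
]
model = train_hypothesis_only(train)

print(predict(model, "The dog is not barking"))    # contradiction
print(predict(model, "An animal plays outdoors"))  # entailment
```

If a baseline like this scores well above chance on real data, the labels are leaking through surface cues rather than through the relation between premise and hypothesis.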

In practice, learning this type of relation will prevent generalization to examples that do not show this correlation. And this generalization is precisely what we are after. After all, you don’t want to be right for the wrong reasons: you will be very easy to fool with adversarially constructed examples. And the best solution to this problem in a dataset can be harsh, as in the above case where it was decided to not include it in the second iteration of the benchmark — a laudable example of attentiveness to advancing insights in our community.

At some point, you’ll have to embrace the chaos. Diversity in data is a good thing, and we should cherish it. From this viewpoint, disagreement among annotators is signal, not noise. We could even make ambiguity an explicit feature of our models, an approach that has been successfully applied in quality estimation of machine translation systems.

| Sentence | Sentence score | Meaning | Label score |
| --- | --- | --- | --- |
| Domestication of plants has, over the centuries improved disease resistance. | 0.63 | improvement or decline | 0.83 |
| | | cause to make progress | 0.68 |
| The dance includes bending and straightening of the knee giving it a touch of Cuban motion. | 0.24 | reshaping | 0.50 |
| | | arranging | 0.30 |
| | | body movement | 0.30 |
| | | cause motion | 0.25 |

Explicit ambiguity in a dataset on frame semantics (from Dumitrache et al., 2019). The first example fits relatively neatly with the categorization, as demonstrated by the high confidence in both of the labels and in the sentence overall. The second example shows a much greater overlap in labels as it can be seen as a combination of each of them, to some degree.

Taking this one step further, you can decide to create a dataset that contains ambiguity on purpose. Instead of providing a single label per data point, annotators are allowed to provide multiple labels, and instead of a single annotator per item, judgments are requested from multiple annotators. This multitude of judgements allows you to create a dataset with multiple correct answers, each weighted by a disagreement score that indicates the confidence in that label.

Take the example above, showing the results of that effort. The task is one of recognizing the multiple plausible word-senses (“frames”), and you get a sense of the uncertainty surrounding each item. This uncertainty is expressed by the weights assigned to the classes, and to the sentences (Dumitrache et al., 2019). The label score is the degree to which annotators agreed on that single label weighted by quality of the annotator, and the sentence score is the degree to which all annotators agreed on all the labels in the sentence.
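To get a feel for how such scores behave, here is a much simplified sketch. It is not the metric used by Dumitrache et al. (2019), which also weights each judgement by annotator quality. Here the label score is just the fraction of annotators who picked a label, and the sentence score is the average pairwise overlap (Jaccard similarity) between the label sets of different annotators. The frame names come from the table above; the individual votes are invented.

```python
from collections import Counter
from itertools import combinations


def label_scores(annotations):
    """annotations: dict annotator -> set of labels. Fraction of annotators per label."""
    n = len(annotations)
    counts = Counter(label for labels in annotations.values() for label in labels)
    return {label: count / n for label, count in counts.items()}


def sentence_score(annotations):
    """Average pairwise Jaccard overlap between annotators' label sets."""
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        return 1.0  # a single annotator trivially agrees with themselves

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented votes for a clear sentence and an ambiguous one.
clear = {
    "w1": {"improvement or decline"},
    "w2": {"improvement or decline"},
    "w3": {"improvement or decline", "cause to make progress"},
}
ambiguous = {
    "w1": {"reshaping"},
    "w2": {"arranging"},
    "w3": {"body movement", "reshaping"},
}

print(label_scores(clear)["improvement or decline"])  # 1.0
print(round(sentence_score(clear), 3))                # 0.667
print(round(sentence_score(ambiguous), 3))            # 0.167
```

Even in this crude form, the ambiguous sentence stands out through its low sentence score, and no single label reaches full support.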

In their research, Anca Dumitrache and her colleagues “found many examples where the semantics of individual frames overlap sufficiently to make them acceptable alternatives for interpreting a sentence.” She argues that ignoring this ambiguity creates an overly arbitrary target for training and evaluating natural language processing systems: “if humans cannot agree, why would we expect the answer from a machine to be any different?”

And indeed, our research is constantly evolving in this direction. This diversity of annotations is actually helping us build better labels, better tools, and ultimately better machine learning models. And while someone who’s pretty organised wouldn’t normally admit this, sometimes you just need to stop worrying and learn to embrace the chaos.

Sources

  • Lora Aroyo, Chris Welty, 2015, “Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation”, Association for the Advancement of Artificial Intelligence, https://www.aaai.org/ojs/index.php/aimagazine/article/view/2564
  • Trevor Cohn, Lucia Specia, 2013, “Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation”, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), https://www.aclweb.org/anthology/P13-1004
  • Anca Dumitrache, Lora Aroyo, Chris Welty, 2019, “A Crowdsourced Frame Disambiguation Corpus with Ambiguity”, https://arxiv.org/pdf/1904.06101.pdf
  • Mor Geva, Yoav Goldberg, Jonathan Berant, 2019, “Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets”, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, https://www.aclweb.org/anthology/D19-1107.pdf
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, Noah A. Smith, 2018, “Annotation Artifacts in Natural Language Inference Data”, Proceedings of NAACL-HLT 2018, https://www.aclweb.org/anthology/N18-2017.pdf
  • Emily K. Jamison and Iryna Gurevych, 2015, “Noise or additional information? Leveraging crowdsource annotation item agreement for natural language tasks.”, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, https://www.aclweb.org/anthology/D15-1035.pdf
  • Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, Yejin Choi, 2020, “Adversarial Filters of Dataset Biases”, https://arxiv.org/pdf/2002.04108.pdf
  • Rabeeh Karimi Mahabadi, James Henderson, 2019, “Simple but Effective Techniques to Reduce Dataset Biases”, https://arxiv.org/pdf/1909.06321.pdf
  • R. Thomas McCoy, Ellie Pavlick, Tal Linzen, 2019, “Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference”, Proceedings of the Association for Computational Linguistics (ACL), https://arxiv.org/pdf/1902.01007.pdf
  • Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Ng, 2008, “Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks”, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, https://www.aclweb.org/anthology/D08-1027.pdf
  • Zihan Wang, Jingbo Shang, Liyuan Liu, Lihao Lu, Jiacheng Liu, Jiawei Han, 2019, “CrossWeigh: Training Named Entity Tagger from Imperfect Annotations”, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, https://www.aclweb.org/anthology/D19-1519.pdf