Why We Built COMET, a New Framework and Metric for Automated Machine Translation Evaluation

Human languages are as diverse and complex as they are plentiful, with more than 6,900 distinct languages spoken around the globe. The subtleties and nuances of different languages, from tense to tone to idiom, make translating between them one of the biggest, most interesting challenges we’ve undertaken as a species.

This complexity is also why many have long believed that machine translation will simply never meet or even approach human-quality translation.

I’ve spent the last several decades studying natural language processing. I have explored and developed computational algorithms and processes for building automated translation systems and for evaluating their accuracy and performance. These experiences led me to understand that the market needed a new framework and metric for automated machine translation evaluation.

We will always need humans in the loop to help build and train machine translation systems, identify and correct errors, and feed corrections into the data and algorithms used to train and refine them. But our recent project, COMET (Crosslingual Optimized Metric for Evaluation of Translation), offers a fresh approach to measuring and improving MT quality over time. We presented a research paper describing this work at the EMNLP 2020 conference in November. The results in the paper established COMET as the current state of the art. COMET was also recently validated as a top-performing metric by the 2020 Fifth Conference on Machine Translation (WMT20).

In this post, I’ll explain why this matters, share how COMET works, and try to convince you that high-quality MT is not just theoretically possible but closer to reality than it has ever been.

Translation quality matters because customers matter

The quality of MT matters because customers—that is, people—matter. Any business that wants to survive and thrive in 2020 and beyond must consider how it will reach and support customers in their native languages. After all, 40 percent of customers won’t buy in other languages. And 96 percent of customers globally state that customer service is a key factor in their choice of which brands they prefer and are loyal to for their products and services.

Enabling global business and bringing high-quality customer service to people no matter what language they speak is Unbabel’s raison d’être. Our ultimate goal? Bridge global language and cultural barriers and become the world’s translation layer.

It may sound lofty, but it’s a mission we believe in.

So how do we achieve high-quality MT? It starts with having an effective way to measure the accuracy and quality of any given translation. As the well-known adage says: “you can’t improve what you can’t measure.”

Of course, one of the many challenges of measuring translation quality is that language is ambiguous and subjective. However, that doesn’t mean translation quality can’t be measured.

A common approach to quantifying translation accuracy is to ask human translators and bilingual speakers to identify and score translation errors based on their severity.

For example:

Minor issues: Do not affect purpose or understandability, but may make the content less appealing or native-sounding.
Major issues: Affect the purpose or understandability, but the essential meaning and overall goal of the source text is maintained after translation.
Critical issues: Result in major changes or omission of essential meaning and carry the risk of negative outcomes that can have health, safety, legal, or financial implications.
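
To make the idea of severity-based scoring concrete, here is a minimal sketch of how annotated errors could be rolled up into a single number. The penalty weights (1, 5, 10), the per-100-words normalization, and the names `SEVERITY_WEIGHTS`, `ErrorAnnotation`, and `severity_weighted_score` are all assumptions made for illustration, not a definition taken from any particular scoring standard.

```python
from dataclasses import dataclass

# Hypothetical penalty weights per severity level; real schemes choose their own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


@dataclass
class ErrorAnnotation:
    severity: str  # "minor", "major", or "critical"
    note: str = ""


def severity_weighted_score(errors, word_count):
    """Return a 0-100 score: 100 minus the weighted penalty per 100 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    # Clamp at zero so very poor translations do not go negative.
    return max(0.0, 100.0 - 100.0 * penalty / word_count)


annotations = [
    ErrorAnnotation("minor", "unnatural word order"),
    ErrorAnnotation("major", "wrong verb tense changes the timeline"),
]
print(severity_weighted_score(annotations, word_count=50))  # prints 88.0
```

Normalizing by length keeps a long translation with a handful of small slips from being scored below a short one with a single serious error.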

One well-developed translation error categorization and scoring framework has emerged in recent years, known as “Multidimensional Quality Metrics (MQM).” With a basic framework like this in place, we can begin to measure translation quality, even while recognizing that language itself is subjective and there is typically no single correct “gold standard” in translation. MQM is extremely useful for detecting and quantifying errors, but it requires trained human experts. It is therefore slow and expensive. This means it has limited value as a tool for measuring and guiding the training and development of modern high-accuracy machine translation systems. For that purpose, we need an automated translation quality metric that can generate quality scores that accurately correlate with expert human judgments such as MQM.

Where current machine translation quality metrics fall short

Over the last 20 years or so, several different automated metrics have been developed to measure machine translation quality, with varying degrees of success. Widely adopted metrics such as BLEU, chrF, and METEOR (the last of which I myself invented some 16 years ago) have been extensively studied and improved. While very useful at earlier stages of MT, these metrics are now largely outdated and of limited value for evaluating the AI technology that powers MT today.

So where have they fallen short? To date, metrics for evaluating MT quality have relied on assessing the similarity between a machine-generated translation and a human-generated reference translation. They have focused on basic, lexical-level features, which in practice means counting the number of matching characters, words, or phrases between the MT output and the reference translation. By design, such metrics largely fail to recognize and capture semantic similarity beyond the lexical level.
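
A toy example shows the blind spot. The sketch below computes a simple clipped word n-gram precision, which is a simplified stand-in for the lexical matching described above and not an implementation of BLEU, chrF, or METEOR. The function names and the example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hypothesis, reference, n=1):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped counts)."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it appears in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "the meeting was postponed until next week"
paraphrase = "they delayed the meeting to the following week"  # same meaning, different words
wrong = "the meeting was postponed until next year"            # one word off, wrong meaning

print(ngram_precision(paraphrase, reference))  # prints 0.375
print(ngram_precision(wrong, reference))
```

The faithful paraphrase scores well below the translation that gets the date wrong, because the metric only sees which words overlap, not what the sentence means.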

The fundamental problem is that these approaches don’t capture the semantic similarity between the MT-generated translation and a human reference translation at a level sufficient for accurately matching the quantified judgments of human experts (such as MQM). Now that our MT systems are much better than before, these earlier metrics often no longer correctly distinguish between better and worse translations, and consequently, between better and worse translation systems.

COMET’s path and why we launched it

COMET is a new neural framework (that is, a set of algorithms) for training and running multilingual MT evaluation models. That’s a fancy way of saying it’s a new system that can help evaluate and predict the quality of machine-generated translations for many different languages.

Here’s what makes it new and different: COMET is designed to learn to predict human judgments of MT quality. It does this by using a neural system to first map the MT-generated translation, the reference translation and the source language text into neural meaning representations. It then leverages these representations in order to learn to predict a quality score that is explicitly optimized for correlation with human judgments of translation quality.
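
The flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not COMET's actual code: the toy three-dimensional vectors stand in for the output of a pretrained cross-lingual encoder, the way the three representations are combined (concatenation, element-wise products, absolute differences) is an assumed recipe, and the single linear layer in `predict_quality` is a placeholder for a trained regressor. The names `combine_features` and `predict_quality` are hypothetical.

```python
def combine_features(src, hyp, ref):
    """Build one feature vector from three same-length sentence embeddings.

    Assumed recipe (not stated in this post): concatenate the hypothesis and
    reference embeddings with element-wise products and absolute differences
    taken against both the source and the reference.
    """
    if not (len(src) == len(hyp) == len(ref)):
        raise ValueError("embeddings must have the same dimension")
    prod_src = [h * s for h, s in zip(hyp, src)]
    prod_ref = [h * r for h, r in zip(hyp, ref)]
    diff_src = [abs(h - s) for h, s in zip(hyp, src)]
    diff_ref = [abs(h - r) for h, r in zip(hyp, ref)]
    return list(hyp) + list(ref) + prod_src + prod_ref + diff_src + diff_ref


def predict_quality(features, weights, bias=0.0):
    """Stand-in for the trained regressor: a single linear layer."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Toy "embeddings"; a real system would obtain these from a neural encoder.
source = [0.2, 0.7, 0.1]
reference = [0.3, 0.6, 0.1]
hypothesis = [0.3, 0.6, 0.2]

features = combine_features(source, hypothesis, reference)
weights = [0.0] * len(features)  # placeholder; training would learn these values
print(len(features), predict_quality(features, weights, bias=0.5))  # prints 18 0.5
```

In a trained model, the weights would be fitted so that the predicted score tracks human quality judgments as closely as possible.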

The resulting neural model can then be used as a metric to assess the quality of any particular MT engine and automate the process of evaluating quality (rather than requiring an expert human to annotate every translation). We complement this approach with periodic human multidimensional quality metrics (MQM) annotations to validate quality and to confirm and improve COMET’s predictions over time. As I said earlier, humans will always be in the loop—and that’s not a bad thing!

COMET wasn’t possible before now. It takes advantage of recent breakthroughs in large-scale cross-lingual neural language modeling, resulting in multilingual and adaptable MT evaluation models unlike anything the world has seen before.

COMET also takes a unique approach of incorporating information from both the source text and a target-language reference translation to more accurately predict MT quality. During our evaluation of COMET, we found that our models trained with the framework significantly outperformed all other metrics in terms of their correlation with human judgments. COMET can also be adapted and optimized to take into account different types of human judgments of MT quality (such as MQM scores or post-editing distance).
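
"Correlation with human judgments" can be made concrete with a small sketch. The example below uses Pearson correlation as one common choice; it is an assumption that this particular statistic matches the one used in our evaluation. All scores are invented for illustration, and `pearson`, `metric_a`, and `metric_b` are hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up human quality scores for five translations, plus two made-up metrics.
human = [92.0, 75.0, 60.0, 88.0, 40.0]
metric_a = [0.81, 0.62, 0.50, 0.77, 0.30]  # rises and falls with the human scores
metric_b = [0.55, 0.60, 0.58, 0.52, 0.57]  # barely related to the human scores

print(round(pearson(human, metric_a), 3))
print(round(pearson(human, metric_b), 3))
```

The metric whose scores rank translations the way the annotators did gets a correlation close to 1, and that is the property a learned metric is trained to maximize.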

In other words, we’re getting closer and closer to a machine that can judge translation quality as accurately as a human being.

One of the coolest things about COMET is that it can help us understand which MT models work the best. Even the most recent contributions to MT evaluation struggle to differentiate between the highest-performing systems. COMET can accurately identify the better system, even when the performance of the two systems is very similar. This makes it a very useful tool for continually improving MT, because we can now easily differentiate between models and pick the better one.
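
When two systems score very close to each other, it helps to check whether the gap is stable or just noise. One way to do that, sketched below, is paired bootstrap resampling over per-segment metric scores. This is a general-purpose technique assumed here for illustration, not a procedure described in this post, and the function name and scores are invented.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A's mean score beats B's.

    scores_a[i] and scores_b[i] must be metric scores for the same source segment.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw segment indices with replacement and reuse them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Made-up per-segment scores for two systems on the same six segments.
system_a = [0.62, 0.71, 0.55, 0.80, 0.47, 0.66]
system_b = [0.60, 0.73, 0.50, 0.78, 0.45, 0.61]

print(paired_bootstrap_win_rate(system_a, system_b))
```

A win rate near 1.0 suggests system A is reliably better on this test set, while a value near 0.5 means the metric cannot separate the two.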

How to get your hands on COMET

We have just released an open-source version of the COMET framework and trained models to benefit the wider MT community, and will continue to develop and improve these models over the next year. The code is available at https://github.com/Unbabel/COMET. It’s easy to install and to run, and we encourage all MT developers and users to try it out on their own!

Unbabel’s customers will directly benefit from COMET, because we will use it to refine the models and systems we use over time, and continually improve the quality of our translations for customer service teams. (Yes, we eat our own dog food over here!)

Our hope is for COMET to become a new standard metric for measuring the quality of MT models.

The way we see it, when you try to shoot down a METEOR—you might just land on a COMET.
