Introducing Unbabel-COMET v2.0: Improved Models and Metrics for Better Machine Translation Evaluation

March 16, 2023

Since they were first released in 2020, the COMET framework and metrics have had a significant impact on the MT field by providing more accurate MT evaluation and by helping the MT community move away from outdated lexical metrics such as BLEU, which are known to correlate poorly with human judgments [Mathur et al. 2020, Freitag et al. 2021a, Kocmi et al. 2021].

Since its first release, COMET has consistently been recognized as the most accurate way to evaluate machine translation. Kocmi et al. 2021 showed that COMET was the most accurate metric across 101 different languages and 232 translation directions. Freitag et al. 2021a showed that COMET correlates better with annotations performed by experts than other existing metrics. Recently, Sai et al. 2022 corroborated Freitag's results for Indian languages, showing that our metrics are also robust on low-resource language pairs. Since its first release, our research team has also been working hard on further improvements to our COMET models, participating in several WMT shared tasks where our submissions have consistently ranked among the winning entries [Mathur et al. 2020, Freitag et al. 2021b, Freitag et al. 2022].

This year was no exception: we leveraged the COMET framework to secure the top spot in the Quality Estimation shared task. Additionally, in the Metrics task, our system ranked first for one of the three evaluated language pairs and second for the other two, behind only a private metric with over 6 billion parameters. Given these recent advances, we believe it's time to release a new open-source version: unbabel-comet v2.0.

New and Improved Metric

With this release, we are also changing the default model to a better one: wmt22-comet-da.

Like previous models, wmt22-comet-da builds on top of XLM-R, an open-source multilingual model developed by Meta AI that supports 100 languages. This model is then fine-tuned on human judgments to produce MT quality assessments. Unlike most existing MT metrics, COMET metrics use the source sentence in the evaluation process, which helps detect errors that cannot be found by comparing against a reference alone [Amrhein et al. 2022].

Additionally, the new model is trained on more data covering more language pairs, with better hyper-parameters that lead to better generalization across domains and language pairs. To make scores easier to interpret, the new model also outputs scores between 0 and 1.
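Trying the new default model takes only a few lines of Python. Here is a minimal sketch using the package's `download_model`, `load_from_checkpoint`, and `predict` entry points (the example sentences are illustrative):

```python
from comet import download_model, load_from_checkpoint

# Download the new default model and load it from the checkpoint.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each sample carries the source, the MT hypothesis, and a reference.
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire.",
    }
]

output = model.predict(data, batch_size=8, gpus=0)  # set gpus=1 if a GPU is available
print(output.scores)        # one score in [0, 1] per segment
print(output.system_score)  # corpus-level average
```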

Results on WMT22 Benchmarks

To compare our new state-of-the-art model against the models from 2020 across different domains, we used the MQM annotations from the WMT 2022 Metrics shared task [Freitag et al. 2022]. These annotations cover 3 high-resource language pairs (English-German, English-Russian, and Chinese-English) and, for each language pair, translations from 4 different domains: news, e-commerce, social media, and conversational.

In Figure 1, we can observe the performance of the previous models and the new one, for each domain, in terms of segment-level Kendall's tau. The new model shows superior performance in all domains.
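For readers who want to run this kind of comparison themselves, segment-level correlation can be computed with standard tooling. Below is a small illustrative sketch using SciPy's `kendalltau` with made-up scores (the WMT shared task uses a closely related tau variant, so treat this as a simplification):

```python
from scipy.stats import kendalltau

# Hypothetical segment-level scores: one entry per translated segment.
human_mqm = [0.2, 0.9, 0.5, 0.7, 0.1]      # human quality judgments
metric_scores = [0.3, 0.8, 0.6, 0.7, 0.2]  # scores from a COMET model

tau, p_value = kendalltau(human_mqm, metric_scores)
print(f"Kendall's tau: {tau:.3f} (p = {p_value:.3f})")
```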

To test the robustness of our models across language pairs, including mid- and low-resource ones, we used data from the WMT 2022 General Translation task. This data was collected using SQM and covers a wide range of language pairs, including languages that COMET models were never trained on: English-Croatian, English-Ukrainian, Czech-Ukrainian, and Yakut-Russian.

From Figure 2, we can also observe that the new model generalizes better to these new, unseen languages. Looking at the Ukrainian language pairs (en-uk and cs-uk), we can see that the new model outperforms the model from 2020.

Figure 1 – Kendall's segment-level correlation with MQM annotations for the news, e-commerce, social media, and conversational domains. These annotations were collected for the WMT 2022 Metrics shared task for 3 language pairs (English-German, English-Russian, and Chinese-English) [Freitag et al. 2022].

Figure 2 – Segment-level Kendall correlation with SQM annotations from the WMT 2022 General Translation task [Kocmi et al. 2022]. These annotations were performed across different domains and include multiple system submissions.

Improved MBR command

Another feature we would like to highlight in this release is the comet-mbr command.

The comet-mbr command was born out of our QUARTZ project, where we studied how to prevent MT errors by incorporating QE and metric models into the decoding process. In our Quality-Aware Decoding paper [Fernandes et al. 2022], we showed that Minimum Bayes Risk (MBR) decoding and/or a two-stage approach (ranking + MBR) can reduce severe errors (critical plus major) by up to 40% across different language pairs. Both techniques can now be easily used with the new and improved models to further improve MT quality.
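To give an intuition for what MBR decoding does under the hood, here is a small sketch in which each candidate translation is scored against the other candidates used as pseudo-references, and the one with the highest expected COMET score wins. The `mbr_select` helper below is our own illustration, not part of the package; the comet-mbr command implements a more efficient version of the same idea:

```python
from comet import download_model, load_from_checkpoint

# Load the new default model once; it serves as the MBR utility function.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def mbr_select(source: str, candidates: list[str]) -> str:
    """Return the candidate with the highest expected COMET score,
    treating the other candidates as pseudo-references.
    Assumes at least two candidate translations."""
    best, best_utility = candidates[0], float("-inf")
    for hyp in candidates:
        data = [
            {"src": source, "mt": hyp, "ref": pseudo_ref}
            for pseudo_ref in candidates
            if pseudo_ref is not hyp
        ]
        scores = model.predict(data, batch_size=8, gpus=0).scores
        utility = sum(scores) / len(scores)  # average utility over pseudo-references
        if utility > best_utility:
            best, best_utility = hyp, utility
    return best
```

Note that this naive version makes a quadratic number of model calls in the number of candidates, which is why a dedicated command is worth having.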

About the Author

Content Team

Unbabel’s Content Team is responsible for showcasing Unbabel’s continuous growth and incredible pool of in-house experts. It delivers Unbabel’s unique brand across channels and produces accessible, compelling content on translation, localization, language, tech, CS, marketing, and more.