A closer look at Unbabel’s award-winning translation quality estimation systems

We have a huge vision for Unbabel to provide human-quality translations at the scale of machine translation. But how do we know we’re doing a good job? 

For us, quality is a blend of having a good initial text to work from, feeding it through our domain-adapted machine translation, and then intelligently distributing these outputs to a curated community of editors, who we support with tools and aids that allow them to review, post-edit and approve the content as fast as possible.

First, here are the multiple ways that we measure, control and optimise quality across our language pipeline.

Quality Audits and Annotations

We conduct periodic Quality Audits of our customers and weekly annotations of sampled data, testing hypotheses and running deep analyses wherever we find higher-than-normal error rates in our pipeline. We use the industry-standard metric here, MQM (Multidimensional Quality Metrics), so that we can objectively compare our performance with third parties and open source translation libraries.

Our annotation process is conducted by a pool of specialists with backgrounds in Translation Studies and Linguistics, who are able to build a deep store of knowledge within our platform that boosts overall quality and decreases turnaround time to delivery. 

Customer Customization

At Unbabel we create and maintain glossaries for each client, and make sure that specific instructions, brand guidelines and tones of voice are adhered to. Editors in our community are able to access this information alongside translation tasks to have greater context when working on specific customer communications, ensuring an even higher quality and faster turnaround. 

Editor Evaluation and Editor Tools

Supported by collaborators from our community and academia, we perform continuous evaluations of our community with linguistic feedback. We create Training Tasks which resemble real tasks to accurately benchmark our editors, and produce linguistic guidelines to help educate the community in avoiding common mistakes. 

With the help of researchers in Natural Language Processing and other field specialists, we’re able to develop tools like Smartcheck, which provides alerts and suggestions to our community of editors to aid with proof-reading (think of a supercharged multilingual version of spellcheck). 

Unbabel’s Award-Winning Quality Estimation System

One of the key components of Unbabel’s translation pipeline is our Quality Estimation system, which identifies incorrect words and provides an automatic quality score for a translated sentence, enabling human post-editors to pay special attention to the parts of sentences that need to be changed.

Let’s imagine a source sentence, such as “Hey there, I am sorry about that!” (a real example from our Zendesk integration).

Now, imagine an automatic translation of this sentence into a target language like Portuguese, such as “Hey lá, eu sou pesaroso sobre aquele!” (unfortunately, also a real example: in this case, a very inaccurate and overly literal Portuguese translation produced by a popular MT system).

For this example, our system marks all non-punctuation words as incorrect and assigns a very low score of 0.222. 

Why do we care at all about quality estimation? First, there is evidence that quality estimation makes the job of human post-editors a lot easier. Pinpointing incorrect words helps them pay special attention to certain parts of sentences that likely need to be fixed.

Second, it lets us detect that a sentence is not yet ready to be delivered to our customers: if the automatic quality score is below a threshold, the sentence needs a human to fix it. This puts Unbabel on the right track to deliver consistent, high-quality translations.
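To make the thresholding idea concrete, here is a minimal Python sketch. The threshold value and the names ScoredTranslation and route are assumptions made for illustration; the post does not say what cut-off is used or how the production pipeline is wired.

```python
from dataclasses import dataclass

# Hypothetical cut-off: the article does not state the threshold actually used.
DELIVERY_THRESHOLD = 0.8


@dataclass
class ScoredTranslation:
    text: str
    quality_score: float  # estimated quality in [0, 1], higher is better


def route(translation: ScoredTranslation, threshold: float = DELIVERY_THRESHOLD) -> str:
    """Return 'deliver' if the score clears the threshold, otherwise 'post-edit'."""
    return "deliver" if translation.quality_score >= threshold else "post-edit"


# The example from the article, with the score quoted there.
example = ScoredTranslation("Hey lá, eu sou pesaroso sobre aquele!", 0.222)
print(route(example))
```

With any reasonable threshold, the 0.222 example is routed to a human editor rather than delivered.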

Quality estimation is one of the key shared tasks in the Conference/Workshop on Machine Translation (WMT) annual campaign. Every year, these campaigns evaluate and compare the best systems worldwide, both from academia and industry. In 2016, we gathered a team (including Chris Hokamp, a PhD student at Dublin City University, interning with us under the EU-funded EXPERT network) and participated for the first time in the word-level track.

Our system won the competition by a wide margin (an F1 score of 49.5%, against 41.1% for the best non-Unbabel system). It combined a feature-based linear model using syntactic features with three independent neural network systems, ensembled together.
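For readers unfamiliar with how word-level tags are scored, the sketch below computes F1 for each tag and the product of the two. The post only says "F1 score", so treating it as a per-class F1 (or a product of per-class F1 values) is an assumption here, and the gold and predicted tags are invented toy data.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 = 2PR / (P + R) for one label, computed over aligned word-level tags."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy tags, one per translated word (not real evaluation data).
gold = ["BAD", "BAD", "OK", "OK", "BAD", "OK"]
pred = ["BAD", "OK", "OK", "OK", "BAD", "BAD"]

f1_bad = f1_for_class(gold, pred, "BAD")
f1_ok = f1_for_class(gold, pred, "OK")
print(f1_bad, f1_ok, f1_bad * f1_ok)
```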

These results were very encouraging, but the problem was still far from solved. If it were, then machine translation would be nearly solved too, since one could query a quality estimation system to evaluate a long list of candidate translations and retrieve the best.
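The reranking argument can be sketched in a few lines of Python. Here pick_best and toy_scores are hypothetical names, and the lookup table stands in for a real quality estimation model; only the 0.222 score comes from the article.

```python
from typing import Callable, Sequence


def pick_best(candidates: Sequence[str], estimate_quality: Callable[[str], float]) -> str:
    """Return the candidate translation with the highest estimated quality."""
    if not candidates:
        raise ValueError("need at least one candidate translation")
    return max(candidates, key=estimate_quality)


# Stand-in for a real quality estimation model: a lookup table of scores.
# 0.222 is the score quoted in the article; 0.9 is invented for this example.
toy_scores = {
    "Hey lá, eu sou pesaroso sobre aquele!": 0.222,
    "Olá, peço desculpa pelo sucedido.": 0.9,
}

best = pick_best(list(toy_scores), toy_scores.__getitem__)
print(best)
```

The selection step is trivial; the hard part is an estimator accurate enough for the highest-scoring candidate to really be the best translation.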

Beating our own world record with Automatic Post-Editing 

So how could we improve even further? Another technology we make use of at Unbabel is Automatic Post-Editing (APE), the goal of which is not to detect errors or assess the quality of the MT, but to automatically correct a translation. 

In our example above, a good outcome would be transforming the painful “Hey lá, eu sou pesaroso sobre aquele!” into something like “Olá, peço desculpa pelo sucedido.”

Given the natural similarity between the Quality Estimation and Automatic Post-Editing tasks, we decided to join our efforts to see whether we could achieve better Quality Estimation by using the output of an Automatic Post-Editing system as an additional feature.
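One plausible way to turn a post-edited sentence into a word-level signal is to align it with the machine translation and flag every word the post-editor did not keep. The sketch below does this with Python's standard difflib module. It is an illustration under that assumption, not a description of the system in the paper, and tags_from_post_edit is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def tags_from_post_edit(mt_tokens: Sequence[str], ape_tokens: Sequence[str]) -> List[str]:
    """Tag each MT token OK if the post-edit keeps it in place, BAD otherwise."""
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=list(mt_tokens), b=list(ape_tokens), autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block says mt_tokens[a:a+size] == ape_tokens[b:b+size].
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags


mt = "Hey lá , eu sou pesaroso sobre aquele !".split()
ape = "Olá , peço desculpa pelo sucedido .".split()
for token, tag in zip(mt, tags_from_post_edit(mt, ape)):
    print(token, tag)
```

Tags derived this way could then be fed to a quality estimation model as one more feature alongside its own predictions.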

To test the hypothesis, we teamed up with Marcin Junczys-Dowmunt from Adam Mickiewicz University (AMU), whose team won the Automatic Post-Editing task at WMT 2016. They had been extremely successful in creating additional data using “round-trip translations” and combining monolingual and bilingual neural machine translation systems with a log-linear model.

The results exceeded our best expectations. By combining the AMU automatic post-editing system with our previous Quality Estimation system via a technique called “stacked ensembling”, we improved our previous best word-level score from 49.5% to a new state of the art of 57.5% (an absolute improvement of 8 percentage points).
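In general terms, stacked ensembling trains a second-level model on the outputs of first-level systems. The Python sketch below illustrates the idea with a tiny logistic regression over two inputs per word: a probability from a quality estimation system and a flag derived from a post-editing system. The feature layout, the data, and the function names are all assumptions for illustration; the actual architecture is described in the paper.

```python
import math
from typing import List, Sequence


def predict_proba(weights: Sequence[float], x: Sequence[float]) -> float:
    """P(word is BAD) under a logistic model; weights[0] is the bias."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_model(features: Sequence[Sequence[float]], labels: Sequence[int],
                     epochs: int = 2000, lr: float = 0.5) -> List[float]:
    """Fit logistic-regression weights (bias first) by batch gradient descent."""
    weights = [0.0] * (len(features[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(features, labels):
            error = predict_proba(weights, x) - y
            grad[0] += error
            for j, value in enumerate(x):
                grad[j + 1] += error * value
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad)]
    return weights


# Invented held-out data, one row per word:
# [P(BAD) from a QE system, 1.0 if a post-editing system changed the word].
# In stacking, these first-level outputs should come from data the
# first-level systems were not trained on.
held_out_features = [
    [0.9, 1.0], [0.8, 1.0], [0.6, 0.0], [0.4, 1.0],
    [0.2, 0.0], [0.1, 0.0], [0.7, 1.0], [0.3, 0.0],
]
held_out_labels = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = BAD, 0 = OK

weights = train_meta_model(held_out_features, held_out_labels)
stacked_tags = ["BAD" if predict_proba(weights, x) >= 0.5 else "OK"
                for x in held_out_features]
print(stacked_tags)
```

In this toy data the second-level model learns to trust the post-editing flag where the first system alone is unsure (the rows with 0.6 and 0.4).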

We also built a sentence-level quality scoring system, obtaining a Pearson correlation of 65.6%, an absolute gain of over 13 percentage points over the previous best system, developed by Yandex.
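For reference, Pearson correlation measures how closely predicted sentence scores track reference scores on a linear scale. The sketch below implements the standard formula in plain Python; the two score lists are invented and the function name pearson is simply a convenient label.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented sentence-level scores: system predictions vs. reference judgments.
predicted = [0.22, 0.75, 0.60, 0.91, 0.40]
reference = [0.10, 0.80, 0.55, 0.95, 0.50]
print(round(pearson(predicted, reference), 3))
```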

Our continued success here means that we can make quality estimation useful in practice, reducing post-editing times and ensuring fast, high-quality translations to Unbabel’s customers. 

The Unbabel AI Research team (André Martins, Ramon Astudillo and Fábio Kepler) led the quality estimation experiments.

The full details are in our TACL paper, which was just accepted for publication:

André F. T. Martins, Marcin Junczys-Dowmunt, Fabio N. Kepler, Ramon Astudillo, Chris Hokamp. “Pushing the Limits of Translation Quality Estimation.” In Transactions of the Association for Computational Linguistics, 2017 (to appear soon).

Dr. Helena Moniz runs the quality team at Unbabel on a daily basis.

Unbabel’s co-founder and CTO. PhD in Natural Language Processing and Machine Learning at IST + Upenn with Professors Fernando Pereira, Ben Taskar and Luísa Coheur. Author of several papers in machine learning with side information, unsupervised learning and machine translation. Co-founder of the Lisbon Machine Learning Summer School.
