Measuring Translation Performance


Performance metrics are a critical component of Unbabel that enables us to monitor progress toward the achievement of our goals, mission, and strategy. A portfolio of performance metrics and their corresponding target performance levels is necessary both to direct focus toward actions that improve these metrics and to communicate how much improvement is needed.

Key Indicators of Translation Performance

The two critical components that we consider in evaluating Unbabel’s translation performance are speed and quality.

Speed means the temporal efficiency of our translation process.

Effort is related to speed. We measure the amount of effort required from our crowdsourced translators to generate the translations. By minimizing effort, we increase speed and productivity, as well as the satisfaction of our translators and customers.

Quality means providing fluent translations that convey the same meaning as the input documents. Delivering superior translations requires rigorous qualitative (i.e. customer feedback) and quantitative (i.e. computational analysis) measurement of quality, the combination of which is the final measure of our translation performance.

How We Measure Speed, Effort, & Quality

Speed

Translation speed is measured as the time in seconds spent by the translator in generating the translation.

Effort

Effort is defined as the number of operations performed by the translator to transform the initial MT output into the final translation. Effort is typically represented as a percentage of the number of translated words. At Unbabel, we use Levenshtein distance and Translation Edit Rate to compute effort.
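As a rough illustration of the definition above, the sketch below computes a word-level Levenshtein distance between the MT output and the final translation and expresses it as a percentage of the translated words. The function names and whitespace tokenization are assumptions made for this example, not Unbabel's implementation, and the sketch leaves out the block shifts that Translation Edit Rate also counts.

```python
def word_edit_distance(mt_output: str, final: str) -> int:
    """Word-level Levenshtein distance: insertions, deletions, substitutions."""
    source, target = mt_output.split(), final.split()
    # previous[j] = distance between the source words seen so far and target[:j]
    previous = list(range(len(target) + 1))
    for i, src_word in enumerate(source, start=1):
        current = [i]
        for j, tgt_word in enumerate(target, start=1):
            cost = 0 if src_word == tgt_word else 1
            current.append(min(
                previous[j] + 1,         # delete a source word
                current[j - 1] + 1,      # insert a target word
                previous[j - 1] + cost,  # keep or substitute a word
            ))
        previous = current
    return previous[-1]


def effort_percent(mt_output: str, final: str) -> float:
    """Edits as a percentage of the number of words in the final translation."""
    n_words = len(final.split())
    if n_words == 0:
        return 0.0
    return 100.0 * word_edit_distance(mt_output, final) / n_words


if __name__ == "__main__":
    mt = "I have asked Maria to to be in touch today"
    post_edited = "I have asked Maria to be in touch with you today"
    print(word_edit_distance(mt, post_edited), effort_percent(mt, post_edited))
```

In this example the translator deletes one word and inserts two, so the distance is 3 edits over 11 translated words.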

Quality

Translation quality is a more complex concept than it may appear at first glance. For humans, evaluating translation quality is easy: anyone familiar with the source and target languages is able to assess it. But providing a concrete and automatically measurable definition of translation quality is a much more difficult task. In fact, such a definition does not yet exist despite the efforts of academic linguists and Natural Language Processing (NLP) scientists.

Though we don’t have a concrete definition of translation quality, experts have identified a multitude of different indicators that correlate (more or less) with the human perception of translation quality (Blatz04, wmt14, sec. 4.3):

  • Inconsistency: gender/number/tone discrepancy between different parts of the translated text, e.g. ‘I have informed Maria in our customer support team, he should be in touch with you later today.’
  • Misspelling: proper nouns and branded language, e.g. ‘I’ve asked Maria to meet me at Philz Coffee on 24th and Folsom.’
  • Duplication: duplicated words in the sentence, e.g. ‘I’ve asked Maria to to be in touch with you today.’
  • Typographical: unmatched parentheses/brackets, out of place punctuation signs, etc. e.g. ‘Hi, M&aria!’
  • Casing: inconsistent casing, e.g. ‘He wrote to Maria but he didn’t know that maria was out of the office that day.’ (note that He and he are not considered an error)
  • Whitespace: out of place, lack of, and multiple whitespaces, e.g. ‘I’ve asked Maria_ to be in touch withyou today.’

To detect these error indicators we employ a wide variety of techniques from ‘simple’ pattern matching on strings to sophisticated NLP models that allow us to detect the linguistic role of each word in a sentence, among other things. For instance, in the sentence:

“Your three items are on the way! Check your order here.”

We are able to determine that “three” is the adjective and “items” is the head noun, and thus our NLP models are able to check for adjective-noun agreement.
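The ‘simple’ pattern-matching end of this spectrum can be sketched with a few regular expressions and a bracket stack. The patterns, function names, and the choice of which indicators to cover are assumptions made for illustration; they are heuristics, not the detectors used in production.

```python
import re

# A word immediately repeated, e.g. "to to" (heuristic: also flags legitimate repeats).
DUPLICATE_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
# Two or more consecutive spaces.
MULTIPLE_SPACES = re.compile(r" {2,}")
# Whitespace directly before a punctuation sign.
SPACE_BEFORE_PUNCT = re.compile(r"\s+[,.;:!?]")


def has_unmatched_brackets(text: str) -> bool:
    """Return True if parentheses/brackets/braces are not properly paired."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in closers.values():
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return True
    return bool(stack)


def detect_errors(text: str) -> list:
    """Return the names of the error indicators found in the text."""
    errors = []
    if DUPLICATE_WORD.search(text):
        errors.append("duplication")
    if MULTIPLE_SPACES.search(text) or SPACE_BEFORE_PUNCT.search(text):
        errors.append("whitespace")
    if has_unmatched_brackets(text):
        errors.append("typographical")
    return errors


if __name__ == "__main__":
    print(detect_errors("I've asked Maria to to be in touch with you today."))
```

Indicators such as inconsistency or adjective-noun agreement need the linguistic analysis described above and are out of reach for string patterns like these.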

Because the indicators we track provide different but complementary information about the quality of the translation under consideration, we need an additional model to combine these indicators into a single, robust translation quality metric. At Unbabel, we adhere to the LISA quality assurance model where errors are categorized according to their relative importance. Following best practices, we place different errors in differing categories, depending on the severity of their impact on translation quality:

  • Critical: inconsistency
  • Major: misspelling, duplication
  • Minor: typographical, casing, whitespace

Each category has an associated weight: we weight critical errors by 1, major errors by 0.5, and minor errors by 0.1. Given a translation task, we sum the weights of all detected errors to obtain a total error score. Finally, this total error score is mapped to a 1-5 Likert scale that denotes the quality of the generated translation.

Error Score:   0     <= 0.5   <= 1   <= 2   > 2
Accuracy:      5     4        3      2      1
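A minimal sketch of this scoring scheme follows, using the categories and weights given above. The dictionary and function names are hypothetical, and the bin boundaries reflect one reading of the table (a score of exactly 0 maps to 5, scores above 0 and up to 0.5 map to 4, and so on).

```python
# Severity category for each error indicator, as listed in the text.
SEVERITY = {
    "inconsistency": "critical",
    "misspelling": "major",
    "duplication": "major",
    "typographical": "minor",
    "casing": "minor",
    "whitespace": "minor",
}

# Weight associated with each severity category.
WEIGHT = {"critical": 1.0, "major": 0.5, "minor": 0.1}


def error_score(detected_errors) -> float:
    """Sum the weights of all detected errors (one list entry per occurrence)."""
    total = sum(WEIGHT[SEVERITY[name]] for name in detected_errors)
    # Rounding keeps float noise (e.g. 0.30000000000000004) out of the bin checks.
    return round(total, 6)


def likert_from_score(score: float) -> int:
    """Map a total error score to the 1-5 scale (assumed reading of the table)."""
    if score <= 0:
        return 5
    if score <= 0.5:
        return 4
    if score <= 1:
        return 3
    if score <= 2:
        return 2
    return 1


if __name__ == "__main__":
    errors = ["duplication", "whitespace", "whitespace"]
    score = error_score(errors)
    print(score, likert_from_score(score))
```

Here one major and two minor errors add up to 0.7, which falls in the third bin of the table.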

In practice, a smooth mapping is computed according to the following equation:

Quality = 1 + 4 * exp(-0.7*error_score)

[Figure: the smooth mapping from error score to quality, plotted as a quality curve]
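The smooth mapping can be evaluated directly from the equation. The sketch below is only a transcription of that formula into Python, with a hypothetical function name, so that the values at a few error scores can be inspected.

```python
import math


def smooth_quality(error_score: float) -> float:
    """Quality = 1 + 4 * exp(-0.7 * error_score), as given in the text."""
    return 1.0 + 4.0 * math.exp(-0.7 * error_score)


if __name__ == "__main__":
    for score in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(f"error score {score:.1f} -> quality {smooth_quality(score):.2f}")
```

An error score of 0 gives exactly 5, and the curve decays toward 1 as errors accumulate. At error scores of 0.5, 1, and 2 it yields roughly 3.8, 3.0, and 2.0, which rounds to the 4, 3, and 2 of the stepwise scale.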

Translation Quality at Unbabel

We have three different metrics to evaluate translation performance at Unbabel: Speed, Effort, and Quality.
These measures are used throughout the Unbabel platform to achieve different goals. Broadly speaking, these uses can be classified into two categories:

  • Internal evaluation: these metrics capture the performance of our translation pipeline at any given moment in time. Monitoring them over time, we are able to evaluate the impact of improvements to our software.
  • Quality Assurance: in addition to our internal evaluation, performance metrics at Unbabel are applied in a wide variety of tasks intended to assure the quality of the translations provided to our customers:
      • Community evaluation: quality metrics are a key component of maintaining a highly effective community of translators and reviewers. They allow us to detect good and bad translators, to evaluate the feedback provided by reviewers, and to make sure our translators are working at appropriate speed and effort.
      • Quality filter: prior to being delivered to a customer, our translations pass multiple tests aimed at assuring the best translation quality. Translations sent to our clients must pass the quality filter. If they do not, they go back to the translation pipeline for improvement.

We believe that focusing on a small number of properly identified metrics, and aligning our team’s efforts around them, provides immense value to our translators and customers. The Translation Performance Metric is one of these metrics. It makes sure we deliver high quality texts to our customers, thereby helping them communicate with their international audience every day.

Further Information

For more information about how translation quality is evaluated, please see the following resources:

  1. Blatz, John, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. “Confidence Estimation for Machine Translation.” 29 Mar. 2004. Web. 19 Apr. 2015. <http://web.eecs.umich.edu/~kulesza/pubs/confest_report04.pdf>.
  2. Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA, June 26–27, 2014. © 2014 Association for Computational Linguistics.