Gazing at the Future of Machine Translation Quality with MT-Telescope

July 30, 2021

Throughout the history of machine translation, one of the greatest challenges that researchers and engineers have come up against is the complex task of assessing the accuracy of translations. When putting different translation systems head to head, what actually makes one superior to another in terms of translation quality?

The enigma of machine translation quality is something that our research team has been studying closely for many years now. From our award-winning and open-source OpenKiwi quality estimation tool to our Crosslingual Optimized Metric for Evaluation of Translation (COMET) framework, we are proud to have helped the machine translation community make great strides in assessing machine translation quality performance. 

Building on the momentum and knowledge we’ve gained since releasing COMET at the end of last year, we are very excited to announce the latest development in understanding the quality performance of machine translation systems: MT-Telescope.

Let’s take a look at why MT-Telescope is a leap forward for machine translation quality analysis, how it works, and what it means for our customers and the MT community at large. 

Why is MT-Telescope a breakthrough in machine translation quality?

“Our research shows that one of the biggest needs in applying machine translation is insight into its usability, an area where current methods fall short. Guidance-focused evaluation that focuses on how well MT suits particular use cases will help extend the technology to new areas and increase acceptance of machine translation-based workflows.”

— Dr. Arle Lommel, senior analyst at CSA Research

Automated measurement metrics that generate quality scores, such as COMET, are very useful because they reduce the need for human inspection, enabling fast and cost-effective prediction of translation quality. But the automated metric itself is only one piece of a much larger puzzle. Another important factor in effective MT evaluation is the selection and curation of meaningful and representative sets of test data. 
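To make that concrete, here is a minimal sketch of how a COMET-style quality score can be produced programmatically. It assumes the open-source unbabel-comet package; the checkpoint name and exact call signature are illustrative and vary between releases.

```python
# A minimal sketch of automatic quality scoring with COMET, assuming the
# open-source `unbabel-comet` package; the checkpoint name below is
# illustrative and call signatures differ slightly between releases.
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt20-comet-da")   # download a pretrained checkpoint
model = load_from_checkpoint(model_path)

# Each sample pairs a source segment, a machine translation, and a reference.
data = [
    {
        "src": "Olá, como posso ajudar?",
        "mt": "Hello, how can I help?",
        "ref": "Hi, how may I help you?",
    },
]

# predict() returns segment-level scores and an aggregate system-level score,
# which is what makes large test sets cheap to evaluate without human review.
output = model.predict(data, batch_size=8, gpus=0)
print(output)
```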

However, even with well-curated data and extensive experience applying automated metrics, we came to realize that the performance scores generated by these metrics tell only part of the story. It’s important for our MT engineers not only to determine that one model scored better than another, but also to understand why. 

Let’s say that a team of MT engineers is examining the quality performance of two machine translation systems on a test set using an automated measurement metric, and one gets a score of 78 while the other receives a score of 80. Until now, many engineers in this situation would presume the system that scored an 80 is the “winner,” so they might drop the other system and carry on with their work.

In that scenario, it’s hard to know what factors caused the second system to earn a higher score on that particular test data. What if the lower-scoring system actually produced more accurate translations for specific terms that resonate better with a native speaker of a given language? Manually digging deeper into the translations generated by the two systems to find answers is extremely time-intensive, which nullifies the efficiency gains that an automated metric like COMET provides. 

This conundrum is why we developed MT-Telescope, which surfaces the data underlying quality performance and provides a granular comparison between two systems. MT-Telescope essentially blows the machine translation models’ performance on test data wide open and allows engineers to make much more nuanced and informed decisions about why they would choose one system over another. 

Machine translation is a rapidly evolving field. As one recently published paper made clear, an increasing number of MT researchers and developers have failed to follow rigorous, state-of-the-art evaluation methods in recent years, relying solely on aggregate scores from outdated automated metrics, without any additional validation. That’s why MT-Telescope was designed to enable the seamless adoption of best practices in MT evaluation and quality analysis as new advancements are made.

How does MT-Telescope work? 

The process is fairly simple. A user uploads a test suite consisting of data extracted from a meaningful and representative collection of test documents. For a given test suite, the data files should include the following (a minimal loading sketch follows the list):

  1. The source segments (untranslated content in the source language)

  2. The reference content (perfect, human-generated translations) for the source segments

  3. Translations from two different MT systems for the source segments
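As a rough illustration, the sketch below loads a test suite laid out as four line-aligned plain-text files, one segment per line. The file names and the line-aligned layout are assumptions for the example, not names or formats required by MT-Telescope.

```python
# Illustrative loading of a test suite as four line-aligned text files
# (one segment per line); file names here are hypothetical placeholders.
from pathlib import Path

def read_lines(path: str) -> list[str]:
    return Path(path).read_text(encoding="utf-8").splitlines()

sources = read_lines("testsuite/src.txt")        # 1. source segments
references = read_lines("testsuite/ref.txt")     # 2. human reference translations
system_x = read_lines("testsuite/system_x.txt")  # 3. translations from MT system X
system_y = read_lines("testsuite/system_y.txt")  #    translations from MT system Y

# All four files must have the same number of lines, so that line i of each
# file refers to the same source segment.
assert len(sources) == len(references) == len(system_x) == len(system_y)

test_suite = list(zip(sources, references, system_x, system_y))
```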

From there, the analysis runs automatically on top of the underlying metric and is visualized in an intuitive browser interface, which allows engineers to execute a rigorous and detailed evaluation with little effort. Instead of a complex set of raw outputs, MT-Telescope’s UI makes it simple to compare and contrast the scored translations generated by the two machine translation systems. 

The visualizations include graphical representations showing the difference in quality scores for specific subsets of the test data (such as segments containing named entities and terminology), a side-by-side error analysis of each system, and a high-level view of the distribution of quality scores across the two systems.
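As a rough sketch of the kind of comparison behind those views, the snippet below summarizes two systems’ segment-level scores (from any segment-level metric) as simple distribution statistics and a head-to-head win/tie/loss count. It is an illustration, not MT-Telescope’s actual reporting code.

```python
# Illustrative summary of two systems' segment-level quality scores:
# distribution statistics plus a head-to-head win/tie/loss count.
from statistics import mean, median

def summarize(scores_x: list[float], scores_y: list[float]) -> None:
    deltas = [x - y for x, y in zip(scores_x, scores_y)]
    wins_x = sum(d > 0 for d in deltas)
    wins_y = sum(d < 0 for d in deltas)
    ties = len(deltas) - wins_x - wins_y
    print(f"System X: mean={mean(scores_x):.3f}  median={median(scores_x):.3f}")
    print(f"System Y: mean={mean(scores_y):.3f}  median={median(scores_y):.3f}")
    print(f"Segments where X > Y: {wins_x}, Y > X: {wins_y}, ties: {ties}")

# Example with made-up scores for five segments:
summarize([0.81, 0.77, 0.90, 0.62, 0.85], [0.79, 0.80, 0.88, 0.70, 0.85])
```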

When comparing systems in MT-Telescope, users can unlock deeper, more granular insights by applying filters based on specific features of the data, such as the presence of keywords and phrases (named entities and terminology). These keywords could include job titles, company names, product types, geographic locations, and so on. 

Users can also filter by the length of the translation (segment length) to compare the performance of the two MT systems on short vs. long segments. This is especially helpful when comparing a communication channel that is usually brief, like chat, with one that is typically more long-form, like email. 
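These two filter types can be thought of as simple predicates over the test suite. The sketch below shows hypothetical terminology and segment-length filters operating on (source, reference, system X, system Y) records like those built in the earlier loading sketch; MT-Telescope’s own filter implementation may differ.

```python
# Hypothetical filters over test-suite records shaped as
# (source, reference, system_x, system_y) tuples; MT-Telescope's own
# filter implementation may differ.

def filter_by_terms(records, terms):
    """Keep records whose source segment mentions any of the given keywords,
    e.g. product names, job titles, or geographic locations."""
    lowered = [t.lower() for t in terms]
    return [r for r in records if any(t in r[0].lower() for t in lowered)]

def filter_by_length(records, min_tokens=0, max_tokens=None):
    """Keep records whose source segment falls in a token-length range,
    e.g. short chat messages vs. longer email paragraphs."""
    kept = []
    for r in records:
        n = len(r[0].split())
        if n >= min_tokens and (max_tokens is None or n <= max_tokens):
            kept.append(r)
    return kept

# Example: restrict the comparison to short, chat-like segments.
# chat_like = filter_by_length(test_suite, max_tokens=15)
```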

The MT-Telescope tool is modular enough that users can easily run the analysis with other quality metrics in addition to our own COMET, such as Google’s BLEURT or Johns Hopkins’ Prism.
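In practice, that modularity amounts to hiding each metric behind a common segment-level scoring interface. The sketch below is a simplified illustration of the idea, not MT-Telescope’s actual class hierarchy.

```python
# Simplified illustration of a pluggable metric interface: any metric that can
# return one score per segment (COMET, BLEURT, Prism, ...) can drive the same
# comparison. This is not MT-Telescope's actual class hierarchy.
from abc import ABC, abstractmethod

class SegmentLevelMetric(ABC):
    name: str

    @abstractmethod
    def score(self, sources: list[str], hypotheses: list[str],
              references: list[str]) -> list[float]:
        """Return one quality score per segment."""

def compare_systems(metric: SegmentLevelMetric, sources, references,
                    system_x, system_y):
    """Score both systems with the same metric so the results are comparable."""
    return (
        metric.score(sources, system_x, references),
        metric.score(sources, system_y, references),
    )
```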

Combined, these features enable engineers to perform richer, faster comparisons between MT systems so they can make more confident decisions about the best system to deploy. 

What does the introduction of MT-Telescope mean for you?

Unbabel is currently using MT-Telescope to help our LangOps specialists and MT developers evaluate and make deployment decisions for our own machine translation systems. That means our customers and future customers will continue to benefit from a powerful Language Operations platform that will only get better with time. 

MT-Telescope allows us to make smarter choices about which system delivers the best possible quality performance and thus the best possible customer experience. As we continue to push boundaries on new frameworks and tools at the forefront of machine translation and MT quality evaluation, our own solutions will only improve. That brings us closer to the goal of seamless multilingual conversations that are indistinguishable from communications written by a native speaker.

Our COMET metric is already being embraced by leading technology companies on the front lines of machine translation innovation. Our release of MT-Telescope as a complementary open-source MT evaluation tool is intended to serve as an extension of COMET, empowering researchers and engineers to develop their own cutting-edge machine translation models. 

This aligns with a common thread in Unbabel’s research: the academic spirit of sharing knowledge. In keeping with this “rising tide lifts all ships” mentality, we will continue to contribute our tools and best practices back to the machine translation community so that our advancements are able to benefit everyone who wants to help extend the limits of what this technology can do. 

A major advancement for multilingual communications 

MT-Telescope is a key milestone in our ability to assess the quality of machine translation systems, and we’re still in the early stages of determining the scope of everything it will help us achieve. We hope that our customers are just as excited as we are about taking machine translation quality and efficiency to the next level.


Head over here to learn more about how machine translation technology powers an organization’s Language Operations strategy for a better multilingual customer experience and accelerated international growth.

About the Author

Alon Lavie

Alon Lavie is the VP of Language Technologies at Unbabel, where he leads and manages the US AI lab based in Pittsburgh and provides strategic leadership for Unbabel's AI R&D teams company-wide.

From June 2015 to March 2019, Alon was a senior manager at Amazon, where he led and managed the Amazon Machine Translation R&D group in Pittsburgh. In 2009, he co-founded a technology start-up company named "Safaba Translation Solutions" and served as its Chairman of the Board, President, and CTO. Safaba developed automated translation solutions for large global enterprises that allowed them to migrate and maintain large volumes of content in all the languages of their markets. Safaba's approach focused on generating client-adapted, high-quality translations using machine-learning-based technology. In late June 2015, Safaba was acquired by Amazon.

For almost 20 years (1996-2015), Alon was a Research Professor at the Language Technologies Institute at Carnegie Mellon University, where he now continues to serve as an adjunct Consulting Professor. His main research interests and activities focus on machine translation adaptation approaches with and without human feedback, applied to both high-resource language pairs and low-resource and minority languages. Additional interests include automated metrics for MT evaluation (specifically, the METEOR and COMET metrics), translation quality estimation, and methods for multi-engine MT system combination. Alon has authored or co-authored over 120 peer-reviewed papers and publications (Google Scholar h-index of 45 and i10-index of 122).

Alon served as President of the International Association for Machine Translation (IAMT) (2013-2015). Prior to that, he was President of the Association for Machine Translation in the Americas (AMTA) (2008-2012) and was General Chair of the AMTA 2010 and 2012 conferences. He is also a member of the Association for Computational Linguistics (ACL), where he was president of SIGParse, ACL's special interest group on parsing (2008-2013). In August 2021, at the 18th biennial Machine Translation Summit conference, Alon was awarded the 2021 Makoto Nagao IAMT Award of Honour for his contributions to the field of machine translation.