Unbabel’s core mission is to deliver customer-specific translations that are as accurate as possible. To accomplish this, our technical strategies have continuously evolved alongside advances in machine translation (MT) technology.
We now achieve this with a new retrieval-based approach that leverages kNN-MT (k-Nearest Neighbor Machine Translation), which has several key advantages:
It does not require modifying the general MT model parameters;
It learns “on the fly”, i.e., it can leverage new data as soon as it becomes available, continually refining and improving translation quality;
It is more interpretable and controllable.
Comparison to previous methods
When it comes to domain adaptation, the most commonly used methods are fine-tuning the general MT model and performing parameter-efficient fine-tuning, e.g. by training adapter layers.
Fine-tuning involves adapting the model’s parameters using an in-domain parallel corpus, effectively retraining the entire model.
On the other hand, adapters are small layers that can be inserted between the pre-trained MT model’s layers. So, instead of modifying all the model parameters, one only needs to train the adapter layers, which is substantially faster.
Additionally, while fine-tuning the whole MT model means maintaining a separate model for each domain, with adapters we can keep a single model plus a small set of adapters per domain, which is far more memory-efficient.
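To make the adapter idea concrete, here is a minimal sketch of a bottleneck adapter in PyTorch. It follows the common recipe (down-projection, non-linearity, up-projection, residual connection); the dimensions and placement are illustrative assumptions, not our production configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck module inserted between the frozen layers of a
    pre-trained MT model; only these parameters are trained."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)    # project back up

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the pre-trained representation,
        # so an untrained adapter starts out close to the identity.
        return hidden + self.up(self.act(self.down(hidden)))
```

Because only the adapter parameters receive gradients, training is much faster, and each additional domain adds only a few megabytes of weights on top of the shared base model.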
However, both these approaches require retraining parameters every time the model has to be updated, which is time-consuming and causes a delay between new data becoming available and the deployment of a model that benefits from it.
In contrast, retrieval-augmented models such as kNN-MT do not require modifying the model’s parameters. Instead, kNN-MT combines a general MT model with a retrieval component which has access to a domain-specific datastore. This physical separation allows the system to be updated whenever new data becomes available. kNN-MT is also more interpretable, since it allows us to observe the words being retrieved at each step, and more controllable, since it allows balancing the importance assigned to the retrieval and parametric components. While it is usually slower than a vanilla MT system due to the overhead of performing retrieval, we recently proposed a caching strategy that makes this process very efficient.
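As a rough illustration of how such a datastore is built in the standard kNN-MT recipe, each target position of an in-domain parallel corpus contributes one key-value pair: the key is the decoder’s context representation, and the value is the ground-truth token that followed it. The force_decode helper below is hypothetical; any function that returns one decoder hidden state per target position would do.

```python
import numpy as np

def build_datastore(model, parallel_corpus):
    """Map decoder context representations (keys) to the ground-truth
    target tokens that followed them (values)."""
    keys, values = [], []
    for src, tgt in parallel_corpus:
        # Force-decode the reference translation to obtain one hidden
        # state per target position (hypothetical helper).
        hidden_states = model.force_decode(src, tgt)
        for state, token in zip(hidden_states, tgt):
            keys.append(state)    # context representation
            values.append(token)  # token that followed this context
    return np.stack(keys), np.array(values)
```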
Figure 1: kNN-MT
In kNN-MT, to generate each word of the translation (see the sketch after this list):
The retrieval component searches for ground truth words which follow contexts similar to the current one;
Then, the model computes a probability distribution (over all the vocabulary words), based on the retrieved words;
This retrieval distribution is then interpolated with the distribution output by the general MT model, to obtain the final distribution;
The next word is generated based on the distribution output by kNN-MT.
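The sketch below puts these steps together for a single decoding step, assuming a datastore like the one above. The exact-search retrieval, softmax temperature, and interpolation weight lam are illustrative choices in the spirit of standard kNN-MT, not the precise configuration from our paper.

```python
import numpy as np

def knn_mt_step(hidden_state, keys, values, model_probs,
                k=8, temperature=10.0, lam=0.5):
    """One kNN-MT decoding step: retrieve neighbors, build the retrieval
    distribution, and interpolate it with the base model's distribution."""
    # 1) Retrieve the k nearest datastore entries (exact search here;
    #    an approximate index such as FAISS would be used in practice).
    dists = np.sum((keys - hidden_state) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]

    # 2) Turn negative distances into a distribution over the retrieved
    #    tokens, accumulated over the full vocabulary.
    scores = np.exp(-dists[nearest] / temperature)
    scores /= scores.sum()
    knn_probs = np.zeros_like(model_probs)
    for token, score in zip(values[nearest], scores):
        knn_probs[token] += score

    # 3) Interpolate the two distributions and pick the next word.
    final_probs = lam * knn_probs + (1.0 - lam) * model_probs
    return int(np.argmax(final_probs))
```

The weight lam is what makes the method controllable: raising it shifts probability mass toward the domain-specific datastore, while lowering it falls back on the general MT model.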
EAMT 2023 paper
Our paper, “Empirical Assessment of kNN-MT for Real-World Translation Scenarios,” to be presented at EAMT 2023, studies the use of kNN-MT for domain adaptation and compares it with using a generic MT model (base model), fine-tuning, and training adapter layers.
We use five proprietary datasets across four language pairs (En-Ko, En-Tr, En-De, and En-Fr) and three domains (media descriptions, press releases, and customer service), and evaluate the different models using COMET-20, BLEU, and human evaluation with MQM (Multidimensional Quality Metrics).
Figure 2: COMET scores
Figure 3: MQM (Multidimensional Quality Metrics) scores
Figure 2 shows that kNN-MT yields promising results, with significant improvements over the base model on all datasets. Although it falls slightly short of (expensive) full fine-tuning according to automatic metrics, it comes close, and the best performance is achieved by combining fine-tuning with kNN-MT. The benefits of kNN-MT are even more striking under human evaluation: in Figure 3, the MQM assessments indicate that fine-tuning and kNN-MT perform similarly, and that both significantly enhance translation performance compared to the base model.
Figure 4: Left: COMET score while varying the datastore size. Right: COMET score while varying the number of entries used to train the datastore index.
In our paper, we also delved into the influence of datastore size (Figure 4, left) and the number of entries required to create a datastore effectively (Figure 4, right).
Our analysis reveals that larger datastores lead to larger improvements, but even small datastores bring a substantial increase in translation quality over the base model.
Furthermore, our findings indicate that a relatively small number of entries is sufficient to achieve the best COMET scores, suggesting that a datastore can be created and its index trained with a small number of examples. More entries can then be added as data becomes available, making the approach both scalable and adaptable to varying data requirements.
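As an illustration of that workflow with FAISS (the index type and parameters here are assumptions, not necessarily our setup), an inverted-file index only needs a sample of keys for training, after which entries can be appended at any time:

```python
import faiss
import numpy as np

d = 1024       # dimensionality of the context representations (assumed)
nlist = 256    # number of inverted-list clusters (assumed)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

# Train the index on a relatively small sample of keys ...
training_keys = np.random.rand(10000, d).astype("float32")  # placeholder data
index.train(training_keys)
index.add(training_keys)

# ... then append new entries incrementally as in-domain data arrives,
# without retraining the index or touching the MT model.
new_keys = np.random.rand(5000, d).astype("float32")        # placeholder data
index.add(new_keys)

# At decoding time, retrieve the k nearest neighbors of a query context.
distances, neighbor_ids = index.search(new_keys[:1], k=8)
```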
This work will be presented on June 12 at the Conference of the European Association for Machine Translation in Tampere, Finland. Stay tuned!