The art of speeding up NMT with SYSTRAN 2nd generation engines

Machine Translation users care about quality and performance. Based on our own observations and the feedback we’ve received, the quality of our Neural MT is impressive. Evaluating performance is a stickier subject, but we’d like to dig in and present our innovations and achievements and how they benefit NMT users.

By performance, we mostly mean how a system performs in terms of speed and efficiency in varying production environments. It is important to note that performance and quality in Neural MT are tightly connected: it is easy to accelerate a given model by compromising on quality. Therefore, when evaluating performance improvements, we always check that quality remains very close to optimal.

Since switching to NMT at the end of 2016, we’ve invested our R&D efforts in optimizing our engines to be more efficient while maintaining and even improving translation accuracy. Our 2nd generation NMT engines, available in the latest release of SYSTRAN Pure Neural® Server, implement several technical optimizations that make translation faster and more efficient.

New model architecture

The first generation of neural translation engines was based on recurrent neural networks (RNN). This architecture requires the source text to be encoded sequentially, word by word, before generating the translation.

In contrast, the latest engines are based on a self-attentional network called the Transformer: each source word can attend independently to any position in the source text, allowing the encoding to be computed in parallel and more efficiently.
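To make the contrast with sequential encoding concrete, here is a minimal sketch of self-attention in plain Python. It is a toy illustration, not the engine's implementation: the function names are ours, and queries, keys and values are simply the input vectors themselves, whereas a real Transformer applies learned projections and several attention heads.

```python
import math


def attend(query, vectors):
    """Return the attention output for one position.

    Toy assumption: the input vectors serve directly as keys and values.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    # Score the query against every position in the source text.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    # Softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is a weighted average of all positions.
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def self_attention(vectors):
    """Encode every position; no iteration depends on a previous one."""
    return [attend(query, vectors) for query in vectors]


if __name__ == "__main__":
    source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for position, encoded in enumerate(self_attention(source)):
        print(position, encoded)
```

Because each call to `attend` reads only the input vectors, the positions could be computed in any order or all at once. A recurrent encoder, by contrast, needs the state from word t-1 before it can process word t.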

For both architectures, the target generation is an iterative process that starts from this source encoding. However, several caching techniques were implemented to make this decoding as fast as possible for the architecture currently used in NMT 2nd generation engines.
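The text does not say which caching techniques are used. One technique commonly applied to Transformer decoders is to store the keys and values computed at earlier steps instead of recomputing them at every step. The sketch below illustrates that idea only; `project` is a hypothetical stand-in for learned projections, and the decoder states are supplied as a fixed list for simplicity.

```python
import math


def project(vector, factor):
    """Hypothetical stand-in for a learned linear projection."""
    return [factor * x for x in vector]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]


def decode_without_cache(states):
    """Recompute keys and values for all earlier steps at every step."""
    outputs = []
    for t in range(len(states)):
        keys = [project(s, 0.5) for s in states[: t + 1]]
        values = [project(s, 2.0) for s in states[: t + 1]]
        outputs.append(attend(states[t], keys, values))
    return outputs


def decode_with_cache(states):
    """Project each step once and reuse the stored keys and values."""
    key_cache, value_cache, outputs = [], [], []
    for state in states:
        key_cache.append(project(state, 0.5))
        value_cache.append(project(state, 2.0))
        outputs.append(attend(state, key_cache, value_cache))
    return outputs


if __name__ == "__main__":
    states = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.9]]
    print(decode_with_cache(states) == decode_without_cache(states))
```

Both functions return the same outputs, but the cached version performs one projection per step, while the uncached version repeats every earlier projection at each step.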

Technical improvements

On the technology side, we have completely re-implemented the inference engine in optimized C++ dedicated to OpenNMT models. It offers a lightweight, embeddable, and customizable solution to run and accelerate OpenNMT models for production needs. In particular, we integrated powerful computational libraries such as Intel© MKL to get the most out of modern CPUs.

To further improve execution efficiency, this new implementation also introduced new features. For instance, most of our 2nd generation NMT models are run in reduced 16-bit precision instead of 32 bits. Model quantization is a way to decrease overall memory usage and make better use of CPU vectorization while having a very small impact on quality. Another feature is a target vocabulary prediction module that restricts inference to a selected vocabulary, significantly reducing the amount of computation needed to generate a sentence and thus boosting translation throughput.
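The two ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the engine's actual code: we assume 16-bit integer quantization with a single scale factor per weight list, and the function names and the candidate list passed to `best_token` are hypothetical.

```python
def quantize_int16(weights):
    """Map floats to signed 16-bit integers using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = 32767 / max_abs if max_abs else 1.0
    return [round(w * scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q / scale for q in quantized]


def best_token(hidden, output_weights, candidates=None):
    """Return the highest-scoring token id.

    When `candidates` is given, only those rows of the output layer are
    scored, which is where the saving in computation comes from.
    """
    ids = range(len(output_weights)) if candidates is None else candidates
    return max(ids, key=lambda i: sum(h * w for h, w in zip(hidden, output_weights[i])))


if __name__ == "__main__":
    weights = [0.5, -1.25, 0.003, 1.0]
    quantized, scale = quantize_int16(weights)
    print(quantized)
    print(dequantize(quantized, scale))

    hidden = [1.0, 2.0]
    output_weights = [[0.1, 0.1], [1.0, 1.0], [0.5, 0.2], [-1.0, 0.3]]
    print(best_token(hidden, output_weights))             # scores the full vocabulary
    print(best_token(hidden, output_weights, [0, 1, 2]))  # scores three candidates only
```

The round trip through `quantize_int16` and `dequantize` loses at most half a quantization step per weight, which is one way to see why the impact on quality can stay small. Restricting the vocabulary gives the same answer as the full search whenever the best token is among the candidates.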

2nd generation NMT engines rely heavily on multi-core architectures. We tuned the default settings to strike the right balance between latency and throughput by batching multiple sentences together and running translations in parallel. We also optimized translation speed for GPUs.
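As a rough sketch of batching combined with parallel execution, the Python example below groups sentences into batches and hands them to a pool of workers. `translate_batch` is a hypothetical placeholder for a real engine call, and the batch size and worker count are arbitrary illustrative values, not the tuned defaults mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(sentences, batch_size):
    """Split the input into consecutive groups of at most batch_size sentences."""
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]


def translate_batch(batch):
    """Hypothetical placeholder for an engine call that translates one batch."""
    return [sentence.upper() for sentence in batch]


def translate_all(sentences, batch_size=2, workers=2):
    """Translate batches in parallel and return results in input order."""
    batches = make_batches(sentences, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order the batches were submitted.
        results = list(pool.map(translate_batch, batches))
    return [translation for batch in results for translation in batch]


if __name__ == "__main__":
    print(translate_all(["hello", "good morning", "thank you"]))
```

Larger batches generally raise throughput because more sentences share each pass through the model, while smaller batches return the first result sooner. That is the latency versus throughput trade-off the default settings have to balance.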

A look at gains in performance

The leap in quality brought by Neural Machine Translation comes at a cost: translation tasks run slower than with previous MT technologies.

However, our latest generation, which uses the Transformer architecture, far outperforms our earlier generation of NMT engines, which was equivalent to our competitors’ engines in terms of performance. To set a benchmark, we compared translation speeds in words per second in CPU mode and GPU mode.

Translation is 15 to 30 times faster with 2nd generation engines than with first generation NMT engines.
Given these promising results and the fact that algorithms are continually evolving, we expect to see further gains in performance with subsequent generations. Many additional techniques exist to improve translation performance, and some are discussed in the research paper that we published for ACL 2018.

The bottom line

Our 2nd generation NMT engines give users access to ‘real time’ features with the quality of NMT, which improves the use of features such as web page and text translation. These latest engines can also handle larger volumes more efficiently for purposes such as e-discovery or big data. In terms of ROI, the new engines also have a positive impact on hardware requirements, with savings of 25 to 35% on hosting costs.

Contact us to learn more!