Since 2016, there has been a sharp increase in open source machine translation projects based on neural networks, a technology known as Neural Machine Translation (NMT), led by companies such as Google, Facebook and SYSTRAN. Why have machine translation and NMT-related innovations become the new Holy Grail for tech companies? And does the future of these companies rely on machine translation?
Never before has a technological field undergone so much disruption in such a short time. Born in research labs in the 1950s, machine translation was based on grammatical and syntactical rules until around 2007. Statistical modelling (known as statistical machine translation, or SMT), which matured largely thanks to the abundance of available data, then took over. Although statistical translation was introduced by IBM in the 1990s, it took some 15 years for the technology to reach mass adoption. Neural Machine Translation, on the other hand, was widely adopted by the industry only two years after being introduced by academia in 2014, showing how much innovation has accelerated in this field. Machine translation is currently experiencing a technological golden age.
From Big Data to Good Data
Not only have these successive waves of technology differed in their pace of development and adoption; their key strengths, or “core values,” have also changed. In rule-based translation, value came from code and accumulated linguistic resources. For statistical models, the amount of data was paramount: the more data you had, the better your translation quality and the higher your BLEU score (Bilingual Evaluation Understudy, the most widely used metric of machine translation quality). Now, the move to machine translation based on neural networks and Deep Learning is well underway and has brought major changes. The engines are trained to learn language the way a child does, progressing step by step. The challenge is no longer just to process ever-growing volumes of data (Big Data) but, more importantly, to feed the engines the highest-quality data possible. Hence the interest in “Good Data.”
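To make the metric concrete, here is a minimal sketch of how a BLEU score can be computed, assuming the open source sacrebleu package; the sentences are illustrative placeholders, not output from any real engine.

```python
# Minimal sketch: scoring a hypothetical system output with BLEU.
# Assumes the third-party sacrebleu package (pip install sacrebleu);
# the sentences below are illustrative, not real system output.
import sacrebleu

hypotheses = [
    "The cat sits on the mat.",
    "Machine translation is improving quickly.",
]
references = [
    [  # one reference stream, aligned with the hypotheses
        "The cat is sitting on the mat.",
        "Machine translation is improving rapidly.",
    ]
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # corpus-level score on a 0-100 scale
```

The score rewards n-gram overlap with reference translations, which is precisely why the quantity and quality of the data behind those references matter so much.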
NMT: The Open Source revolution
Open source is another revolution that has changed the paradigm for developing Neural Machine Translation technology. Over the last two years, two new open source projects for neural translation have been launched every month. What is even more impressive is that many of the actors behind these projects come from the private sector. The three most active projects today are maintained by Google, Facebook and SYSTRAN, the latter in collaboration with Harvard NLP on the OpenNMT project. What is most surprising is that major tech players like Google, Amazon and Salesforce did not previously have an active open source culture. One may then wonder why they are taking such an interest in open source now.
An evolving technology based on the human model
In just 14 months, neural translation has undergone three major paradigm shifts in the underlying technology. The first models used RNNs (recurrent neural networks). Then, following research conducted by Facebook, the field shifted to CNNs (convolutional neural networks). Now, self-attention-based Transformers (SATs), models initiated by Google, are the most widely used. RNN models processed the translation word by word. CNN models took a broader view, looking at sequences of words. Current attention-based approaches, the SATs, can “look” at several parts of the sentence simultaneously, identifying the words that most strongly affect its understanding and translation. We are therefore moving closer to a human-like approach. Facebook now uses neural translation for 100% of its content, up from 50% in 2017. It is estimated that more than 6 billion translations are processed online every day. Clearly, investing in neural translation is a strategic move.
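To illustrate what “looking at several parts of the sentence simultaneously” means mechanically, here is a minimal NumPy sketch of scaled dot-product self-attention, the building block of Transformer models; the shapes and random values are toy assumptions, not taken from any production system.

```python
# Illustrative sketch of scaled dot-product self-attention (the building
# block of Transformer models), using NumPy only. Shapes and values are
# toy examples, not taken from any real NMT system.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) word representations for one sentence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every word compared to every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sentence
    return weights @ V                         # weighted mix of all positions

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                        # a 5-word toy "sentence"
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8): one vector per word
```

Each row of the attention weights is a distribution over the whole sentence, so every word’s new representation is shaped by the words that matter most to it, rather than only by its immediate neighbours.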
An Open Source race that masks fierce competition
An open source project is inherently fragile: launching a new open source technology is rather easy, but maintaining it, making it evolve and growing an active community around it is much harder. SYSTRAN invests a lot of time supporting users in its OpenNMT community: sharing data, analyzing feedback, updating algorithms, and ensuring the stability and compatibility of the technology. So why are these major tech players putting so much effort into open source projects that demand significant investment, both financially and in terms of resources? The struggle goes beyond simply imposing a specific tool. For users, neural translation is becoming a commodity, like running water or electricity. Because of its very low cost, it is likely to become a built-in feature of most everyday applications. The core NMT value, however, lies in the infrastructure and in the additional services that will support this new standard, whether connectivity and integration or the training of engines for very specific fields of activity to deliver tailor-made translation quality.
The next step: converging the efforts of major players
For industry-wide standardization, the next step is interoperability, which will allow NMT tools to be carried from one platform to another. To that end, ONNX (Open Neural Network Exchange), a standardization project led by Facebook, Microsoft and Amazon, makes neural networks interoperable: a model trained with one tool can be converted for use with others, making neural networks available on mobile devices regardless of the framework they were originally trained in. This standardization and openness around neural translation also encourages the development of related applications in a healthy spirit of “coopetition,” as evidenced by an impressive number of recent developments, such as ultra-intelligent virtual assistants and unsupervised Machine Learning for less strategic translations (such as subtitling). Adding the notion of context to NMT, so that the algorithm can process an entire paragraph or even a whole document, is also strategic.
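As a hedged illustration of what this interoperability looks like in practice, the sketch below exports a small PyTorch model to the ONNX format so it can be run by other runtimes; the tiny model is a stand-in, not an actual translation engine.

```python
# Minimal sketch: exporting a PyTorch model to ONNX so it can be loaded
# by other runtimes (e.g. ONNX Runtime). The tiny model below is a
# placeholder, not a real NMT engine.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
model.eval()

dummy_input = torch.randn(1, 16)  # example input defining the graph's shape
torch.onnx.export(
    model,
    dummy_input,
    "tiny_model.onnx",            # portable file usable outside PyTorch
    input_names=["input"],
    output_names=["output"],
)
```

Once exported, the same file can be loaded by other ONNX-compatible runtimes or converted for mobile targets, which is exactly the portability the ONNX project is designed to provide.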
The NMT Open Source battle is just getting started!