The minds behind SYSTRAN sit down for an interview regarding the complexities and the capacities of specialized neural machine translation engines.
Participants: Peter Zoldan, Senior Data Engineer – Software Engineer Linguistic Program; Svetlana Zyrianova, Linguistic Program; Petra Bayrami, Jr. Software Engineer – Linguistic Program; Natalia Segal, R&D Engineer.
How much data is required to create a specialized engine?
The more bilingual data, the better the quality. For broad domains such as news, millions of bilingual sentences will be required. However, if the domain is narrow, such as technical support documents for certain products, then even a small set of 50,000 sentences noticeably improves the quality.
The amount of data required depends on how broad or narrow the domain is that you are specializing the engine for.
Language is messy. Ask any person who has ever had to learn a second language and they will tell you that the most difficult aspect isn’t learning all the rules, but understanding the exceptions to the rules — the real-world application of the language.
e-Discovery can be a long, daunting process even in the best of times. In today’s globalized world of data, however, you not only have to worry about the sheer amount of information but also what language the content is in. This is where Neural Machine Translation comes in to break that language barrier. As fast as NMT is, though, odds are you have dreamed about how to make your systems even more efficient. How do you ensure any job can get completed on even the most ambitious of timelines?
When it comes to protecting classified data, blackout redaction has been in use for at least a century. While it is not the only acceptable form of data sanitization, it is the oldest and the one most commonly used by eDiscovery firms. This is despite the fact that there are more modern and easy-to-use alternatives that save time and reduce errors. The two main data sanitization alternatives that meet legal requirements are anonymization and pseudonymization.
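To make the difference between the three approaches concrete, here is a minimal sketch in Python. It is an illustration only, not SYSTRAN's implementation: the function names, the placeholder formats, and the simplified e-mail pattern are all assumptions, and a real sanitization tool would cover far more than e-mail addresses. Redaction blacks the value out, anonymization replaces it with a generic label that cannot be traced back, and pseudonymization replaces it with a consistent token whose key is stored separately.

```python
import re

# Simplified e-mail pattern, used here only as an example of sensitive data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text):
    """Blackout redaction: the value is simply covered."""
    return EMAIL.sub("█████", text)


def anonymize(text):
    """Anonymization: every value becomes the same generic label."""
    return EMAIL.sub("[EMAIL]", text)


def pseudonymize(text, mapping):
    """Pseudonymization: each distinct value gets a stable token.

    `mapping` (hypothetical name) is the re-identification key and
    would be stored separately from the sanitized document.
    """
    def replace(match):
        value = match.group(0)
        if value not in mapping:
            mapping[value] = f"EMAIL_{len(mapping) + 1}"
        return mapping[value]

    return EMAIL.sub(replace, text)
```

With pseudonymization, a reviewer can still see that two mentions refer to the same person, which blackout redaction and plain anonymization both hide.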
As noted by Anju Khurana, Head of Privacy of the Americas, Bank of New York Mellon, “There are now over 100+ privacy laws in the world and GDPR is driving other countries to adopt similar regulations.” (corpcounsel.com, Oct. 2019). The California Consumer Privacy Act (“CCPA”), which comes into effect on January 1, 2020, is the latest, and very likely not the last. Most data privacy experts anticipate additional states enacting data privacy regulations and think it likely that Congress will eventually do so at the federal level.
SYSTRAN has been wholeheartedly involved in open source development over the past few years via the OpenNMT initiative, whose goal is to build a ready-to-use, fully inclusive, industry- and research-ready development framework for Neural Machine Translation (NMT). OpenNMT ensures that state-of-the-art systems are integrated into SYSTRAN products and motivates us to continuously innovate.
In 2017, we published OpenNMT-tf, an open source toolkit for neural machine translation. This project is integrated into SYSTRAN’s model training architecture and plays a key role in the production of the 2nd generation of NMT engines.
Machine Translation users care about quality and performance. Based on our own observations and the feedback we’ve received, the quality of our Neural MT is impressive. Evaluating performance is a stickier subject, but we’d like to dig in and present our innovations and achievements and how they benefit NMT users.
By performance we mostly mean how a system behaves in terms of speed and efficiency in varying production environments. It is important to note that performance and quality in Neural MT are tightly connected: it is easy to accelerate a given model by compromising on quality. Therefore, when evaluating a performance improvement, we always check that quality remains very close to optimal.
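The check described above can be sketched as a simple filter: a faster configuration is only kept if its quality stays within a small tolerance of the baseline. This is a hypothetical illustration, not SYSTRAN's evaluation pipeline; the names `Measurement` and `acceptable_speedups`, the choice of sentences per second as the speed metric, and the default tolerance of 0.5 quality points are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    name: str
    sentences_per_second: float  # speed; higher is better
    quality_score: float         # e.g. a BLEU-style score; higher is better


def acceptable_speedups(baseline, candidates, max_quality_drop=0.5):
    """Return candidates that are faster than the baseline and whose
    quality loss is at most `max_quality_drop`, fastest first."""
    kept = [
        c for c in candidates
        if c.sentences_per_second > baseline.sentences_per_second
        and baseline.quality_score - c.quality_score <= max_quality_drop
    ]
    return sorted(kept, key=lambda c: c.sentences_per_second, reverse=True)
```

A configuration that quadruples speed but loses several quality points is rejected by this filter, while a more modest speed-up with a negligible quality loss is kept.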
Since switching to NMT at the end of 2016, we’ve invested our R&D efforts into optimizing our engines to be more efficient, while maintaining and even improving translation accuracy. Our latest, 2nd generation NMT engines, available in our latest release of SYSTRAN Pure Neural® Server, implement several technical optimizations that make translation faster and more efficient.
New model architecture
The first generation of neural translation engines was based on recurrent neural networks (RNN). This architecture requires the source text to be encoded sequentially, word by word, before generating the translation.
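Why a recurrent encoder has to read the source word by word can be seen in a toy sketch. The code below is a deliberately tiny, pure-Python recurrent cell, not the architecture of any SYSTRAN engine: the function name `rnn_encode`, the weight names `w_in` and `w_rec`, and the plain tanh cell are assumptions chosen for illustration, and production RNN systems typically use gated cells and a tensor library.

```python
import math


def rnn_encode(embeddings, w_in, w_rec):
    """Encode a sentence with a minimal recurrent cell.

    embeddings: one vector (list of floats) per source word
    w_in:       input weights, shape [hidden_size][embedding_size]
    w_rec:      recurrent weights, shape [hidden_size][hidden_size]
    Returns the hidden state after each word.
    """
    hidden = [0.0] * len(w_rec)
    states = []
    for x in embeddings:  # strictly one word after another
        # The new state depends on the previous state, so step t
        # cannot start before step t-1 has finished.
        hidden = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(len(x)))
                + sum(w_rec[i][k] * hidden[k] for k in range(len(hidden)))
            )
            for i in range(len(w_rec))
        ]
        states.append(hidden)
    return states
```

Because each state feeds into the next, the loop cannot be spread across the words of a sentence in parallel, which is the constraint the paragraph above refers to.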
Data leakage and lack of information are two critical issues that can harm businesses. Nonetheless, due to ever-growing global marketing and communication needs, the temptation to use fast and free online translation tools is rising.
Apart from the apparent dangers that these tools pose to businesses, such as miscommunication, loss of business, and cultural insults, there is a critical threat that many enterprises often fail to recognize.
Whenever an employee uses a free online translation tool, they may cause a massive data privacy breach by making consumer data searchable. Such breaches mainly happen through the negligence of employees looking for quick machine translation, and they can leave millions of customers’ sensitive data exposed on the internet.
Companies thus struggle to find the right balance between enabling business and securing information. Without access to translation software of their own, potentially hundreds, if not thousands, of employees could turn to free translation tools to get their content translated, in turn making that content available online.
Last week we hosted the 2018 edition of SYSTRAN Community Day! The conference was an exciting day full of energy, from Jean Senellart’s opening speech to our client success stories and the celebration of SYSTRAN’s 50th anniversary! Here is a quick look at the conference highlights:
Jean Senellart announces the launch of a marketplace connecting the expertise of neural model trainers with the needs of industrial MT users
Jean Senellart, CEO of SYSTRAN France and CTO of the group, opened the conference with a bold statement: the high quality of Neural Machine Translation has “commoditized” Machine Translation. As a commodity, an NMT framework provides raw technology that needs to be refined, adapted and integrated for any industrial usage. After a look at the available NMT open source frameworks, including OpenNMT, cofounded and actively maintained by SYSTRAN, he made clear that streamlined training processes and data quality are the most crucial points for industrializing high quality neural machine translation.
Jean concluded his talk with the announcement of the SYSTRAN marketplace, an open online platform where language experts have access to best-of-breed technology and frameworks to build, share, and sell language or domain models that can be accessed by industrial users. Those users will be able to select among hundreds of available models for any language pair and share feedback or evolution requests according to their specific needs.
The latest version of our AI-powered Translation Software designed for Businesses
SYSTRAN Pure Neural® Server is our new generation of enterprise translation software based on Artificial Intelligence and Neural networks. It provides outstanding professional quality with the highest standards in data safety.
Our R&D team, which works actively to provide corporate users with state-of-the-art translation technology tailored for business, has just released a new generation of Neural MT engines. SYSTRAN’s new engines are developed with OpenNMT-tf, our AI framework using the latest TensorFlow features, and backed by a proprietary new training process: Infinite Training.