The minds behind SYSTRAN sit down for an interview regarding the complexities and the capacities of specialized neural machine translation engines.
Participants: Peter Zoldan, Senior Data Engineer -Software Engineer Linguistic Program, Svetlana Zyrianova, Linguistic Program, Petra Bayrami, Jr. Software Engineer – Linguistic Program, Natalia Segal, R&D Engineer.
How much data is required to create a specialized engine?
The more bilingual data, the better the quality. For broad domains such as news, millions of bilingual sentences will be required. However, if the domain is narrow, such as technical support documents for certain products, then even a small set of 50,000 sentences noticeably improves the quality.
The amount of data required will depend on how broad or narrow the domain you are specializing the engine for is.
When you select data for a specialized engine, what type of data is best?
It is important that the data selected covers the actual breadth of the data that will be used for the translation. You have to make sure that you consider factors like domain and style. For example, if you want to translate customer emails about your product but don't have such emails to train on, product marketing data would help with the domain and the type of vocabulary needed for the translation, but not with the style. So, you would want to use emails, even if they are not specifically about the product, in order to learn the style.
It is also important to look at your workflow to see if it includes any HTML tags: what your data looks like now, and what the data is going to look like when you do the translation. So, we will put your training data through the same workflow that will be used at translation time.
Describe one of the more challenging specializations you've worked on.
The most challenging specializations for me arise when the style and domain are vastly different from what I trained on. So, for instance, I had to work on a specialization which involved a lot of sentence fragments with very descriptive words. Training data is usually news or otherwise consists of complete sentences. So, it was a bit tricky to work with the fragments that we were receiving.
For this reason, we had to artificially create a lot of data containing these sentence fragments to make them more familiar to the model.
I also once worked on a project that involved a lot of code, so not a natural language. It had to do with the computer program indicating the color, font style, etc. It was different from the natural language flow we were used to. It was definitely still a challenge I had to overcome.
Are there any languages that are harder to specialize than others? Can you give examples?
With neural networks, a lot of quality is based on data. If you have low-resource languages like Latin or Tagalog, it is going to be more difficult compared to Spanish, for example, where there is already a lot of data that has been translated from English to Spanish.
Some languages might have a lot of data, but it is all in one domain. For example, we might just have open subtitles. It becomes difficult to specialize the model because it only knows how to do that one domain. It doesn’t have a very broad base to work with.
Also, sometimes the languages in a pair can be very different, such as English and Japanese, which have very different origins. The word styles are very different. So, the pair is harder to learn compared to Spanish or French, which have Latin or European origin.
As for individual languages, every language has its own quirks. For example, with Chinese we have to deal with the spacing, since words are not separated by spaces, which can be tricky. Russian has a lot of different cases and its morphology is very complex. Turkish is agglutinative, which can be tricky as well. German has compound nouns and verb splitting. Every language has its own complexities that can make it more challenging to specialize.
What are some tips for making sure your specialized models are not biased (for example, to a gender, idea)?
Social bias is a problem for all artificial intelligence applications. The bias comes from the data itself. Our whole society is still a little bit biased, so the data society produces can be biased as well. At SYSTRAN, we work on various pre-processing and data cleaning algorithms to reduce this bias in training data and this, in turn, reduces the bias in the models.
How is specializing an AI-based engine different from the older statistical engine technology? How is the AI-based technology making the quality of specialized engines better than the older technology?
Statistical engines were able to combine the most likely translation for a sequence of words with the most likely output for that language. However, to do that, the engine has to have actually seen that translation. Specifics are based on what's available.
Neural networking goes deeper. It can create sequences that were never seen before. It can learn the attributes of languages. Embedded in the neural network are attributes like gender, action, parts of speech, etc. Statistical methods don’t have this capability.
The neural network can also keep track of things like long-distance dependencies, it can coordinate gender, and it can create intelligence and put that intelligence in the translation.
Can you describe a success story working with a client on a specialized engine?
We had a client who wanted several different specialized engines for different domains. We had a good solid general model we were able to work with, but we also were able to use the data from each of their several domains to generalize it for their company and further specialize it for each of those individual domains.
They also had very specific patterns that needed to be protected: coding, exclusive to their program, that would interrupt the natural language flow. We were able to isolate those special codes so that the neural networks could still focus on the natural flow of the language while preserving the codes that were embedded in the natural language. It was a proof of concept, and so far we have done just one language pair. They liked it and now we are tasked with doing several more language pairs.
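One common way to isolate embedded codes is to swap them for opaque placeholders before translation and restore them afterwards. The following is a minimal sketch of that idea, not the method used in the project described above: the code pattern (curly-brace variables and angle-bracket tags), the `__PHn__` placeholder format, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: treats {variables} and <tags> as protected codes.
CODE_PATTERN = re.compile(r"\{[^{}]*\}|<[^<>]+>")
PLACEHOLDER = re.compile(r"__PH(\d+)__")


def protect(text):
    """Replace each embedded code with a numbered placeholder."""
    codes = []

    def stash(match):
        codes.append(match.group(0))
        return f"__PH{len(codes) - 1}__"

    return CODE_PATTERN.sub(stash, text), codes


def restore(text, codes):
    """Put the original codes back, wherever the placeholders ended up."""
    return PLACEHOLDER.sub(lambda m: codes[int(m.group(1))], text)


masked, codes = protect("Press {btn_ok} to <b>save</b>")
# masked is now "Press __PH0__ to __PH1__save__PH2__"; a translation engine
# would see only natural language plus placeholders. Here the "translation"
# is written by hand to show that placeholders may be reordered.
translated = "__PH1__guardar__PH2__ con __PH0__"
print(restore(translated, codes))
```

Because the placeholders carry an index, the codes can be restored correctly even when the target language changes the word order.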
Clients are now able to use SYSTRAN's tools to create their own specialized engines. What skill sets are required to do this?
At SYSTRAN, we are really working hard to make the tools for training and specializing neural machine translation models as user-friendly as possible. Our new marketplace solution is aiming to put neural machine translation within everyone's reach.
With regard to the skill set, I would say you need some basic computer literacy, some curiosity, and a general interest in language processing. There is a slight learning curve in getting accustomed to the tools. Last, you would need good data appropriate for your model to begin your specialization.
For our commercial clients, without naming them, what are examples of specialized engines we've worked on?
We have done a lot of work with the government – legal, electrical, technical support, gaming, marketing, information technology, mechanical business, banking. We have also specialized based on regions. So, we have done Brazilian Portuguese, European Portuguese, different varieties of Spanish, different varieties of French. In short, we have done quite a few specializations.
How do you know when an engine is ready to deploy? Is there a metric used?
Yes, we look at various metrics, but the common one we look at is the BLEU score. So, what we have is a test set which has source and target already translated by a human, and then we translate the source into the target with the engine. The metric automatically compares our translation with the human translation, looks for overlaps of words, and gives an overall BLEU score.
This is to handle things like synonyms or different ways of phrasing things. So, it cannot be thought of as a perfect measure but for an automatic score, it is pretty good. Usually, we follow the BLEU score through the training and watch as it increases. Once it converges, basically once it stops improving, that’s generally when we want to deploy the model.
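As a rough illustration of the overlap counting and of "watching until it converges", here is a simplified sketch. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so it is not a substitute for a standard BLEU implementation; `corpus_bleu`, `has_converged`, and the `patience` and `min_delta` thresholds are hypothetical names and values chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified BLEU (0-100): clipped n-gram precision with a brevity penalty."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clip each n-gram's count to the number of times the reference has it.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def has_converged(scores, patience=3, min_delta=0.1):
    """True once the last `patience` checkpoints fail to beat the earlier best."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) < best_before + min_delta


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
print(has_converged([20.0, 25.0, 28.0, 28.02, 28.01, 28.05]))
```

A different but perfectly valid phrasing of the same sentence would score lower here, which is one reason the automatic score is paired with human review.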
However, to really get a good sense of the model, you need human review. You need to actually go in and look at it to see what's happening. Sometimes things might happen that are not reflected in the BLEU score. A blind test is the best review, where the reviewer compares the quality of the output from two models without knowing which output is coming from which model. This ensures that the reviewer chooses the best model without any bias.
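A blind comparison can be set up by randomizing the order of the two outputs for each sentence and keeping the answer key away from the reviewer. The sketch below assumes the outputs of the two models are already available as lists; `build_blind_pairs`, `tally`, and the model labels "A" and "B" are hypothetical names used only for this example.

```python
import random


def build_blind_pairs(sources, outputs_a, outputs_b, seed=0):
    """Shuffle the two outputs per sentence; return review items and a hidden key."""
    rng = random.Random(seed)
    items, key = [], []
    for src, out_a, out_b in zip(sources, outputs_a, outputs_b):
        flipped = rng.random() < 0.5
        first, second = (out_b, out_a) if flipped else (out_a, out_b)
        items.append({"source": src, "option_1": first, "option_2": second})
        key.append(("B", "A") if flipped else ("A", "B"))
    return items, key


def tally(choices, key):
    """Unblind the reviewer's choices (1 or 2 per item) and count wins per model."""
    wins = {"A": 0, "B": 0}
    for choice, labels in zip(choices, key):
        wins[labels[choice - 1]] += 1
    return wins


items, key = build_blind_pairs(
    ["Hello", "Thank you"],
    ["Hola", "Gracias"],          # outputs of model A
    ["Buenos días", "Gracia"],    # outputs of model B
)
# The reviewer sees only `items` and records 1 or 2 for each one.
```

Only after all choices are recorded is `key` used to map them back to the models.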
What kind of specialization engines have already been completed for the government?
We have done models based on social media, science and technology. We also released Arabizi along with Arabic, as well as other generic models. Back in the statistical days, we also did work for cyber security, radar, and nuclear technology.
How can you maintain a specialized engine so that it remains relevant, especially for fields in which terminology quickly changes?
The neural network learns through the data you train it with. The more relevant the data is to the use case, and obviously, the more data you have, the better the results will be. Now, to update a specialized engine, you need to stay up to date with new terminology such as people's names or scientific discoveries. To maintain quality, you will need to add additional data.
This is where SYSTRAN's infinite training approach comes in handy. Here you can train an existing model by adding new data at the start of each epoch. This saves time, as retraining from scratch is not required.
Can a specialization be created for social media, where shortcuts are used, and some languages are Romanized?
Engines can be specialized for any domain that has some distinctive characteristics such as the lexicon or grammar use. Social media is one such domain, but it does put forward some unique challenges. The content is abundant, but it is almost impossible to obtain aligned bilingual data, which is necessary to train a specialized model.
The data which is generated by users with different backgrounds and language skills is not moderated. In other words, it is not checked or corrected. So, it will have incorrect grammar, spelling errors, and incorrect word usage.
Social media is also generated spontaneously and frequently with a sense of urgency. So, the use of shorthand and neologisms is quite common. The personal lexicon may also change based on trending events, which can happen very quickly.
Non-English social media content, for example in Arabic or Russian, can be and frequently is written using the Latin script. Whilst there are some Romanization conventions, there are variations in the actual transcription used.
The specialized social media engine needs to be robust to handle all variations and changes, as well as shorthand use, lexical innovations, and possible errors.
We at SYSTRAN have trained one specialized Arabic social media model, which can handle text in standard Arabic, dialectal Arabic, and Arabizi (romanized Arabic). So the model will translate text in both Arabic script and Latin script.
Starting from the best base model, we use various techniques to train a robust model that handles Arabic variation, such as dialectal Arabic and Romanized Arabic (Arabizi). These techniques include converting in-domain data to Arabizi using the transcription conventions generated by SYSTRAN linguists, back translation, and data modification. To keep the engine current, ongoing effort is required to keep abreast of all the changes.
When you work with a client's data to create a specialized engine as a professional service, how do you ensure the data remains secure?
Data security is very important to us. Client data is stored in an area separate from the other general training data. Also, each individual file is marked as restricted and is only accessible to persons involved in the data processing and training. Once the training is complete, per the customer’s preference, data is purged from our system.