At SYSTRAN, many jobs revolve around machine translation technology. Among them is the Computational Linguist.
This article walks through the major stages of training a machine translation engine, as seen through the work of a Computational Linguist.
#1 Make and prepare a corpus
Translation machines function like the human brain: they learn and feed themselves on what is available. They learn how to translate correctly and quickly by consulting texts that have already been translated.
To assist them in this learning, the Computational Linguist will build a “training corpus” – a set of already translated texts that combine the source and target languages.
There are two ways to obtain these texts:
- With data translated upstream by humans;
- With a monolingual corpus translated automatically (via back-translation) and then used in the opposite direction. For example, translate from English to French and then use the resulting corpus to train a French to English model.
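The back-translation option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_back_translated_corpus` and `toy_en_to_fr` are hypothetical names, and the toy lookup function stands in for a real English to French engine.

```python
from typing import Callable, Iterable, List, Tuple


def build_back_translated_corpus(
    target_sentences: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each human-written sentence with a machine-made source sentence.

    The human-written text ends up on the target side, so the model
    learns to produce genuine text from synthetic input.
    """
    pairs = []
    for target in target_sentences:
        synthetic_source = reverse_translate(target)
        pairs.append((synthetic_source, target))
    return pairs


def toy_en_to_fr(sentence: str) -> str:
    # Hypothetical stand-in for an existing English to French engine.
    lookup = {"Hello.": "Bonjour.", "Thank you.": "Merci."}
    return lookup.get(sentence, sentence)


# Monolingual English text becomes a French to English training corpus.
corpus = build_back_translated_corpus(["Hello.", "Thank you."], toy_en_to_fr)
```

Each pair is stored as (source, target), so the synthetic French appears first and the original English second.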
The main role of the Computational Linguist is thus to prepare the data and language resources in order to make them clean and readable by the brain of the machine.
Focus on SYSTRAN’s corpus base
Do you have little data at your disposal? You can buy a corpus from professional, qualified providers, or create your own from bilingual websites in the relevant languages.
In this case, the Computational Linguist will carry out a thorough quality check.
Check and guarantee the quality of the corpus
The choice and quality of the corpus are essential points that you should not neglect. It is from the corpus that the machine will learn to translate, whether the model is generic or highly specialized.
So, once the corpus is created, you need to check its cleanliness and quality: if the translation engine is trained on errors, your translations will have the same problems.
Some examples of problems to identify, filter and repair:
- incorrect encoding of accents (broken characters),
- misalignment (e.g. ‘today is December 13th’ in source language and ‘Aujourd’hui, nous sommes le 16 décembre’ in target language),
- the presence of a language other than that intended (for example, for an English-French corpus, an English sentence translated into Russian).
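The three problems above can be approximated with simple heuristics. The sketch below is illustrative, not a description of SYSTRAN's actual pipeline: all function names are hypothetical, the encoding check only looks for a few typical broken-character sequences, the alignment check only compares the numbers found on each side, and the language check only detects an unexpected script (here Cyrillic in an English-French corpus).

```python
import re
import unicodedata

# Typical debris left when UTF-8 text is decoded with the wrong encoding.
MOJIBAKE_MARKERS = ("Ã", "â€", "\ufffd")


def has_broken_encoding(text: str) -> bool:
    return any(marker in text for marker in MOJIBAKE_MARKERS)


def numbers_match(source: str, target: str) -> bool:
    # Aligned sentences should normally mention the same numbers.
    return sorted(re.findall(r"\d+", source)) == sorted(re.findall(r"\d+", target))


def contains_script(text: str, script_name: str) -> bool:
    # Unicode character names start with the script, e.g. "CYRILLIC SMALL LETTER A".
    return any(script_name in unicodedata.name(ch, "") for ch in text if ch.isalpha())


def keep_pair(source: str, target: str) -> bool:
    """Return True when an English-French pair passes all three checks."""
    if has_broken_encoding(source) or has_broken_encoding(target):
        return False
    if not numbers_match(source, target):
        return False
    if contains_script(target, "CYRILLIC"):
        return False
    return True
```

These checks are deliberately crude. For instance, a number written as a digit on one side and as a word on the other would be rejected, so a real filter would need language-aware rules and a manual review of what is discarded.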
Focus: anonymization & confidentiality of the corpus
A corpus may contain proper names and sensitive data. It is therefore essential to ensure the anonymization and confidentiality of all information contained therein.
Your corpus should be used to train the translation engine only. No confidential information may be present or used in the translation resource.
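As a rough illustration of one anonymization step, the sketch below masks e-mail addresses and phone-like numbers with placeholders before the text enters the corpus. The patterns and the `anonymize` function are assumptions made for this example; proper names are not handled here, since detecting them generally requires a named-entity recognition step.

```python
import re

# Simplified patterns, chosen for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text


masked = anonymize("Contact jane.doe@example.com or +33 1 23 45 67 89.")
```

The same placeholders should be applied to both the source and target sides of a sentence pair so that the pair stays aligned.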
Note: corpora are mostly open source. They are therefore governed by a license that sets out highly standardized conditions of use.
A rich and diverse corpus, a guarantee of quality
The richer your corpus is (a lot of vocabulary from different fields) and the more diverse it is (different sentence structures), the higher the quality of the machine translation will be.
In particular, the Computational Linguist ensures that punctuation rules and/or key phrases are included.
#2 Prepare language resources
The role of the Computational Linguist does not stop at the preparation of the corpus. They must also prepare various language resources that can be used by the translation model.
The use of “resources”
Language resources are procedures and other rules that enable the Computational Linguist to identify translation problems, whether they are recurring or one-off. These help to:
- feed the machine brain with good translated content;
- manage the specificities of a language (rules);
- translate text into a local variant of a language (British vs. American English);
- meet customer needs;
- clean the corpus before it is integrated.
In this sense, it is essential to apply rules and best practices when training the machine to solve problems and adjust the corpus. It is the role of the Computational Linguist to identify and create these resources.
The goal? Try to find the best parameters according to the specific characteristics of each language.
The importance of language resources
This approach makes it possible to discover the problems of certain languages and to manage them in the machine. Some languages have real peculiarities to take into account:
- How to segment sentences?
- What punctuation rules should apply?
- What are the key phrases?
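To make the segmentation question concrete, here is a minimal sketch of a rule-based sentence splitter for English. It is an assumption-laden example rather than an actual SYSTRAN resource: the `segment` function and the abbreviation list are hypothetical, and each language would need its own rules.

```python
from typing import List

# A language-specific rule: these tokens end with a period but do not end a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def segment(text: str) -> List[str]:
    """Split text into sentences on ., ! or ?, except after known abbreviations."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = (
            token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS
        )
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that has no final punctuation.
        sentences.append(" ".join(current))
    return sentences


result = segment("Dr. Smith arrived. He was late!")
```

Without the abbreviation rule, "Dr." would be treated as a complete sentence, which is exactly the kind of language-specific problem a resource is meant to handle.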
The little extra? Language resources are identified by a human actor, the Computational Linguist. They therefore have real added value, unlike online machine-based translation.
#3 Run the machine brain
Once the corpus is ready and cleaned, and the resources identified, the computational linguist can define the model parameters and what the engine will learn.
- Is it possible to offer translations with different tones of voice?
- Can the model be adapted to formal or familiar tone?
- Is it possible to apply a specific localization to translation?
The expert will incorporate all of this specific knowledge into the machine brain (a technology built by different teams, including development engineers). The machine will then be able to learn from the corpus and connect the source and target languages.
This is called training.
In general, this training is fed by 1 to 10 million lines of data and can be faster or slower depending on the volume of resources. The more information there is, the longer the machine runs to learn. The result is a reliable and accurate “translation resource.”
#4 Evaluate and iterate
A translation engine learns by looping. The role of the Computational Linguist is therefore to assess when the engine is “mature” – that is, when translations are accurate, natural, and fluid, and can be made available to users.
Has the model finished running? The translation resource created still needs to be evaluated. For this, a scoring system called the BLEU score is used to compare the machine translation of a test file with a human translation of the same file, produced beforehand.
The system evaluates the proximity between the machine translation and the human translation. The closer they are, the higher the score; the more they differ, the lower the score. Although this score has its limits, it provides a good basis for analyzing the performance of the resource created.
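The idea behind the score can be illustrated with a simplified, sentence-level version in Python. This sketch is an assumption for teaching purposes: it uses whitespace tokenization, a single reference, and no smoothing, whereas the standard metric is computed over a whole test set and, in practice, with an established implementation. The `bleu` and `ngram_counts` names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


identical = bleu("the cat sat on the mat", "the cat sat on the mat")
partial = bleu("the cat sat on the mat", "the cat sat on a mat")
unrelated = bleu("a b c d", "w x y z")
```

An identical translation scores 1.0, an unrelated one scores 0.0, and a near match falls in between. The limits mentioned above show up here too: a correct translation that uses different words from the reference is scored low.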
Based on this initial assessment, the resources can be reworked and an iteration is made: the model is trained again. Then, once more, the resulting resource is evaluated.
The aim is to understand translation problems and identify areas for improvement. This is where the human assessment of the Computational Linguist comes in. They will also check the translations of the various test files that represent different domains or problems.
#5 Deliver completed translations
When the assessment is satisfactory, the translation model and resource are delivered. However, it is possible to revise a model if:
- translation problems arise in use;
- the internal BOM is reviewed (for example, when a customer wants to add a space in front of a language character or incorporate organization charts);
- a bug occurs (for example, in tag management).
The linguistic richness of a translation engine comes from the corpus – the basis of machine learning. So there is a real challenge in choosing the right corpus and ensuring its quality. The Computational Linguist enables high-quality and sophisticated machine translation as part of a specialized translation model.