Ardas Khalsa has been a Computational Linguist at SYSTRAN for more than 9 years. Her role is to enrich language resources and train machine translation models. Read on for her profile, her job and what drives her every day.
Can you introduce yourself?
I am Ardas, and I have been a Computational Linguist at SYSTRAN for 9 years. I am part of a team focused on language development, which includes researchers, engineers and language engineers.
Within this team, everyone has their own specialty, from data processing and research into new methods to linguistic development and the creation and enrichment of translation models. Each Computational Linguist on the team is also a specialist in a language family (Germanic, Romance or Slavic, for example).
My main languages are English and Spanish, but when needed I can also work on languages in which I am not bilingual, such as Armenian or Georgian. In fact, I am currently working on an African language: Hausa.
What is a Computational Linguist?
At SYSTRAN, the Computational Linguist works on creating and improving translation engines.
For each language pair, we feed the engine with data: essentially high-quality translations and dedicated language resources that cover the particularities of the source and target languages.
We also support the engine's training on this data, so that it learns to translate new content in other contexts, drawing on the examples it has seen and on its artificial intelligence capabilities.
The translation engine is the “brain” that allows the machine to translate. It must be supplied with texts so that it can keep learning and improving. An engine trained in this way improves accuracy, saves time and cost, and produces more natural, fluent translations.
How do you do that?
When creating a translation engine for a new language pair, or when enriching the database, I collect the best possible bilingual data (also called a bitext corpus) and prepare it to make it as “clean” as possible.
To do this, I identify and then correct or filter out any problems in the data. These can be unaligned segments (a translation that does not match its source) or broken characters (an encoding problem with an accented character, for example).
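As an illustration of the filtering step described above, here is a minimal Python sketch. It is not SYSTRAN's actual pipeline: the function names, the length-ratio threshold and the broken-character pattern are all assumptions chosen for the example. It drops a pair when either side contains the Unicode replacement character or a typical mis-decoded accent sequence, and when the two sides differ so much in length that they are unlikely to be translations of each other.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical threshold: pairs whose lengths differ by more than this
# factor are treated as probably unaligned.
MAX_LENGTH_RATIO = 3.0

# Heuristic for broken characters: the Unicode replacement character, or
# U+00C3 followed by a character in U+0080..U+00BF, which is what an
# accented letter often looks like when UTF-8 bytes are read as Latin-1.
BROKEN_CHARS = re.compile("\ufffd|\u00c3[\u0080-\u00bf]")


def has_broken_characters(text: str) -> bool:
    """Return True if the text shows signs of an encoding problem."""
    return BROKEN_CHARS.search(text) is not None


def looks_unaligned(source: str, target: str) -> bool:
    """Return True if one side is empty or the lengths are too different."""
    src_len, tgt_len = len(source.strip()), len(target.strip())
    if src_len == 0 or tgt_len == 0:
        return True
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > MAX_LENGTH_RATIO


def clean_bitext(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the segment pairs that pass both heuristic checks."""
    kept = []
    for source, target in pairs:
        if has_broken_characters(source) or has_broken_characters(target):
            continue
        if looks_unaligned(source, target):
            continue
        kept.append((source.strip(), target.strip()))
    return kept


if __name__ == "__main__":
    sample = [
        ("The house is red.", "La casa es roja."),
        ("See you tomorrow.", "Hasta ma\u00c3\u00b1ana."),  # broken accent
        ("Yes.", "El informe anual completo se publica la semana que viene."),
        ("", "Hola."),  # missing source segment
    ]
    for source, target in clean_bitext(sample):
        print(source, "|", target)
```

In this sketch problematic pairs are simply filtered out. Correcting them, as mentioned above, would require a further step (re-decoding the text or re-aligning the segments) that depends on the data source.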
What do you think defines a good corpus?
A corpus is a body of already translated text that allows the machine to be trained like a human brain: the more it learns, the more it knows and the better it translates.
The quality of the corpus is therefore essential to translating content correctly.
At SYSTRAN, this is a major focus. A quality corpus must be well aligned (the segments in each language must match correctly), and errors of substance and form must be corrected.
The larger the corpus, the more material the machine has to learn from.
If an existing corpus is too small, we enrich it with other corpora that we rework.
What are your other responsibilities?
In addition to this task, I enrich the internal documentation that our users and partners rely on, which explains and publicizes engine updates.
I also work with the R&D teams, who study translation in context, and I create test files for them by identifying interesting sentences to test.
As a Computational Linguist, I am also sometimes asked to give my opinion on the accuracy of a translation.
The Computational Linguist's role helps guarantee the quality of machine training and therefore the performance of machine translation. It is also a key cross-functional role, since Computational Linguists work with both the technical engineers and the R&D department.