

Building a Translation Memory with Elasticsearch

By Edney Pitta
Localization is a key aspect of Farfetch. Our marketplace is translated into 14 languages, and this number grows every year.

With thousands of new products added to the marketplace each day and countless labels on websites and apps, it’s easy to imagine there is a lot of work for our translation teams. Supporting them with tools and features to accelerate the translation process is crucial for them to keep up with their workload and allow the business to grow.

A translation memory is part of every translator's toolset. It is essentially a database that stores segments of text that have been translated before. When translating a new segment of text, it's possible to query this database and offer a suggestion if an equal or similar text has already been translated, which often turns out to be the case.


The tech translations team at Farfetch is responsible for services related to localization, and we were already using Elasticsearch as our read-model storage for translations. Elasticsearch's distributed nature suited us well in scaling our Language Platform, since we expected to reach a throughput of hundreds of requests per second (more on that another day!).

When faced with the challenge of building a translation memory, choosing Elasticsearch as storage was a no-brainer. It meant we could leverage its full-text search capabilities for matching similar texts, which is the most important part of a good translation memory.

The fun part: building!

Starting in an API-first fashion, we designed a really simple GET endpoint in our Platform that receives a text, a source language, and a target language, and returns translation suggestions for the given text. The response includes the suggested text, the language, and the type of match (exact or fuzzy).

Request and response sample
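To make the contract concrete, here is a minimal sketch of what such a request and response could look like. The route, parameter names, and field names below are illustrative assumptions, not the actual Farfetch API:

```python
# Hypothetical shape of the suggestions endpoint contract.
# Route and field names are assumptions for illustration only.
request = {
    "method": "GET",
    "path": "/translation-memory/suggestions",
    "query": {
        "text": "Free returns on all orders",
        "sourceLanguage": "en-US",
        "targetLanguage": "pt-PT",
    },
}

response = {
    "suggestions": [
        {
            "text": "Devoluções gratuitas em todas as encomendas",
            "language": "pt-PT",
            "matchType": "exact",  # "exact" or "fuzzy"
        }
    ],
}
```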

Modelling the data in Elasticsearch was pretty straightforward. Following our previous approach for storing translations, we decided to create one index per language, so that we could leverage Elasticsearch's analyzers on the text field and keep our documents small.

To correlate the texts between indices, we calculate a hash on the text of the original language (usually English) and set the same value for all variants in other languages. We then store it in a field called unitId, meaning that all these variants form a Translation Unit (a term borrowed from the TMX file spec).
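The correlation scheme can be sketched in a few lines. The post doesn't say which hash function or index naming is used, so SHA-256 and the index names below are assumptions:

```python
import hashlib

def unit_id(source_text: str) -> str:
    # Hash the original-language text; all translated variants of the
    # same source sentence share this value, forming a Translation Unit.
    # (SHA-256 is an assumption; any stable hash would do.)
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()

source = "Free returns on all orders"
uid = unit_id(source)

# One index per language (hypothetical index names), so each language
# gets its own analyzer on the `text` field and documents stay small.
docs = {
    "translations-en-us": {"unitId": uid, "text": source},
    "translations-pt-pt": {"unitId": uid, "text": "Devoluções gratuitas em todas as encomendas"},
    "translations-fr-fr": {"unitId": uid, "text": "Retours gratuits sur toutes les commandes"},
}
```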

Document samples in Elasticsearch

When a request reaches our endpoint, we first search the original-culture index to find all unitIds in which the text appears. We then search for those unitIds in the other cultures' indices, getting back exactly how the original text was translated.
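The two-phase lookup maps naturally onto two standard Elasticsearch query types: a full-text `match` query for the first phase and an exact `terms` lookup for the second. The query bodies below are a sketch of that idea, not the exact queries the platform runs:

```python
def source_query(text: str) -> dict:
    # Phase 1: full-text search in the original-language index to find
    # every unitId whose source text is equal or similar to the input.
    return {
        "query": {"match": {"text": text}},
        "_source": ["unitId"],
    }

def target_query(unit_ids: list) -> dict:
    # Phase 2: exact lookup in a target-language index to fetch how
    # those units were translated.
    return {"query": {"terms": {"unitId": unit_ids}}}
```

Using `match` in phase 1 is what makes fuzzy suggestions possible: the analyzer tokenizes the input, so partially overlapping sentences still score and can be surfaced as fuzzy matches.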

Our translation tool, Dermio (Portuguese people will understand this name)

Finally, here’s how we present it to translators. This is an amusing case showing that, as a result of correlating translations through the original text, it is possible to get different translation suggestions for the same text!

Next steps

As with everything, there’s always room for improvement. Our next goal is to improve our matching search, since its robustness and efficiency are what really make a translation memory shine in the translation process. We also want to better support the management of translation memories, segregating them by context and even by individual translators.

The translation process at Farfetch is at the foundation of a myriad of business processes and routines. With the translation memory and other features and tools, our biggest goal is to make it as efficient as possible, contributing to Farfetch’s ultimate mission: to be the global technology platform for luxury, available in everyone’s language.