

Smart rankings at FARFETCH: 101

By Edgar Coelho
Every day at Farfetch, we receive thousands of new products to be made available on our marketplace. All these new products ensure that we have an unrivalled range of fashion for our customers around the world. But with hundreds of thousands of products, how can we make sure our customers can find the pieces they want?

The popularity of an item depends heavily on influencers who embody the most recent fashion trends. To stay up to date with the continually evolving luxury fashion world, Farfetch has a dedicated team of fashion experts who curate a large selection of products, ranking them according to a set of principles and fashion know-how. But is this enough? How can one scale to hundreds of thousands of products?

Example of a Farfetch product listing page (PLP).

We aim to deliver an intelligent automated system, enhanced by feedback from our fashion experts, so we can extend our manual curation efforts to an almost infinite set of pages. The application of Artificial Intelligence (AI) techniques to construct ranking models that output ordered lists is known as Learning to Rank (LETOR or LTR), a field of Information Retrieval. In contrast with straightforward ways to sort products (e.g., by query similarity, creation date, category, or price range), ranking models provide intelligent ways to order our catalogue by focusing on popularity, trends, stock availability, product information, and other relevant signals. This exposes our customers to the diversity of the Farfetch catalogue while also giving our system the capacity to optimise the main business metrics.

Learning to Rank at Farfetch

To get a real sense of our problem, let's think about Anna. Anna is a potential Farfetch customer, and she is looking for a specific bag from Gucci. She has been seeing it in influencers' posts on social media over the past few days and she is considering buying it.

Anna opens the Farfetch website, navigates to the Gucci page and filters for bags, finding more than 600 items meeting these criteria.

She is shown a list of Gucci bags that are cleverly sorted with the sole objective of improving her shopping experience. This page is defined by the variables gender (Women), designer (Gucci) and category (Bags); for Anna's specific session, this combination is usually referred to as a query identifier (query id). The listed products are called documents. Finally, relevance quantifies how important a product is to a given query. Values for relevance are usually obtained by recording user navigation patterns on websites.
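These three concepts (query id, documents, relevance) can be made concrete with a small sketch. The field names and the graded relevance scale below are illustrative assumptions for this post, not FARFETCH's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RankedDocument:
    product_id: str
    relevance: int  # graded label, e.g. 0 (seen only) .. 3 (purchased)

@dataclass
class Query:
    query_id: str   # identifies the (gender, designer, category) context
    gender: str
    designer: str
    category: str
    documents: list = field(default_factory=list)  # RankedDocuments shown

# Anna's session: the bag she bought sat in third position.
anna_query = Query(
    query_id="women|gucci|bags",
    gender="Women",
    designer="Gucci",
    category="Bags",
    documents=[
        RankedDocument("bag-001", relevance=0),
        RankedDocument("bag-002", relevance=0),
        RankedDocument("bag-003", relevance=3),  # the purchased bag
    ],
)
```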

Without the work of our fashion experts and our ranking system, Anna would most likely have to navigate several pages until she found the item she was looking for.

LETOR models use many query ids (in layman's terms, logs of users' interactions with the platform) to train machine learning models that optimise for relevance. For instance, the Gucci bag Anna was looking for appeared in the third position of the listing page she was shown. Because she purchased it, we learn that, for her, the third-ranked item was more relevant than the first two, and this signal opens a way to improve future results.
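Turning such interaction logs into graded relevance labels can be sketched as follows. The event names and grade values are assumptions for illustration; a common choice is to keep the strongest signal observed per product in a session:

```python
# Hypothetical mapping from logged events to graded relevance labels.
EVENT_GRADE = {"impression": 0, "click": 1, "add_to_cart": 2, "purchase": 3}

def relevance_labels(session_events):
    """Keep the strongest observed signal per product in a session.

    session_events: iterable of (product_id, event) tuples.
    Returns {product_id: grade}.
    """
    labels = {}
    for product_id, event in session_events:
        grade = EVENT_GRADE.get(event, 0)
        labels[product_id] = max(labels.get(product_id, 0), grade)
    return labels

# Anna saw three bags and purchased the one in third position:
events = [
    ("bag-001", "impression"),
    ("bag-002", "impression"),
    ("bag-003", "impression"),
    ("bag-003", "click"),
    ("bag-003", "purchase"),
]
# relevance_labels(events) -> {"bag-001": 0, "bag-002": 0, "bag-003": 3}
```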

Taking a closer look to LETOR

General learning to rank framework. Image from Catarina Moreira's machine learning course at the University of Lisbon.

The image above sums up the typical LETOR learning (also referred to as training) process. The training data comprises a finite number of training queries (q), each containing a set of products (x) ordered by relevance. The training data is fed to a learning system, which, as the name suggests, will learn the intricate differences and similarities between queries, resulting in a ranking model. The final model can then look at an unseen list of queries and predict a ranking for each of them.
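As a toy instance of this framework, here is a minimal "learning system" that fits a least-squares line from a single document feature to relevance, then ranks an unseen query's documents by predicted score. The feature, the data, and the closed-form fit are all invented for illustration; real systems use many features and far richer models:

```python
def fit_pointwise(samples):
    """Fit relevance ~ feature by least squares.

    samples: list of (feature, relevance). Returns (slope, intercept).
    """
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

def rank(model, docs):
    """docs: list of (doc_id, feature). Returns doc_ids, best first."""
    slope, intercept = model
    return [d for d, _ in sorted(docs, key=lambda p: -(slope * p[1] + intercept))]

# Training data pooled from several queries: the feature could be, say,
# a popularity score; the target is the graded relevance label.
train = [(0.1, 0), (0.4, 1), (0.8, 3), (0.2, 0), (0.9, 3)]
model = fit_pointwise(train)
# rank(model, [("a", 0.3), ("b", 0.7), ("c", 0.1)]) -> ["b", "a", "c"]
```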

LETOR approaches can be categorised into three groups: pointwise, pairwise and listwise. Simply put, pointwise approaches look at each document in the data set independently and predict how relevant it is to the given query; pairwise approaches try to find the optimal ordering for a given pair of documents; and listwise approaches take the entire list of documents and try to find its optimal ordering. It's important to note that, since we are ultimately after the optimal order of a list of documents, pairwise and listwise approaches have an advantage over pointwise ones, as the latter do not take order into account during training. Pointwise approaches, however, have the advantage of allowing the use of any machine learning classification or regression technique.

Regardless of the chosen learning model, the ranking system will always fall in one of these categories.1

Farfetch’s LETOR framework uses a state-of-the-art tree-based learning model to predict the rankings of documents in listing pages, including categories and designers. We are currently experimenting with different approaches in our LETOR paradigm. The starting point was a pointwise model comprising a diverse set of features that characterise our products and users' behaviour. Currently, we have fine-tuned the model to cope with unexpected behaviours, such as outliers and data inconsistency. In doing so, our ranking system is flexible enough to handle different data patterns so it can rank new products effectively and update rankings to ensure the renewed relevance of our predictions. We will detail our LETOR approaches further in a future blog post.

Our ranking ecosystem is framed as explained in "Where engineering meets data science - an architecture overview", where our API exposes the rankings computed by several AI modules which are continuously being updated by the events triggered across the whole Farfetch platform and captured by our Fetcher module.

Model evaluation and the decision-making process

But how do we actually come to the conclusion that one model surpasses the other? How can we assure our stakeholders that implementing a new model will outperform current baselines while also pleasing our customer base?

One critical step in the development of any learning model is how we evaluate its performance, which is especially crucial when we have these models in a production environment. However, choosing the most appropriate metrics to evaluate a learning model is no trivial task, since the model can yield satisfying results when evaluated using one metric, but poor results when compared with another.

There are several metrics we can use to measure a LETOR algorithm offline. In our case, we mostly focus on two: Normalized Discounted Cumulative Gain (NDCG) and Kendall Tau.
  • NDCG measures the gain of a document based on its relevance judgement (user history) and its position in the product list. The gain is higher when the model gives priority to products with high observed relevance. Since the list length varies with the query, this gain value needs to be normalised.
  • Kendall Tau measures the correlation between two ranked lists based on the count of concordant and discordant pairwise combinations. Informally, given two ranked lists of the same products, Kendall Tau reflects how many pairwise swaps are needed to turn one list into the other.
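Both metrics are simple enough to sketch directly. The implementations below use a linear gain for DCG (one common variant) and the classic concordant-minus-discordant formula for Kendall Tau; the example numbers are invented:

```python
import math

def ndcg(ranked_relevances):
    """NDCG for one query: DCG of the model's order divided by the DCG
    of the ideal (relevance-descending) order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

def kendall_tau(rank_a, rank_b):
    """Correlation between two rankings of the same items, from the count
    of concordant minus discordant pairs; 1 = identical, -1 = reversed."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for i in range(len(rank_a)):
        for j in range(i + 1, len(rank_a)):
            if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# A perfect ranking scores 1.0 on both metrics:
# ndcg([3, 2, 1, 0]) -> 1.0
# kendall_tau(["a", "b", "c"], ["a", "b", "c"]) -> 1.0
# kendall_tau(["a", "b", "c"], ["c", "b", "a"]) -> -1.0
```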
Offline evaluation is an important component of our work, as it allows us to optimise ranking policies before they are deployed in production. Recently, there has been renewed discussion of the differences between offline and online evaluation.2 One way to overcome this hurdle is to refresh our offline data every time we benchmark new models offline. Another is to iterate with successive A/B tests: by doing this, we can show several possibilities to our users and empirically test our hypotheses, with each iteration aiming to improve a target metric.

Nonetheless, we also need to take into consideration the experience of the user on our platform. What we have learned over time is that it is not only a matter of presenting the best-selling products or the products with the best business metrics; we also need to present the products in a visually appealing flow. And we need to make sure we align our products with the ever-changing trends and tastes of our customers, who want to stay on top of the latest fashion.

What’s next?

As mentioned in a previous blog post by our teams, "Where engineering meets data science - an architecture overview", we had to make changes to the architecture of our systems, and we are continuously improving it to amaze our customers every time they visit and shop with us. Having AI components in a live environment is not a one-size-fits-all solution, as every problem is unique. In this post, we shed some light on our world and explained why automated ranking of our products is important, especially in the constantly changing world that is fashion. In the next post, we want to take you on a more specific journey, the journey of a product, alongside a deep dive into our whole process.

Special thanks to António Garcez, Daniela Ferreira, Eraldo Borel, Hugo Galvão, Hugo Pinto, Isabel Chaves, João Teixeira, Jorge Marques, Luís Baía, Marcelo Fernandes, Mário Barbosa, Maurício Junior, Miguel Monteiro, Pedro Nogueira, Peter Knox, Ricardo Gamelas Sousa and Vitor Saraiva.

1. For more detailed information, the reader is referred to the survey linked here.
2. See the article from Thorsten Joachims et al. on how to perform an offline evaluation for recommender systems here.
