How can machines help us visually describe our products?
By
Luís Baía

Introduction
At Farfetch we aim to reconcile a product-oriented mindset with a scientific research culture. In other words, we aim to continuously deliver novel functionality to a business product while ensuring robustness and state-of-the-art quality in our machine learning models. VIPER, which stands for Visual Information for Product Exploration and Retrieval, is a valuable new business application which also resulted in a scientific publication at the AI for Fashion workshop at KDD 2018. The animation below shows our AI system for automatic product cataloguing in action.

A demonstration of our image recognition AI system.
It is essential for any e-commerce platform to correctly categorise products and produce high-quality item descriptions. For instance, when a user searches for products in a specific colour, category, pattern or detail, the results should match items with the corresponding descriptive information. The importance of a rich categorisation and description of products therefore becomes evident.
Usually, this is accomplished by a demanding manual process which requires every new product to be thoroughly described by one or several people. As you can imagine, this is a cumbersome process whose complexity increases as the business grows. Not only does a larger team have to be assembled, but we also need to ensure the consistency of the description process across every team member involved. Consider examples such as a "salmon" colour, a "chevron" pattern or an "elegant" type of clothing. A manual process of this kind is a very challenging endeavour.
A close-up on a possible result of our automatic categorisation system. One possible application is to alleviate the workload of human annotators and improve quick access to relevant products.
A briefing on image categorisation
Since the earliest developments in computer vision, we have been able to extract information from images successfully. However, it was in the last couple of years that a significant breakthrough was accomplished, mostly driven by Deep Learning paradigms, in particular Convolutional Neural Networks (CNNs). In a nutshell, a CNN builds a hierarchy of layers, where each layer adds a level of abstraction on top of the previous ones. Loosely mimicking the human visual system, in an image classification setting the first layer of a CNN might detect edges, or other primitive visual features, from raw pixels, while the second layer uses those to compute simple shapes. This layer-wise approach allows one to find higher-level features, like clothing-specific patterns, which can be used to classify the input as belonging to a given category.

A possible simple solution for the Farfetch image categorisation and labelling problem is to take advantage of a pre-trained CNN model and adapt it to our fashion domain. Let us say we use a deep learning model (e.g., ResNet, a widely known CNN) as the basis for our network:
A classical recognition pipeline.
What is the problem that we wish to solve?
Over one million products have been integrated into Farfetch's ecosystem, with every product following a strict categorisation hierarchy:
- Family: such as clothing, shoes, bags, accessories, ...
- Category: such as dresses, coats, sandals, ...
- Subcategory: which can be, day dresses, military tops, ...
Example of Farfetch categorisation tree. Each colour corresponds to a hierarchical level: family in grey, the category in dark blue, the subcategory in green, and attributes in light blue.
In our case, the family prediction is discarded, since a correct category prediction automatically determines the family. The same reasoning could be applied to the category/sub-category relation, but we consider the latter to be a particularly hard problem to solve (distinguishing a day dress from an evening dress is subjective). A wrong prediction here would often corrupt the upper levels of the category tree.
Summing up, we have devised a model that should predict the category, sub-category and all the attributes of a product. The family will be automatically inferred given the model predictions.
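The family-inference step is straightforward given the tree structure. Below is a minimal sketch with a toy slice of such a tree; the labels and the `family_of` helper are illustrative, not Farfetch's actual taxonomy or code.

```python
# A toy slice of the categorisation tree (labels are illustrative,
# not Farfetch's actual taxonomy).
CATEGORY_TREE = {
    "clothing": {"dresses": ["day dresses", "evening dresses"]},
    "shoes": {"sandals": ["flat sandals"], "boots": ["ankle boots"]},
}

def family_of(category):
    """The family is fully determined by the predicted category."""
    for family, categories in CATEGORY_TREE.items():
        if category in categories:
            return family
    raise KeyError(category)

print(family_of("sandals"))  # shoes
print(family_of("dresses"))  # clothing
```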
While a product can only have one family, one category and one sub-category, the same principle doesn't apply to attributes, which may appear multiple times for the same product. We therefore distinguish between multi-class and multi-label problems. The former stands for cases where exactly one class must always be predicted. In particular, a probability vector can be built where the probabilities of all the classes sum to one: if a product is predicted with an 80% probability of being a dress, the remaining 20% have to be split among all the other possible categories. Multi-labelling happens when each class may occur independently of all the others, so the probabilities of all the classes no longer have to sum to one. For example, a product may be predicted to have an 80% probability of a "v-neck" attribute and a 90% probability of long sleeves. The category and sub-category cases correspond to multi-class problems, while the attribute case corresponds to a multi-label problem.
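In neural-network terms, this distinction typically maps to a softmax output for multi-class heads and independent sigmoid outputs for multi-label heads. The sketch below illustrates the two with plain Python; all the raw scores are made-up numbers, not model outputs.

```python
import math

def softmax(logits):
    """Multi-class: class probabilities compete and sum to one."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Multi-label: each label is scored independently in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw scores from a category head (dress, coat, sandal):
category_probs = softmax([2.0, 0.5, -1.0])
print(round(sum(category_probs), 6))  # 1.0 -- exactly one category wins

# Hypothetical raw scores for two attributes (v-neck, long sleeves):
attribute_probs = [sigmoid(s) for s in (1.4, 2.2)]
print(attribute_probs)  # each in (0, 1); no sum-to-one constraint
```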
Multi-model approach
The hierarchical structure of Farfetch data creates an intricate dependency which should be considered when building our models. In our first approach, we took advantage of several independent models:
A schematic representation of our first approach architecture.
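The cascade can be sketched as follows, with toy stand-ins for the trained models (all names and labels here are hypothetical). Because a dedicated sub-category and attribute model exists per category, a wrong category prediction routes the image to the wrong downstream models.

```python
# Sketch of the multi-model cascade. The category prediction selects
# which per-category sub-category/attribute model is applied next.

def predict(image, category_model, subcategory_models, attribute_models):
    category = category_model(image)
    subcategory = subcategory_models[category](image)
    attributes = attribute_models[category](image)
    return category, subcategory, attributes

# Toy stand-ins for trained models (hypothetical labels):
category_model = lambda img: "dresses"
subcategory_models = {"dresses": lambda img: "day dresses"}
attribute_models = {"dresses": lambda img: ["v-neck", "long sleeves"]}

print(predict("product.jpg", category_model, subcategory_models,
              attribute_models))
# ('dresses', 'day dresses', ['v-neck', 'long sleeves'])
```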
Although simple to implement, this multi-model strategy has some serious drawbacks:
- The pool of models does not scale since training efforts need to be multiplied;
- Memory costs increase linearly with the number of categories;
- Errors on category prediction simply obliterate the chances of correctly predicting sub-categories and attributes;
- Although sub-category and attributes are correlated to the category, the category model cannot enhance its predictions using any feedback from the sub-category and attributes models;
- The models do not explore the product category tree.
Hierarchical approach
For all the aforementioned reasons, it makes sense to pursue a single-model system. In our particular case, we are looking for a three-output model which simultaneously predicts the category, sub-category and attributes of a product image:
Hierarchical approach architecture.
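A minimal PyTorch sketch of such a three-output head is shown below: one shared feature vector feeds a softmax head per multi-class task and a sigmoid head for the multi-label attributes. All the layer sizes are illustrative placeholders, not Farfetch's real taxonomy.

```python
import torch
import torch.nn as nn

class ThreeOutputHead(nn.Module):
    """Shared image features feed three task-specific heads.
    All sizes are illustrative, not Farfetch's real taxonomy."""
    def __init__(self, feat_dim=512, n_cat=10, n_subcat=30, n_attr=40):
        super().__init__()
        self.category = nn.Linear(feat_dim, n_cat)        # multi-class
        self.subcategory = nn.Linear(feat_dim, n_subcat)  # multi-class
        self.attributes = nn.Linear(feat_dim, n_attr)     # multi-label

    def forward(self, features):
        return (
            torch.softmax(self.category(features), dim=1),
            torch.softmax(self.subcategory(features), dim=1),
            torch.sigmoid(self.attributes(features)),
        )

head = ThreeOutputHead()
features = torch.randn(2, 512)  # batch of features from a shared CNN
cat, subcat, attr = head(features)
print(cat.shape, subcat.shape, attr.shape)
```

Sharing the backbone means a single forward pass serves all three tasks, and the joint training signal lets the tasks inform one another, addressing the drawbacks listed for the multi-model strategy.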

Predictions of our hierarchical approach. We compare against the manual annotations. Correct predictions (matching the manual annotation) are shown in green, the incorrect ones are in red and the correct predictions that were not manually annotated are shown in blue.
Wrapping up
In this post, we have introduced VIPER, a platform for visual information extraction at Farfetch. After stating the problem and briefly describing the hierarchical structure of our data, we covered two modelling strategies: a multi-model approach and a more evolved single-model architecture, both able to perform single-image description satisfactorily.

Farfetch, however, has multiple images showing different perspectives of the same product. Some product features are only visible from certain perspectives (e.g., shoe heels are usually not visible from the front), causing the single-view model to fail in certain cases. In a future article, we will explore a neural network which leverages the additional information provided by the different perspectives of a product's images to improve our model's accuracy. Stay tuned.
Last but not least, João Faria and I would like to say a special thank you to Beatriz Ferreira (Data Scientist, PhD Student, NETSys), Hugo Pinto (Data Scientist), Vitor Teixeira (Software Engineer), Daniela Ferreira (Test Engineer), Peter Knox (Senior Test Automation Engineer), Ricardo Sousa (Lead Data Scientist), João Santos (Lead Software Engineer) and Hugo Galvão (Product Owner).
Also, we are so into data science that we wrote a scientific article, which was published at Knowledge Discovery in Databases (KDD) in 2018.
This article was written by Senior Data Scientist Luís Baia in collaboration with Data Scientist João Faria.