

How can machines help us visually describe our products?


This article was written by Senior Data Scientist Luís Baía in collaboration with Data Scientist João Faria.


At Farfetch we aim to reconcile a product-oriented mindset with a scientific research culture. In other words, we aim to continuously deliver novel functionalities to a business product while ensuring robustness and state-of-the-art quality in our machine learning models. VIPER, which stands for Visual Information for Product Exploration and Retrieval, is a valuable new business application which also resulted in a scientific publication at the AI for Fashion workshop at KDD 2018. The animation below shows our AI system for automatic product cataloguing in action.

A demonstration of our image recognition AI system.

It is essential for any e-commerce platform to correctly categorise products and produce high-quality item descriptions. For instance, when a user searches for products with a specific colour, category, pattern or detail, the results will only match items with the corresponding descriptive information. The importance of a rich categorisation and description of the products therefore becomes evident.

Usually, this is accomplished through a quite demanding manual process which requires every new product to be thoroughly described by one or several people. As you can imagine, this is a cumbersome process whose complexity increases as the business grows. Not only does a larger team have to be assembled, but we also need to ensure the consistency of the description process between every team member involved. Consider examples such as the "salmon" colour, a "chevron" pattern or an "elegant" type of clothing. A manual process of this kind is a very challenging endeavour.

A close-up on a possible result of our automatic categorisation system. One possible application is to alleviate the workload of human annotators and improve quick access to relevant products.

A brief introduction to image categorisation

Since the earliest developments in computer vision, we have been able to extract information from images successfully. However, it was only in the last couple of years that a significant breakthrough was accomplished, driven mostly by Deep Learning paradigms, namely Convolutional Neural Networks (CNNs). In a nutshell, a CNN builds a hierarchy of layers, where each layer adds a level of abstraction on top of the previous ones. Loosely mimicking the human visual system, in an image classification setting the first layer of a CNN might detect edges, or other primitive visual features, from raw pixels, while the second layer uses those to compute simple shapes. This layer-wise approach makes it possible to find higher-level features, like clothing-specific patterns, which can be used to classify the input as belonging to a given category.
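As a toy illustration of this layer stacking (all sizes are arbitrary and we assume a PyTorch implementation here; this is not a model from the post):

```python
import torch
import torch.nn as nn

# A toy CNN: each convolutional layer builds more abstract features
# on top of the previous one (edges -> simple shapes -> patterns).
toy_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level: edges
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level: simple shapes
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 10),  # scores for 10 arbitrary categories
)

scores = toy_cnn(torch.randn(1, 3, 224, 224))  # one 224x224 RGB image
```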

A possible simple solution to the Farfetch image categorisation and labelling problem is to take advantage of a pre-trained CNN model and adapt it to our fashion domain. Let us say we use a deep learning model (e.g., ResNet, a widely known CNN) as the basis for our network:
A classical recognition pipeline.

The adaptation of the network mentioned above consists of training the model on the Farfetch dataset. Such practice is well established in the literature and is known as "transfer learning": reusing a network (ideally, a state-of-the-art CNN from a publicly available repository) by applying it to a different problem. The process can go two ways: (1) using the model without training on the new data, or (2) starting a new training phase on top of the existing model.

What is the problem that we wish to solve?

Over one million products have been integrated into Farfetch's ecosystem, with every product following a strict categorisation hierarchy:
  • Family: such as clothing, shoes, bags, accessories, ...
  • Category: such as dresses, coats, sandals, ...
  • Subcategory: such as day dresses, military tops, ...
In addition, some categories may also contain attributes. For instance, dresses may have a length attribute (long, short) or a neckline attribute (round, square).
Example of the Farfetch categorisation tree. Each colour corresponds to a hierarchical level: family in grey, category in dark blue, subcategory in green, and attributes in light blue.

In this post, we will be addressing the problem of product categorisation. We want to build a model that is capable of automatically selecting the proper categories and attributes of a product solely by considering its images.

In our case, the family prediction is discarded, since a correct category prediction automatically determines the family. The same reasoning could be applied to the category/sub-category relation, but we consider the latter a particularly hard problem to solve (distinguishing a day dress from an evening dress is subjective), and a wrong prediction at that level would often destroy the upper levels of the category tree.

Summing up, we have devised a model that should predict the category, sub-category and all the attributes of a product. The family will be automatically inferred given the model predictions.
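Since each category belongs to exactly one family, inferring the family reduces to a simple lookup once the category has been predicted. A minimal sketch (the category and family names here are illustrative, not Farfetch's actual taxonomy):

```python
# Hypothetical category -> family mapping, derived from the catalogue tree.
CATEGORY_TO_FAMILY = {
    "dresses": "clothing",
    "coats": "clothing",
    "sandals": "shoes",
}

def infer_family(predicted_category: str) -> str:
    """The family is fully determined by the predicted category."""
    return CATEGORY_TO_FAMILY[predicted_category]

print(infer_family("sandals"))  # shoes
```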

While a product can only have one family, one category and one sub-category, the same principle does not apply to attributes, which may appear multiple times for the same product. We therefore distinguish between multi-class and multi-label problems. The former stands for cases where exactly one class must always be predicted: a probability vector can be built in which the probabilities of all the classes sum to one. If a product is predicted to be a dress with 80% probability, then the remaining 20% must be split between all the other possible categories. Multi-labelling happens when each class may occur independently of all the others, so the probabilities of all the classes no longer need to sum to one. For example, a product may be predicted to have a "v-neck" attribute with 80% probability and long sleeves with 90% probability. The category and sub-category cases are multi-class problems, while the attribute case is a multi-label problem.
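A quick numerical sketch of this difference, using arbitrary raw scores for three classes (a softmax for the multi-class case, a sigmoid for the multi-label case):

```python
import numpy as np

scores = np.array([2.0, 0.4, -1.1])  # hypothetical raw scores for 3 classes

# Multi-class: softmax makes the classes compete, so probabilities sum to 1.
softmax = np.exp(scores) / np.exp(scores).sum()

# Multi-label: a sigmoid scores each class independently, so the total is
# unconstrained and several attributes can be likely at the same time.
sigmoid = 1.0 / (1.0 + np.exp(-scores))

print(round(softmax.sum(), 6))  # 1.0
```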

Multi-model approach

The hierarchical structure of Farfetch data creates an intricate dependency which should be considered when building our models. For our first approach, we have taken advantage of several independent models:
A schematic representation of our first approach architecture.

It all starts with the category prediction. Let us say an image is classified as "dress": it is then passed to the "dresses" module, which contains two new models that independently predict the sub-category and the attributes. Although trained independently, each of these models follows the same architecture. Specifically, we have used a ResNet with two dense layers on top, which reduce the latent feature vector to the dimensionality of each output domain. The difference lies in the activation function of the final layer. For multi-class problems (categories and sub-categories) we have used the softmax function, while for multi-label problems (attributes) we have leveraged the sigmoid.

Although simple to implement, this multi-model strategy has some serious drawbacks:
  • The pool of models does not scale, since training efforts have to be multiplied;
  • Memory costs increase linearly with the number of categories;
  • Errors in the category prediction eliminate any chance of correctly predicting the sub-categories and attributes;
  • Although the sub-category and attributes are correlated with the category, the category model cannot improve its predictions using feedback from the sub-category and attribute models;
  • The models do not exploit the product category tree.

Hierarchical approach

For all the aforementioned reasons, it makes sense to pursue a single model system. In our particular case, we are looking for a three-output model which simultaneously predicts the category, subcategory and attributes of a product image:
Hierarchical approach architecture.

In this architecture, we split an originally shared sub-network into three different paths, one for each output. There are hierarchical relations we can exploit, so we introduce an extra "message passing" layer. The subcategory and attributes depend on the category, and vice-versa, so it makes sense to create extra links between those paths, letting the connected subnetworks benefit from each other's knowledge. Training is similar to the single-output case, except that each of the outputs contributes to the loss.
Predictions of our hierarchical approach. We compare against the manual annotations. Correct predictions (matching the manual annotation) are shown in green, the incorrect ones are in red and the correct predictions that were not manually annotated are shown in blue.

In the above pictorial example, we show a sample of products fed into our hierarchical model. For all three products, the model successfully predicted the category, although it picked the wrong dress subcategory. At the attribute level, there is only one wrong prediction, for the coat. In fact, just by considering the given image, it is hard to distinguish a low from a mid-length coat, even for us humans. Finally, we highlight the importance of the predictions shown in blue: these correspond to attributes missing from the original data, which our new model is now able to predict.

Wrapping up

In this post, we have introduced VIPER, a platform for visual information extraction at Farfetch. After stating the problem and briefly describing the hierarchical structure of our data, we covered two modelling strategies: a multi-model approach and a more evolved single-model architecture, both able to perform single-image description satisfactorily.

Farfetch, however, has multiple images showing different perspectives of the same product. Some product features are only visible from certain perspectives (e.g., shoe heels are usually not visible from the front), causing the single-view model to fail in such cases. In a future article, we will explore a neural network that leverages the additional information provided by the different perspectives of a product's images to improve our model's accuracy. Stay tuned.

Last but not least, João Faria and I would like to say a special thank you to Beatriz Ferreira (Data Scientist, PhD Student, NETSys), Hugo Pinto (Data Scientist), Vitor Teixeira (Software Engineer), Daniela Ferreira (Test Engineer), Peter Knox (Senior Test Automation Engineer), Ricardo Sousa (Lead Data Scientist), João Santos (Lead Software Engineer) and Hugo Galvão (Product Owner).