

Powering AI With Vector Databases: A Benchmark - Part I

By Pedro Moreira Costa
2020 Edition Farfetcher with a data enthusiasm that loves to work remotely from his new CASABLANCA in Porto
At FARFETCH, we provide an extensive product catalogue of fashion products stemming from major fashion brands. Customarily, users can browse the catalogue in traditional web interfaces (i.e., website, mobile app), where they can open specific product pages, explore the product’s attributes and add their favourite apparel to the shopping cart.

iFetch is a novel multimodal conversational AI system with the ability to interact with a potential customer using Natural Language to assist in the purchase journey. iFetch’s goals are:

  1. Provide high-end clientele with a best-in-class online shopping experience, serving the massive paradigm shift towards online conversational assistants while meeting the expectations set by high-end physical stores;
  2. Augment the customer experience through accurate and reliable AI-enabled multimodal tools that advise and influence customer shopping journeys, leading to increased conversion rates.

To achieve this, the iFetch team has been steadily developing machine learning models to build a conversational system that can communicate autonomously with our customers. However, a product catalogue is seldom enough to enable state-of-the-art machine learning algorithms to capture product relationships and language semantics. The answer lies in product representations through embeddings! Once we have representation models that can encode the products, we need to ensure that the resulting embeddings are persisted and served by a robust and performant indexing service (depicted in the Figure below). Vector databases, powered by approximate nearest neighbour (ANN) algorithms, are the go-to solution for storing this information.

Holistic representation of the iFetch system architecture. A service will be responsible for orchestrating the components for the utterance processing and state-tracking (NLU+DST), Dialogue Policy (DP), dialogue aware product Retrieval and response generation (NLG).

This blog article surveys the most recent, popular and reportedly robust large-scale vector databases, also known as vector similarity engines (VSE), that can sustain conversational AI systems. Our analysis covered the following databases: Vespa, Milvus, Qdrant, Weaviate, Vald and Pinecone.

Engine Features

As we prepared the experimental setup, we detected some capacity limitations in the selected technologies, which prompted the team to consider long-term concerns for the adopted solution. A small preliminary note is required here: it would be a daunting task to review all VSEs thoroughly. We leave that for future iterations of this research (which will take the GitHub vector-engine list into account). The table below details the key features we wanted to identify in each solution.

A feature summary of the most well known VSE - this is not intended to be a comprehensive list.

Benchmarking VSE: An Analysis

To select VSE candidates, we considered several criteria: diversity of index types, metric types, model serving, open-source community adoption and quality of documentation. We understand that Milvus is the most actively developed engine in the vector database ecosystem, backed by a rich collection of documentation. We were also thrilled by Milvus's diversity of metrics and index types. Besides Weaviate and Vespa, no other engine has model serving functionality. Due to time constraints, we did not consider Vespa, favouring instead the more recent Weaviate, which also allows graph data models. Qdrant posed several difficulties in our benchmark setup, and Pinecone has a proprietary index type, which hinders a reliable comparison.

In this regard, the rest of the article will be focused on both Milvus and Weaviate, as they appear to be the VSEs that entirely meet our criteria.


Without further ado, let us first describe the setup for the experiment we ran for this blog. To evaluate the selected engines, we have used an Azure machine with the following hardware and software conditions:
CPU: Intel Xeon E5-2690 v4
RAM: 112 GB
Storage: 1024 GB HDD

Operating System: Linux 16.04-LTS
Environment: Anaconda 4.8.3 with Python 3.8.12


For the sake of reproducible research, we have used a public dataset composed of crawled startup data. The raw parsed data can be found in this Google Startup Dataset, which comprises 40,474 startup records with the following attributes (this is the same dataset as explored in the Qdrant tutorial):
  • name;
  • company description;
  • location;
  • picture URL.
Additionally, we are working with precomputed embeddings for the company descriptions. Embedding dimensionality is fixed at 768. The embeddings themselves have been encoded using a conventional transformer model.
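
To make the role of the index concrete, the sketch below runs an exact (brute-force) similarity search over a few toy vectors. This is the computation that an ANN index approximates at scale; it is not the code used in the benchmark. The record names, the 3-dimensional vectors (instead of 768) and the choice of cosine similarity are illustrative assumptions, since the article does not state the metric.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query: Sequence[float],
                embeddings: Dict[str, Sequence[float]],
                top_k: int) -> List[Tuple[str, float]]:
    """Score every stored embedding against the query and keep the best top_k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for startup description embeddings (hypothetical values).
toy_embeddings = {
    "startup_a": [1.0, 0.0, 0.0],
    "startup_b": [0.9, 0.1, 0.0],
    "startup_c": [0.0, 1.0, 0.0],
}
nearest = exact_top_k([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print(nearest)
```

The exact search compares the query with every stored vector, so its cost grows linearly with the number of entities; this is the cost that ANN indexes such as HNSW trade a little accuracy to avoid.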

To stress-test these engines, we have built different scenarios with varying index sizes. On the one hand, we want to see the impact of increasing the number of records (referred to as entities from this point onwards); on the other, the effect of increasing the number of columns (assuming we want to attach more than one encoding to each entity). Do note that none of the tested engine versions supports multiple encodings for a single entity. Thus, the scenarios with more than one encoding per entity (an entity being a startup object encompassing the information mentioned above) have been adapted: we create a replica of the index for each different entity representation.
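
The replication workaround can be sketched as follows. This is a minimal illustration under our own assumptions, not the benchmark script: the base name `startups` and the `_repr_` naming scheme are hypothetical.

```python
from typing import List


def replica_index_names(base_name: str, encodings_per_entity: int) -> List[str]:
    """One index name per entity representation, since the tested engine
    versions store a single vector per entity."""
    if encodings_per_entity < 1:
        raise ValueError("encodings_per_entity must be at least 1")
    return [f"{base_name}_repr_{i}" for i in range(encodings_per_entity)]


def total_vectors(num_entities: int, encodings_per_entity: int) -> int:
    """Total number of vectors stored across all replica indexes."""
    return num_entities * encodings_per_entity


names = replica_index_names("startups", 2)
print(names)
print(total_vectors(40474, 2))
```

Because every replica holds all entities, both the storage and the work per scenario grow with the number of encodings per entity.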

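The final list of scenarios is given below.

Scenario | Number of Entities | Number of Encodings per entity
Scenario #1 (S1) | | 1
Scenario #2 (S2) | | 1
Scenario #3 (S3) | 40,474 (full dataset) | 1
Scenario #4 (S4) | | 2
Scenario #5 (S5) | | 2
Scenario #6 (S6) | 40,474 (full dataset) | 2
Scenario #7 (S7) | | 5
Scenario #8 (S8) | | 5
Scenario #9 (S9) | 40,474 (full dataset) | 5
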

Indexation Algorithm

The algorithm used to build an index has implications for the quality of the results (accuracy) as well as for system performance (memory usage and speed). More information on the different approaches can be found in this Pinecone blog article. An up-to-date ANN benchmark can also be found in the well-known GitHub repository by Erik Bernhardsson, with graphical quality/speed representations for popular public datasets.

Qdrant and Weaviate implement only HNSW natively. Thus, the experiment uses HNSW as the sole indexation algorithm [2]. The configuration parameters for HNSW have also been fixed for all engines:

  • building parameters:
    • M: maximum degree of a node = 4
    • efConstruction: size of the candidate list explored during index construction = 8
  • search parameters:
    • ef: size of the candidate list explored during search, which should be larger than the number of results (top_k) = 100
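
The fixed configuration can be captured in a small, engine-agnostic structure. The sketch below is illustrative only: the class name, field names and the example `top_k` of 10 are our assumptions, and the check accepts `ef` equal to `top_k` as the boundary case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HnswConfig:
    m: int = 4                # maximum degree of a node (build time)
    ef_construction: int = 8  # candidate list size during index construction
    ef: int = 100             # candidate list size during search

    def check_top_k(self, top_k: int) -> None:
        """Raise if the search scope cannot cover the requested result count."""
        if top_k < 1:
            raise ValueError("top_k must be at least 1")
        if self.ef < top_k:
            raise ValueError(f"ef={self.ef} must not be smaller than top_k={top_k}")


config = HnswConfig()
config.check_top_k(10)  # hypothetical top_k; passes because 100 >= 10
print(config)
```

Keeping the values in one frozen object makes it harder for two engines to be benchmarked with accidentally different settings.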


Following the principles of statistical analysis, we want every scenario to execute a minimum of 30 times. When querying the index, it is vital to use different queries so that the engines cannot rely on implicit result caches, which would benefit querying speed. Thus, over the 30 runs, we feed the following queries sequentially:

  • 'Berlin';
  • 'Chicago';
  • 'Los Angeles'.
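
A timing loop along these lines could look like the sketch below. It assumes one query per run, rotating through the three queries; another reading of the setup is that each run issues all three. The `search` callable is a stand-in for the engine call, and all names are illustrative.

```python
import itertools
import statistics
import time
from typing import Callable, List, Sequence

QUERIES = ["Berlin", "Chicago", "Los Angeles"]
NUM_RUNS = 30


def time_queries(search: Callable[[str], object],
                 queries: Sequence[str] = QUERIES,
                 num_runs: int = NUM_RUNS) -> List[float]:
    """Time `num_runs` searches, rotating through `queries` so that no two
    consecutive runs issue the same query."""
    durations = []
    for _, query in zip(range(num_runs), itertools.cycle(queries)):
        start = time.perf_counter()
        search(query)
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for an engine call; a real run would embed the query text and
# send the resulting vector to the engine under test.
issued: List[str] = []
durations = time_queries(issued.append)
print(len(durations), statistics.mean(durations))
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, which matters when single queries take only a few milliseconds.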

Milvus (1.1.1)

Milvus is an open-source vector database built to manage vectorial data and power embedding search. It originated in October 2019 and is an LF AI & Data Foundation graduate project. The latest version is Milvus 2.0.0, which is in steady development, with release candidate 8 having been published on 5 November 2021 (at the time of writing of this technical blog). However, when trying to set up this version, the team encountered multiple challenges:

  • Indexing spikes that lead to increases of 2 to 3 times the average time (see this GitHub issue);
  • Errors when setting up scenarios S2 through S9, related to networking problems that are being worked out as the Milvus team progresses towards the Milvus 2.0.0 GA.


While this release shows promise regarding the revamped and additional features, the following comparative analysis uses Milvus 1.1.1.


In Milvus, users index their data in a collection that encompasses several entities. An entity is a record containing several fields or attributes. For efficient retrieval, a collection is divided into partitions, each of which holds several segments. More information can be found in the Milvus glossary.

We have developed a script that handles both Collection and Index creation in Milvus. The results for each run are presented below.
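
The script itself is not reproduced here. As a rough sketch of the settings such a script has to carry, the snippet below groups them into plain dictionaries. The key names follow, to the best of our understanding, the dictionary shapes accepted by the pymilvus 1.x client, but they should be treated as an assumption and checked against the client documentation. The collection name is hypothetical, and the metric type is omitted because it is not stated in this article.

```python
from typing import Any, Dict

EMBEDDING_DIM = 768  # dimensionality of the precomputed embeddings
HNSW_EF = 100        # search-time candidate list size used in the experiment


def milvus_params(collection_name: str, top_k: int) -> Dict[str, Dict[str, Any]]:
    """Group the settings for collection creation, index build and search."""
    if top_k > HNSW_EF:
        raise ValueError("ef must not be smaller than top_k")
    return {
        "collection": {"collection_name": collection_name, "dimension": EMBEDDING_DIM},
        "index": {"M": 4, "efConstruction": 8},
        "search": {"top_k": top_k, "params": {"ef": HNSW_EF}},
    }


# Hypothetical collection name and top_k, for illustration only.
params = milvus_params("startups_repr_0", top_k=10)
print(params)
```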

Milvus 1.1.1 Indexing Time for Scenarios S1 through S9

Milvus' indexing times appear mostly consistent, but, as the number of entities increases, the system struggles to maintain a constant execution time. This pattern is particularly evident in the scenarios indexing the total number of entities (S3, S6 and S9).

Milvus 1.1.1 Average Indexing Time for Scenarios S1 through S9.

In any case, we see growth in the average indexing time. This behaviour comes as no surprise, given that the indexing time for scenarios S4-S9 is the sum of the indexing times of each representation's index. For example, we expected an approximately two-fold growth from scenario S3 to S6 (S6 has double the representations of S3), and the average execution time shows as much. The same logic applies to scenarios S7-S9, where the number of representations is five times that of scenarios S1-S3. Another expected behaviour is the increase in execution time as the number of entities increases. This pattern also resurfaces each time the number of representations changes (in scenarios S1, S4 and S7).
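
The expectation above amounts to multiplying a baseline mean by the number of representations and comparing the result with the observed mean. The sketch below shows that check; the per-run timings are hypothetical placeholders, NOT measurements from this benchmark, and the 25% tolerance is an arbitrary assumption.

```python
import statistics


def expected_time(base_mean: float, representations: int) -> float:
    """Expected mean time when one index is processed per representation."""
    return base_mean * representations


def within_tolerance(observed: float, expected: float, rel_tol: float = 0.25) -> bool:
    """True if `observed` deviates from `expected` by at most `rel_tol`."""
    return abs(observed - expected) <= rel_tol * expected


# Hypothetical per-run timings in seconds, NOT results from the benchmark.
s3_runs = [10.0, 10.4, 9.6]   # one representation
s6_runs = [20.5, 19.8, 20.3]  # two representations

s3_mean = statistics.mean(s3_runs)
s6_mean = statistics.mean(s6_runs)
print(within_tolerance(s6_mean, expected_time(s3_mean, 2)))
```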


The indexes created during indexation had to be explicitly loaded before querying the system, as we detected a warm-up effect during the first run of each scenario. In Milvus, we don't search the index explicitly: the pymilvus package provides a search method that queries the indexes associated with a given collection name. Querying time results for each run are presented below.

Milvus 1.1.1 Querying Time for Scenarios S1 through S9.

As the number of representations per entity increases, the results become less stable, as seen in the spikes for scenarios S7-S9. Note that even the most prominent outlier run takes approximately 0.02 seconds, which is still acceptable query performance.

Milvus 1.1.1 Average Querying Time for Scenarios S1 through S9.

Average querying results show that the number of entities has no noticeable effect on execution time. On the other hand, the number of representations appears to follow the expected linear growth, with scenarios S4-S6 showing a two-fold increase and scenarios S7-S9 a five-fold increase.


Weaviate

Weaviate is another VSE option, chosen for its promise within the vector search paradigm. As is the case with Milvus, it intends to provide several key features that benefit a distributed and performant system (horizontal scaling is expected in Fall 2021).


When indexing data in Weaviate, the first step is to create a Class object, which defines the objects' schema and additional configuration parameters (such as the index algorithm parameters), expressed as a JSON file. Weaviate's documentation does not state explicitly whether an index must be created from an object collection. Implicitly, it appears to index all provided vectors using HNSW. Indexing results are shown below.
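
For illustration, a class definition of this kind could be assembled and serialised as shown below. This is a sketch under assumptions rather than the schema used in the experiment: the class name and property names are hypothetical, and the keys (`vectorizer`, `vectorIndexType`, `vectorIndexConfig` with `maxConnections`, `efConstruction` and `ef`) reflect our understanding of the Weaviate schema format and should be verified against its documentation.

```python
import json

startup_class = {
    "class": "Startup",         # hypothetical class name
    "vectorizer": "none",       # assumption: vectors are precomputed and supplied by the client
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {
        "maxConnections": 4,    # plays the role of the HNSW parameter M
        "efConstruction": 8,
        "ef": 100,
    },
    # Hypothetical property names mirroring the dataset attributes.
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "description", "dataType": ["text"]},
        {"name": "location", "dataType": ["text"]},
        {"name": "pictureUrl", "dataType": ["text"]},
    ],
}

schema_json = json.dumps(startup_class, indent=2)
print(schema_json)
```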

Weaviate Indexing Time for Scenarios S1 through S9.

Indexing results for all scenarios are very consistent with little noise between runs.

As we saw with Milvus, the average indexing time increases, first with the number of entities and then with the number of representations per entity. Here, more clearly than in Milvus, the values for the groups of scenarios with increasing representations (S1, S4, S7; S2, S5, S8; and S3, S6, S9) follow a linear increase.


Analogous to Milvus, query searching in Weaviate is carried out by accessing the specified collection with implicit index information. The results are listed in the charts below.

Weaviate Querying Time for Scenarios S1 through S9.

As with the indexing results, Weaviate hardly deviates from the mean querying time, save for the occasional hiccup. However, during scenario S8, the system did register several consecutive underperforming query searches (see runs 10-13 in scenario S8's chart). We assume that background processes on the machine affected the experiment. The chart below illustrates a slightly different behaviour from Weaviate.

Weaviate Average Querying Time for Scenarios S1 through S9.

Unlike what we've seen for the average querying times in Milvus, Weaviate performs evenly for collections with a changing number of entities. This is especially visible in the scenario groups S1, S2, S3 and S4, S5, S6. This behaviour will also be of interest should the experiment evolve to include a much larger number of entities. We still need to compare the exact numbers with the competition. The following section addresses direct comparisons between the studied engines.

This is the first part of a 2-part article. 
You can read about the results analysis and the conclusions here.