Breaking Cache Conventions

The standard caching mechanism is well known: placing copies of frequently used data for easy access makes a system faster. But you can fine-tune this for your use case, often reaping huge performance rewards.
This is the story of how we fine-tuned the cache for our microservice, bringing a 26-fold performance improvement.
Meet FABS
Our team is responsible for the FARFETCH A/B Service (FABS). When Program Managers/Owners plan an improvement in any area of the FARFETCH platform, they can run a test to verify whether their idea does, in fact, make a positive impact.
Most often this test, or experiment, takes the form of an A/B test. A random half of the users/participants are exposed to the “current” version of that area (let’s call it control or variant A). The other half are exposed to the modified version (let’s call it treatment or variant B). You also measure a pre-defined success metric across your test participants, which may be the use of a feature, making a purchase, or something else. After some time (typically weeks or months, for the test to reach statistical significance) you should have enough data to tell you whether B is, in fact, an improvement upon A. It should also tell you by how much and with what statistical significance, or the likelihood that the difference wasn’t just due to random chance.
FARFETCH has published an entire e-book on the subject of Experimentation, so feel free to dig in if you’re interested!
From a technical perspective, this sounds simpler than it is. User assignment to variant A or B cannot be truly random, as a specific user must always be shown the same variant. Otherwise, users are likely to experience both variants (sometimes called contamination), and then how would you tell which variant was best? Some more advanced experimentation features (such as calculating the combined impact of a group of experiments) require excluding a few random users from the test. You also need a quick way to start/stop tests as your results start coming in. All of this points to the need for a centralized location (read: a service) that holds experiment state and provides a variant to the caller whenever a participation in an experiment is requested. Thus FABS was born.
As multiple experiments can be run at the same time and in high-traffic touchpoints of our platform, FABS needs to be both reliable and performant.
Setting the Baseline
The Experimentation Team (now playfully called the Ducks Team in the Experimentation Cluster) was formed in May 2018. Our first job was to stabilize a successful prototype version of FABS while keeping an eye on performance. FABS had a response time goal of 50ms: our clients would give up waiting after that time, to keep the overall platform fast.
We inherited the following microservice architecture.

FABS would store experiment information in both SQL Server and Redis. The details aren’t important, but only Redis was used at runtime to get the information required to satisfy each request.
Redis is an in-memory cache system. Its use as a separate service was a brilliant way to ensure all FABS containers (running instances for load balancing) shared the exact same experiment information. Unfortunately, with Redis as a service, calling it incurred the cost of going through all the software and hardware layers for network communication. This added a few precious milliseconds to our FABS response time. It also added a new point of failure for FABS.
Could we improve upon this?
We believed so, but this would mean not using a network service for caching. Instead, we'd use FABS’ in-container memory. This has perils of its own.
Considering a Memory Cache
There are three major caveats when using memory for caching:
- Memory Space is Limited
Each container has limited memory space to hold this cache. Will enough data fit for the cache to be effective?
- Memory is Volatile
The cached information will be lost between FABS deploys/restarts. What will happen with cache writes that have not yet been flushed to persistent storage?
- Memory isn’t Shared
Each container will have its own in-memory cache which might not be in sync with the other containers’. Out-of-sync containers would mean inconsistency in the behaviour of the service as a whole.
Memory as a cache isn’t a perfect solution, so we need to make some compromises, as we’re very interested in its performance potential.
Memory Space is Limited:
How Much Cache Memory Will We Require?
Container memory is a limited premium resource and should not be wasted. As a container handles more and more parallel requests, it requires more memory to hold the state of each request. Cache growth could exhaust the available memory and limit a container’s ability to operate or handle the additional load.
The big question is, “How much memory do we need for the cache?”
Our cache only needs to hold the experiment state. We don’t need to store the user IDs to remember what variant we assigned them. This is because we follow common practice in the Experimentation field: we ensure each user is provided with the same variant by using a hash function rather than a random number function.
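As an illustration (a minimal sketch with names of our own choosing, not FABS’ actual code), hashing the user and experiment identifiers together and mapping the result onto the variant weights gives every user a stable variant without storing anything per user:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants: dict[str, int]) -> str:
    """Deterministically pick a variant; `variants` maps id -> weight, e.g. {"A": 50, "B": 50}.

    The same (user_id, experiment_id) pair always hashes to the same bucket,
    so no per-user assignment ever needs to be cached.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % sum(variants.values())
    for variant_id, weight in variants.items():
        if bucket < weight:
            return variant_id
        bucket -= weight
    return variant_id  # not reached while weights are positive

# assign_variant("user-42", "exp-checkout-cta", {"A": 50, "B": 50}) always returns the same answer
```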
Holding the experiment state means storing, for each experiment, the following (sketched in code after this list):
- Unique experiment identifier — a short string, typically about 20 to 30 characters long.
- Variant identifiers and weights — short strings (about 10 characters on average) and integer numbers; a typical experiment has only two variants with 50%/50% weights.
- Additional state information — this has grown through the years, but it currently accounts for less than 1 KB of information.
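A rough sketch of that per-experiment record (field names are illustrative, not FABS’ actual model):

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentState:
    """Roughly 1 KB per experiment: identifier, variants and weights, extra state."""
    experiment_id: str                                    # short string, ~20-30 characters
    variants: dict[str, int] = field(                     # variant id -> weight,
        default_factory=lambda: {"A": 50, "B": 50})       # typically two variants at 50%/50%
    status: str = "Draft"                                 # Draft / Running / Paused / Ended
    extra: dict[str, str] = field(default_factory=dict)   # additional state, < 1 KB
```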
Say that each experiment requires about 1 KB of memory to be represented. At the time we had about 1,000 experiments in our database, with more being created at an increasing pace. Those 1,000 experiments would take about 1 MB (1,000 × 1 KB) of memory. Even if we grow to have 1,000× more (i.e., 1M) experiments, this would require about 1 GB of memory. This is reasonable for our containers for the foreseeable future.
The cache will fit into memory even with a full representation of all Experiments.
Memory is Volatile:
Deciding on a Caching Strategy for Coherence
Memory does not persist state, as each container can be restarted or redeployed. But this is a typical situation for a cache. CPU and hard drive caches are examples of these volatile caches. On restart/redeploy, our implementation will need to retrieve experiment information held on persistent storage to ensure correct operation.
However, we should also write any cached changes back to persistent storage promptly; any delay increases the chance that important changes are lost on a restart/deploy. This precludes the write-back cache coherence mechanism, whereby information is only written to the cache and a background operation eventually writes it to persistent storage, precisely the kind of delay that risks data loss.
Alternatively, the write-through mechanism ensures better coherence, but its writes are slow because the cache waits for each write operation to complete on persistent storage before responding. So how did we decide on the right model?
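Before answering that, here is a minimal, generic sketch of the two mechanisms (not FABS code; `storage.save` is an assumed interface):

```python
class WriteThroughCache:
    """Write-through: every write waits for persistent storage, so the cache and
    storage never diverge, but writes are slower."""

    def __init__(self, storage):
        self.storage = storage
        self.entries = {}

    def write(self, key, value):
        self.storage.save(key, value)   # caller waits for the slow storage write
        self.entries[key] = value


class WriteBackCache:
    """Write-back: writes only touch the cache and are flushed later in the
    background, so a restart before the flush loses the change."""

    def __init__(self, storage):
        self.storage = storage
        self.entries = {}
        self.dirty = set()

    def write(self, key, value):
        self.entries[key] = value       # fast: memory only
        self.dirty.add(key)             # persisted later, by flush()

    def flush(self):                    # run periodically in the background
        for key in list(self.dirty):
            self.storage.save(key, self.entries[key])
            self.dirty.discard(key)
```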
A cache is meant to accelerate access to a data repository. Access to the FABS data repository needs to support two types of operations, roughly as follows:
- Participation in an experiment — We expect tens of thousands of participation requests per minute. These are performance-sensitive at the microservice level (milliseconds). They require read access to experiment data.
- Experiment management — These requests are infrequent. They are performance-sensitive at the human level (seconds). They require read/write access to experiment data.
In our use case, the cache is critical to ensure the service level expected for Participation, but our persistent storage (Cassandra) should be able to handle the necessary response times for experiment management on its own. Any delays there should not be a big concern.
We therefore opted for a logical split in our service between Participation and Management operations (sketched in code after the list below).
- Participation operations use this cache to satisfy requests; they do not require writes.
- Management operations do not use a cache and talk directly to persistent storage.
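In pseudo-code, the split could look like this (hypothetical names; the cache and storage objects and the assign_variant function from the earlier sketch are assumptions, not FABS’ actual interfaces):

```python
class FabsService:
    """Sketch of the split: Participation reads only from the in-memory cache,
    Management talks directly to persistent storage."""

    def __init__(self, cache, storage):
        self.cache = cache        # local in-memory cache, read-only here
        self.storage = storage    # persistent storage (e.g. Cassandra)

    def participate(self, experiment_id: str, user_id: str) -> str:
        # Participation: served entirely from the cache; no writes, no storage round-trip.
        experiment = self.cache.get(experiment_id)
        return assign_variant(user_id, experiment_id, experiment.variants)

    def get_experiment(self, experiment_id: str):
        # Management read: straight from persistent storage, bypassing the cache.
        return self.storage.load(experiment_id)

    def update_experiment(self, experiment_id: str, state) -> None:
        # Management write: straight to persistent storage; each container's cache
        # picks the change up later via its own refresh.
        self.storage.save(experiment_id, state)
```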

What happens, though, when an experiment changes state via the management endpoints? Is the cache updated somehow? We address this in the next section, where we come upon some unexpected solutions.
Memory Isn’t Shared:
How to Ensure Data Consistency?
Each container is an isolated set of processes with its own memory quota; containers do not automatically share memory. Redis is an example of a shared memory cache, which we’re trying to improve upon by using local container memory. With local memory as the cache, each container’s cache is independent, and we therefore have a distributed cache system.
The problem here is that different cache copies may hold (slightly) different information, depending on when they were last updated. Different information leads to inconsistent FABS responses: if one container sees an experiment state as Running (i.e., returning random variants) and the next container sees the same experiment as Ended (i.e., returning a fixed variant), they may return different variants for the same user. This could expose each user to multiple variants at different points in time, and we won't be able to tell which of these governed their behaviour. Left unchecked, this leads to data contamination and unusable results.

Inconsistency occurs when data changes in persistent storage.
The problem with data changes (or data writes) in a distributed cache is the round-trip of information: how long does it take for a write to make it out of the container where it originated (the actual data write) and into all the other containers’ caches for an eventual data read?
Data Writes:
When the Solution to a Caching Problem Is To Not Use a Cache
We’ve solved this consistency challenge in the previous section by making our cache read-only for Participations. Meanwhile, all writes made by experiment management go directly to persistent storage, minimizing the data write delay.
If we were still to consider supporting writes on the cache, the write-back cache coherence mechanism would not be suitable. This is because delaying permanent storage updates would increase the time other containers are out-of-sync from these changes. There are known mechanisms to keep the containers in sync in these cases, but they would require some sort of communication between them, adding to the complexity and points of failure for this cache.
Write-through could be suitable. However, even simple human-time-scale operations may not work as intended. Take this example. A user makes a change to an experiment state and pushes a “save” button on a UI that requests FABS to change that state.
- One of the FABS containers (say container #3) receives that request. It then changes its cache as it writes the change to persistent storage.
- The visual outcome of successfully saving the experiment is for the UI to go back to normal “summary view” mode, re-fetching the experiment state from FABS. A new request to FABS is made.
- Container #2 receives the request, but it uses its copy of the cache that doesn’t (yet) reflect the changes made by container #3.
- The user becomes confused: their summary view updates with stale cache data, and they don’t see the changes they just submitted.
This reinforces our decision to make all experiment management operations go directly to persistent storage without going through a cache. This includes operations that read back from persistent storage, eliminating the above problem.
But the cache behind the Participation endpoints will take some time to refresh, and it must still account for changes.
Data Reads:
When No Solution Is the Best Solution
As each container refreshes its memory cache at slightly different times after an experiment changes state, FABS will return inconsistent responses until all containers have refreshed their caches. This means a user could be exposed to multiple variants and we can’t tell which governed their behaviour.
However, a consistent operation is only critical while an experiment is Running.
A transition from Draft to Running, or from Running to Paused/Ended, could expose users to other variants. But this is by design and expected. No other changes are allowed while an experiment is Running.
Our solution? After discussing this with Product, we decided to ignore the problem. We made the compromise of accepting that inconsistent behaviour may occur between cache refreshes if an experiment is updated. To minimize its impact, we use a small cache time-to-live (TTL), i.e., a high refresh rate of once every five seconds. We could improve this by having all FABS containers attempt to refresh at the same wall-clock time, but this would also negatively impact our persistent storage, as all containers would hit it at once.
Improving Latency and Resiliency
Your typical cache contains “entries”. Each cache entry represents a copy of one item (e.g., one experiment) in persistent storage. As the service requests items from persistent storage, the cache saves a copy for later reuse. This is also called read-through or lazy loading.
As the cached copy of an entry may (eventually) become out-of-date, it should be re-fetched from persistent storage. To track this, caches set each entry to expire some time after being fetched, a period called the time-to-live, or TTL. After that time elapses, the entry is considered stale and discarded, and any new request from the service will restart the process with a fresh copy of the entry.
As our memory cache will be able to hold the entire set of entries from persistent storage, we can disregard cache eviction strategies used when a cache runs out of internal space for new entries.
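For reference, a minimal sketch of this typical lazy-loading, TTL-based pattern (generic, not our implementation; `storage.load` is an assumed interface):

```python
import time

class LazyTtlCache:
    """Read-through cache: fetch an entry on a miss and expire it after a TTL."""

    def __init__(self, storage, ttl_seconds: float):
        self.storage = storage            # assumed to expose load(key)
        self.ttl = ttl_seconds
        self.entries = {}                 # key -> (value, fetched_at)

    def get(self, key):
        cached = self.entries.get(key)
        if cached is not None:
            value, fetched_at = cached
            if time.monotonic() - fetched_at < self.ttl:
                return value              # cache hit: fast path
        value = self.storage.load(key)    # miss or stale entry: slow path, may fail
        self.entries[key] = (value, time.monotonic())
        return value
```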
The typical cache process has two main drawbacks:
- The initial request for a non-cached entry is slow (i.e., has high latency), as the entry needs to be retrieved from persistent storage. This is why services using this approach have a “warm-up” time after a restart/deploy, during which they are a bit slower while the cache is being populated. As our team deploys more and more frequently through Continuous Deployment, this becomes more of a problem.
- If persistent storage fails at any time, requests for uncached entries will also fail. As cache entries expire, eventually the entire service fails until persistent storage is back up.
The solution for these drawbacks is straightforward:
- As all our persistent storage entries fit into the cache, we can simply preload them all. We can then refresh them in the background at a regular interval. This is an extended form of cache refresh-ahead.
- We never discard cache entries, so it’s easier to handle persistent storage failures: we just retry at a later time. The cache is still intact (albeit perhaps a bit out-of-date), but the service is operating until the problem is solved.
This reduces latency and eliminates “warm-up” time while simultaneously improving resiliency. It also vastly simplifies our cache operation.
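Putting the two ideas together, here is a minimal sketch of what such a preloaded, background-refreshing cache could look like (illustrative names and a hypothetical load_all() storage method; the five-second period comes from the TTL discussed earlier, and our actual implementation differs in detail):

```python
import threading
import time

class PreloadedCache:
    """Holds every experiment in memory and refreshes the whole set in the background.

    Reads never touch persistent storage, so every read is a hit. If a refresh
    fails, we keep serving the previous (slightly stale) snapshot and retry later.
    """

    def __init__(self, storage, refresh_seconds: float = 5.0):
        self.storage = storage                    # assumed to expose load_all() -> dict
        self.refresh_seconds = refresh_seconds
        self.entries = storage.load_all()         # preload everything: no warm-up misses
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            time.sleep(self.refresh_seconds)
            try:
                # Swap in a fresh snapshot; readers see either the old or the new dict.
                self.entries = self.storage.load_all()
            except Exception:
                # Persistent storage is unavailable: keep serving the current
                # snapshot (a bit out-of-date) and try again on the next cycle.
                pass

    def get(self, key):
        return self.entries[key]                  # always a cache hit
```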
FABS Memory Cache
Our cache is therefore not your typical middle-man between the service and persistent storage.
In your typical cache, the service makes all requests to the cache. The cache then determines whether it needs to fetch that data from persistent storage (a cache miss) or not (a cache hit).

FABS’ memory cache is only used for some FABS endpoints, as we’ve shown above, and it refreshes its entries in the background on its own.

FABS’ cache has a 100% hit ratio.
The cache is also greatly simplified. This is a code block diagram comparison of using a pre-existing cache framework vs. using our own local cache implementation.

Our choice was therefore to develop our own cache.
Final Performance Numbers
Many changes to FABS were applied concurrently with this project, so it’s hard to say decisively that the improvements are entirely due to the new cache. However, we believe our performance figures mostly reflect the impact of the new cache.
We measured the average, 95th and 99th percentiles, and maximum duration of requests made to both the old and new FABS. We then calculated an improvement ratio, defined as how much more work can be done in the same amount of time, i.e., (old/new - 1) expressed as a percentage. This would be zero for matching response times. Positive values reflect improved response times, with +100% (for example) representing the ability to do twice as much work, i.e., half the response time.
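With purely hypothetical numbers (not our actual measurements), the ratio works like this:

```python
def improvement_ratio(old_ms: float, new_ms: float) -> float:
    """How much more work fits in the same time, as a percentage: (old/new - 1) * 100."""
    return (old_ms / new_ms - 1) * 100

improvement_ratio(100, 100)  # ->   0.0  (matching response times)
improvement_ratio(100, 50)   # -> 100.0  (twice as much work, i.e., half the response time)
improvement_ratio(100, 10)   # -> 900.0  (ten times as much work)
```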
The improvement was massive. Averaging the improvement ratio of all 99th percentile measurements, we saw an improvement of over 26× compared to the previous version of FABS!
Our current median response time is under an astounding 1ms:


Key Learnings
A well-chosen cache can vastly improve your service: it improved our response time 26-fold. These are some of the key learnings we would like to share with you:
- Just as a single shirt size doesn’t fit every person, standard solutions to known problems (such as caches) may not be the best fit for your needs. Understand the problem you’re trying to fix and simplify your solution to the essential functionality. Complexity is often inversely proportional to performance.
- Some requirement tradeoffs in specific areas may be massive unblockers and timesavers, allowing the team to deliver much faster while retaining most of the features.
- Watch out for distributed processing and resilience tradeoffs. Memory caches are faster, but service instances can become out-of-sync. A shared cache solves this. But if it fails, all your instances may fail as well.
- There is no need to reinvent the wheel: code reuse via libraries/frameworks is a very good thing. But again, understand your problem. The effort to adapt and maintain a library/framework for your custom needs may prove penny wise but pound foolish.
- Last but not least, remember to benchmark the current state of the service, so you can later on measure and understand the real improvements and value of a cache you’ll have to maintain.