The Art of Failure II - Graceful Degradation

Continuing on from the previous post (Part One), we will now switch to a web application that needs to render content to the end user. We can use farfetch.com as an example.
Farfetch.com has several microservices responsible for fetching the data needed to render each page, and each service consumes several other services in order to fulfill its job. So we are still in the same reality described in the previous post, which means the "Policy of good neighbours" still applies 100%. But there’s an extra catch now: how do we communicate failure in the proper way? The answer naturally depends on the type of failure, but whatever the case, it means a degradation of the user experience.

Degradation can take many forms, including doing nothing when something fails (data not loaded, or an operation mutating data failed). What really takes it to another level is putting the best possible user experience at the heart of everything, including failure. Failure is inevitable, so when designing our products we need to account for failure scenarios; not everything is a happy path. Product managers, together with designers, should design the product in several phases: when the product is loading (in this case the site, or a specific part of it), when it finishes loading, and when it fails. If this is part of the design, it’s already a good start for building better, more robust software.
So we know that failure forces a degraded experience, but this doesn’t necessarily mean a decontextualized error page stating that something failed. For more extreme failures that might be the only possibility, but we don’t have to take the same approach for every failure. Especially if the web application is part of a microservice architecture, there’s probably still a lot to show when something fails. Users engage with what they see; if they don’t see anything relevant, or if the experience is too contaminated, engagement drops and the user goes away. There’s so much we can do to keep the experience positive.


[Image: a farfetch.com product page, with the site header and navigation at the top and the main product content below]

For instance, if we look at the above page, there’s the header section with all the navigation for the site, and then there’s the main content that shows the product. There’s also a footer section that’s not visible.
There are specific services responsible for fetching the data for each of these sections of the page. If one of these fails, should the whole experience fail? To answer this question, we need to look at it from the user’s perspective. We want a site that feels consistent and predictable in its behaviour. A golden rule is to avoid completely changing the look and feel of the site whenever we can, as this reduces engagement. So, taking the page above, if the main content fails we can try to gracefully degrade the experience. Graceful means we switch off features but preserve everything else that is not affected. Every area of the site should follow the same pattern and switch into a degraded state when it is affected by failure, as the sketch below illustrates.
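A minimal, framework-agnostic sketch of this per-section degradation (the Section shape, the loaders and the fallback markup are all hypothetical, not farfetch.com’s actual implementation):

```typescript
// Each page section loads its own data; a failure degrades only that section.
interface Section {
  name: string;
  load: () => Promise<string>; // resolves to the section's HTML
  fallback: string;            // degraded state shown when load() fails
}

async function renderPage(sections: Section[]): Promise<string> {
  const rendered = await Promise.all(
    sections.map(async (section) => {
      try {
        return await section.load();
      } catch {
        // Only this section switches into its degraded state;
        // the header, footer and every other section render as usual.
        return section.fallback;
      }
    })
  );
  return rendered.join("\n");
}
```

Whether this runs on the server or in the browser, the principle is the same: the blast radius of a failure is one section, not the whole page.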

Even if the main content fails, there’s no reason for the site navigation to be affected. The site was not able to load the product, but the user can still move around as before; that was not taken away. Another golden rule here is not to hide that there was a failure, because users may grow suspicious if we try to cover it up in some elaborate way. It’s better to come clean and explain, in a very clear and visible way, that something unexpected happened. At the same time, we can take the chance to offer an "escape" route that gives some guidance on what to expect and/or what to do next. This is where things get interesting, as we can provide good fallback content to keep the user engaged. The user should not feel they have hit a dead end. If they were looking for a product, we can cross-reference their navigation or shopping history to point them to products or content that are meaningful to them. Amazing customers is a job that never takes a day off.
Other good examples of graceful degradation that show strong signs of resilience are the following:
- If the page is rendered on the server side and rendering fails, falling back to client-side rendering (if possible) can be incredibly beneficial.
- If the site has a problem authenticating users, switch them to guest users so they can continue to navigate and shop.
- Have a fallback for each request, as no request is too important to fail.
- If the user loses connectivity while navigating, transition to an offline mode that serves cached content using a service worker.
- Switch between serving high-definition and low-definition content based on the user’s network, using the Network Information API (see the sketch after this list).
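As an illustration of that last point, here is a minimal sketch of network-aware quality switching (the data-src-hd/data-src-sd attributes and the loadProductImage helper are hypothetical; the Network Information API itself is only available in Chromium-based browsers, so the code falls back to high definition elsewhere):

```typescript
type Quality = "high" | "low";

function pickQuality(): Quality {
  // navigator.connection is the Network Information API; it is not part of
  // TypeScript's standard DOM typings yet, hence the cast.
  const connection = (navigator as any).connection;
  if (!connection) return "high"; // API unsupported: assume the best
  const slowTypes = ["slow-2g", "2g", "3g"];
  return connection.saveData || slowTypes.includes(connection.effectiveType)
    ? "low"
    : "high";
}

// Hypothetical helper: choose between pre-declared HD and SD image sources.
function loadProductImage(img: HTMLImageElement): void {
  img.src = pickQuality() === "high" ? img.dataset.srcHd! : img.dataset.srcSd!;
}
```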
Offline as a business driver
Offline is not an exception; it is the norm. It’s true that we are more connected than ever before, but what happens when users go down into the subway or board an airplane? What about that part of the house where we can’t get any signal? The truth is, we carefully craft our user experience and then realize it falls short and becomes a source of disappointment, because it depends too much on an elusive connection to a server. The lack of a network connection is a harsh reality: even the fastest phone on the planet will do you no good, because at some point, as you move along, the connection will be lost. Having an offline-first approach is a smart way to turn things around for the better. And "better" here applies to both sides: whoever is running a business (think of an e-commerce site like farfetch.com) and the user, who can still consume an experience even while disconnected.
There are great examples out there where offline really shines. Everyone has probably already played with Chrome’s dino when there’s no internet. Games can be addictive, which makes them a great weapon for reducing the frustration of not being connected, to the point where it becomes a good experience. When this happens, it’s a special moment, as the site is able to switch between both modes in an effortless way, offering a pleasant experience.
Trivago took a similar approach, switching into a play mode when users had no connection.

I personally believe that these offline moments are precious and shouldn’t be wasted. Especially if we are talking about an e-commerce business, we can take the time to get to know our users/customers a little better. When the user loses connection, the site switches to an offline mode, starting a game like the one illustrated next.

[Image: the offline game, showing one product at a time with "I like it" / "I don’t like it" options]

It’s a simple game, as we can’t have anything too elaborate, and it requires no explanation: it’s a simple "I like it" / "I don’t like it" kind of thing. The binary choice is there to keep things simple; it could, however, be extended with more nuance ("I love it", "I neither love nor hate it") to better assess what really resonates with the user’s taste. And that’s it, a very direct approach towards the user: we are literally asking them what they like or dislike, going from one product to the next. We rarely get this chance in online mode, even though there, too, we are trying to give our customers what they like the most. The answers provided by the user are stored locally on the device and sent to the server once the connection is re-established.
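A minimal sketch of that "store locally, sync later" flow (the Answer shape, the storage key and the /api/answers endpoint are all hypothetical):

```typescript
interface Answer {
  productId: string;
  liked: boolean;
  answeredAt: number;
}

const ANSWERS_KEY = "offline-game-answers"; // hypothetical storage key

function recordAnswer(answer: Answer): void {
  // Persist each answer on the device so it survives a closed browser
  // or a switched-off phone.
  const pending: Answer[] = JSON.parse(localStorage.getItem(ANSWERS_KEY) ?? "[]");
  pending.push(answer);
  localStorage.setItem(ANSWERS_KEY, JSON.stringify(pending));
}

async function flushAnswers(): Promise<void> {
  const pending: Answer[] = JSON.parse(localStorage.getItem(ANSWERS_KEY) ?? "[]");
  if (pending.length === 0) return;
  // Send everything collected offline once connectivity returns.
  await fetch("/api/answers", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(pending),
  });
  localStorage.removeItem(ANSWERS_KEY);
}

// Flush whenever the browser reports that the connection is back.
window.addEventListener("online", () => void flushAnswers());
```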
Architecture-wise, a service worker is the way to go if you want to make this approach a reality. The following snippet shows how a service worker can respond with an HTML page that was previously cached during the service worker’s life cycle.
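(A minimal sketch of such a service worker; the cache name and the /offline-game.html URL are hypothetical, and the file assumes TypeScript’s webworker typings.)

```typescript
declare const self: ServiceWorkerGlobalScope;

const CACHE_NAME = "offline-v1"; // hypothetical cache name
const OFFLINE_PAGE = "/offline-game.html"; // hypothetical URL

self.addEventListener("install", (event: ExtendableEvent) => {
  // Pre-cache the offline game page while we still have connectivity.
  event.waitUntil(
    caches.open(CACHE_NAME).then((cache) => cache.add(OFFLINE_PAGE))
  );
});

self.addEventListener("fetch", (event: FetchEvent) => {
  // Only intercept page navigations; other requests follow their normal course.
  if (event.request.mode !== "navigate") return;
  event.respondWith(
    fetch(event.request).catch(async () => {
      // The network is unreachable: serve the pre-cached offline game page.
      return (await caches.match(OFFLINE_PAGE)) ?? Response.error();
    })
  );
});
```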
The service worker is able to detect that there’s no network connection and serve the offline game page. This page does the rest of the magic by showing a set of products that were also pre-cached. The cached products don’t have to be random; random would be perfectly valid if we don’t know the user, but for known users we can take advantage of the server and surface products that are tailored to them (possibly based on machine learning). The web application caches these products, and they can be refreshed in a periodic background sync. The answers from the user are stored on the user’s device (they can close the browser or switch off the device) and can be submitted back to the server in the same way. The idea is to feed the machine learning algorithms so they can go deeper and deeper, unlocking the next level of personalization. The user wants to feel special, and all of this is orchestrated towards providing a superior user experience.
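A hedged sketch of that refresh, using the Periodic Background Sync API (Chromium-only and permission-gated; the tag name is hypothetical, and the cast is needed because TypeScript’s DOM typings don’t cover periodicSync yet):

```typescript
async function registerProductRefresh(): Promise<void> {
  const registration = await navigator.serviceWorker.ready;
  const periodicSync = (registration as any).periodicSync;
  if (!periodicSync) return; // API unsupported: skip silently
  try {
    // The service worker would listen for the matching "periodicsync" event
    // and re-fetch the recommended products into the cache.
    await periodicSync.register("refresh-recommended-products", {
      minInterval: 24 * 60 * 60 * 1000, // at most roughly once a day
    });
  } catch {
    // Permission denied, or the browser decided not to schedule it.
  }
}
```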
This concept fits into a simplified architecture, represented below: the web application gets recommended products to populate the browser’s cache, and the user’s answers (collected while offline) feed the system with more to learn from (once back online), which in turn produces even better recommendations in the future. It’s a system constantly and autonomously feeding itself.

[Diagram: the web application, the recommendations service, and the feedback loop between them]

This offline strategy is a testament to resilience and a prime example of how we can use failure for the sake of a bigger business and happier users.
