This website uses cookies. By using the website you agree with our use of cookies. Know more

Technology

The language of infrastructure

By Gaurav Gupta
Gaurav Gupta
A self-professed geek with a secret passion for building products. Can’t admire a chronograph by Bell & Ross enough.
View All Posts
The language of infrastructure
Farfetch’s mission is to be the global technology platform for luxury fashion. We sell products from over 1000 boutiques and brands from around the world, ship to customers in 190 countries with the help of 3000 employees in 13 different offices. Our platform runs on an active-active geo-distributed infrastructure consisting of thousands of servers and petabyte scale storage, spanning multiple data centres and connected over redundant SD-WAN networks. The nature of this global setup and scale brings unique challenges and opportunities. One of those challenges is how we manage this Infrastructure. We have been going through a transformation of our Infrastructure at Farfetch lately. Why? We simply grew out of our current methods, which weren’t rudimentary to begin with.

Our perspective

Many moons ago in the enterprise world, scaling up by default meant vertical scaling or simply creating a bigger machine. Popular storage, server and networking companies would take pride in announcing their next bigger, better, denser appliance. Scaling up had to consider some physical limitations like power, cooling, rack density, etc. Cloud computing changed everything for everyone; Infrastructure became a commodity, servers became cattle and scaling by default became horizontal. Most engineers never even get to see a physical server or a switch or a storage array anymore, let alone understand the complexities of a datacenter, yet we deploy thousands of servers and petabytes of storage across geographically distributed data centres.

Cloud computing was supposed to be the holy grail of Infrastructure as a service like electricity or clean water, yet Infrastructure is still seen as a bottleneck to ship code fast. The culture of DevOps was created to improve upon this, but it’s hard to maintain it consistently beyond a scale. Some of the side effects of this problem are the disregard for Infrastructure efficiency, inflated costs, complex maintenance of the running applications, unknown security holes and broken experience of Infrastructure-as-a-service to its customers.

Challenges

Our current challenges start with the scale and speed at which we are innovating and executing. There are new ideas born every day that require experiments to validate and verify. Along with them, there is a long list of features and roadmap items, growth hacks, new business units and reduction of technical debt, etc. If you imagine this as a funnel, Infrastructure is at the bottom trying to keep up with this frenzy. The micro-services architecture allows the development teams to work relatively independently. However, a similar contract doesn’t exist with the Infrastructure. This creates a problem where the process is not well understood, documentation is obsolete very quickly, communication is broken, and the whole machinery starts to slow down.

It’s conventional wisdom now that Infrastructure should invest in automating the repeated tasks, build more self serve tools and create an ownership model for micro-services, but this problem is deeper and more complex. There are challenges related to communication, visibility, prioritisation of tasks, maintaining standards, security, access control, auditing and compliance, auto-scaling, Infrastructure cost, monitoring, data protection, etc. We understood that building point solutions would not take us very far before we hit another scale that would make them obsolete. We wanted to build a single framework that defines all our Infrastructure, a single point of truth which is distributed, open for all, always up-to-date and from where all of the Infrastructure requests are made and served automatically.

The Infrastructure Blueprints

Infrastructure as code is not a buzzword anymore. There are plenty of amazing systems and literature about challenges and approaches to managing large scale Infrastructure of cloud scale and increasing developer productivity. Technologies like containers, Kubernetes, Terraform, Consul, Vault, Chef, Salt, Jenkins, etc., or their equivalent exists in a very mature shape to help build the automation of Infrastructure. Our approach is essentially a semantic glue that puts it all together in one platform.

Before we embarked on this journey, we spoke to a lot of software developers and Infrastructure engineers. I don’t recall any conversation where both of the following statements were inaccurate:
  • We have automated our Infrastructure task in Ansible, Chef, Puppet, Salt or something similar.
  • It is a challenge to maintain the services and infrastructure as well as keep up with the new requirements due to the lack of visibility.
The above two statements sound contradictory to me because automation should enable productivity and visibility. The point is not that the automation is not good enough, but that it was solving individual pieces of the puzzle. The silos of tools and information make it impossible to have a single view of Infrastructure and applications. Also, the method of interaction with Infrastructure via meetings, wikis, e-mails, tickets, etc., is not very efficient to describe and track a technical requirement.

So we built a system that offers an interface to define all the Infrastructure requirements for each service or application in one place. This interface acts as a blueprint, which captures all the aspects of the services that interact with Infrastructure in a well-defined, source-controlled, always up-to-date framework. This blueprint is created and owned by the developer of the service, putting them in the driver’s seat. Any new blueprint or update is reviewed like code and once merged triggers the automation built by Infrastructure engineers, so the consistency and standards can be followed and change management is built in. Any new requirement immediately becomes a feature request and can be applied to all once solved.

Here is a sample blueprint for reference.
```
version: 1
platform: commerce
boundary: product
name: preorder
description: Handles product preorder details
maintainers:
- jane.doe
- john.doe
project_type: service
tech_type: dotnet_core
repository: https://code.fftech.info/commerce/product/preorder.git
ports:
loadbalancer_ports:
- 8080
service_ports:
- 8080
telemetry:
active_healthcheck: /monitoring/activecheck
deep_healthcheck: /monitoring/deepcheck
alerting:
contacts:
email: commerce-product-preorder-group@farfetch.com
slack: commerce-product-preorder
alerts:
- alertname: AlertName
expression: farfetch_windows_cpu_used_percent{} > 70
for: 10m
notify:
- environment: prd
routes: [slack, email]
description: server {{ $labels.instance }} is with CPU over 70%.
troubleshooting: https://wiki.farfetch.net/wiki/
dependencies:
- platform: commerce
boundary: infra
name: cassandra
specs:
keyspace: commerce_product_preorder
- platform: commerce
boundary: product
name: api
security:
context: standard
environment_specific:
- environment: dev
datacenter: we1
domain: commerce-product-preorder.farfetch.com
scale:
min_instance_count: 2
max_instance_count: 4
min_cpu: 2
max_cpu: 8
min_memory: 2
max_memory: 8
credentials:
commerce_infra_cassandra_commerce_product_preorder_username: encrypted value
commerce_infra_cassandra_commerce_product_preorder_password: encrypted value
```

At a high level the blueprints define the following aspects of the Infrastructure and the service:

Identity
: The blueprints provide a clear and absolute detail about the service. For example, it includes the name and function of the service, location of the source code, owners and maintainers details, metadata needed to build, deploy and test the application, etc. By using this information, build pipelines are created automatically, DNS and service discovery records are updated, and so on.

Environment Provisioning: The blueprints provide the details about the run-time characteristics of the service including resources like the network, compute, storage, ports, credentials, access control, scale requirements, etc. This information is used for Infrastructure provisioning, auto-scaling, auto-healing, configuring security groups and load balancing, and many other finalities.

Monitoring: The blueprints provide the details to manage and monitor the service in different environments like health check configuration, logging, monitoring and alarms. This information is used to configure telemetry, logging platforms like ELK, alarms and notifications via pagerduty, slack, etc.

Dependencies Management: Each blueprint lists all external and internal dependencies that the service depends on. This allows us to provide service discovery, routing, network security and datastore management while at the same time the graph acts as a central piece when discussing macro-architecture of F-Tech.

Stay tuned for more details about our Infrastructure and each of these areas in subsequent blogs!

Related Articles