
Objectives:

7. Site Reliability Engineering

The Problem

The traditional approach is to have a team of sysadmins who are responsible for deploying services to production and maintaining them.

As illustrated in the following diagram, there are two separate teams: the software team, which develops the applications, and the sysadmin team, which deploys and maintains the applications in production.

dashboard image

The problem with this approach is that as more services are integrated or developed, the sysadmin team has to keep growing, which drives up direct costs.

There is also an indirect cost at the organization level: most of the time, software teams and sysadmin teams have different objectives and priorities, and use different terms to describe the same event.

Over the years, Google launched more and more services and experienced the limitations of the traditional approach first-hand.

By implementing SRE (Site Reliability Engineering) teams, they were able to:

(1) substantially increase the velocity of features/services deployed to production

(2) improve service reliability in production

SRE Principles

Four main principles govern the SRE model and describe how organizations can implement it in their processes:

Embracing Risk

Service Level Objectives

Eliminating Toil

Monitoring Distributed Systems

Embracing Risk

Ideally, your service is available to your users all the time. Unfortunately, you will sometimes experience unplanned downtime (failures, etc.) or need to perform updates (planned downtime).

So you can't really reach 100% service availability. What you can aim for is a number close to it, generally expressed in nines: 99.9% (three nines), 99.99% (four nines), 99.999% (five nines).

Experience has shown that chasing the absolute highest service availability costs you agility (less frequent updates, fewer features, etc.).

The best way to optimize the overall cost of your service is to balance the risks, as illustrated in the following diagram:

dashboard image

The first step is to measure the service risk by establishing its uptime level. A simple formula to calculate the availability of the service:

availability = successful requests / total requests

The second step is to prioritize services by their impact on your business (revenue). A service offered to free users, for instance, will have less impact than a service offered to premium users.

The third step is to define the target level of availability. As illustrated in the following diagram, to clearly define the service risks, product owners and software teams can agree on the impact of the service being unavailable.

dashboard image

The table below lists availability levels and the corresponding allowed unavailability window per year and per day.

| Availability level | Allowed unavailability window (per year) | Allowed unavailability window (per day) |
|---|---|---|
| 90% | 36.5 days | 2.4 hours |
| 95% | 18.25 days | 1.2 hours |
| 99% | 3.65 days | 14.4 minutes |
| 99.9% | 8.76 hours | 1.44 minutes |
| 99.99% | 52.6 minutes | 8.64 seconds |
| 99.999% | 5.26 minutes | 0.86 seconds |
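
To see where these windows come from, here is a minimal sketch in plain Python (no external dependencies) that derives the allowed unavailability per year and per day from an availability target:

```python
# Convert an availability target into allowed downtime windows.
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY

def allowed_downtime(availability: float) -> tuple[float, float]:
    """Return the allowed unavailability in (minutes/year, minutes/day)."""
    error_ratio = 1 - availability
    return error_ratio * MINUTES_PER_YEAR, error_ratio * MINUTES_PER_DAY

for level in (0.90, 0.95, 0.99, 0.999, 0.9999, 0.99999):
    per_year, per_day = allowed_downtime(level)
    print(f"{level:.3%}  {per_year:9.2f} min/year  {per_day:8.3f} min/day")
```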

Let's understand it better with real use cases.

Use Case 1: A service that serves 5 million requests/day with a target availability level of 99.9% per day can tolerate up to 5,000 errors (or failed requests) per day and still meet the service level target.

Use Case 2: A service that serves 2 million requests/day with a target availability level of 99.99% per day can tolerate up to 200 errors (or failed requests) per day and still meet the service level target.
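
As a quick sanity check, here is the same arithmetic applied to the two use cases above, as a minimal sketch; the request volumes are simply the ones quoted in the text:

```python
def error_budget(total_requests: int, availability_target: float) -> int:
    """Maximum number of failed requests that still meets the target (rounded)."""
    return round(total_requests * (1 - availability_target))

# Use case 1: 5 million requests/day with a 99.9% daily target.
print(error_budget(5_000_000, 0.999))    # -> 5000 failed requests allowed
# Use case 2: 2 million requests/day with a 99.99% daily target.
print(error_budget(2_000_000, 0.9999))   # -> 200 failed requests allowed
```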

Service Level Objectives

In addition to the SLA (Service Level Agreement), which is an indicator better suited to business requirements, SRE introduces two types of metrics to measure the service:

SLI (Service Level Indicator), a specific metric that measures one aspect of the service

SLO (Service Level Objective), a target value (or range of values) for an SLI over a given period

There are two categories of SLI metrics, user-facing systems and storage systems; a short sketch after these lists shows how such indicators might be computed.

dashboard image

User-facing systems

the time it takes for the service to respond to an HTTP request (1 ms, 100 ms, 500 ms)

the proportion of failed requests out of all requests (e.g. 2% of HTTP requests failing, 10%, ...)

the number of requests or queries per second (QPS) the system can handle (1 million requests/s, 5 million requests/s)


Storage systems

whether or not the data is accessible

the time it takes to read/write data (in terms of IOPS)

the ability to store data without disruption, errors, or failures
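
As an illustration, here is a minimal sketch of how the user-facing SLIs above might be computed from a window of request records. The `Request` record and its fields are assumptions for the example, not a specific tool's API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code

def compute_slis(requests: list[Request], window_seconds: float) -> dict:
    """Compute simple user-facing SLIs over a window of request records."""
    total = len(requests)
    failed = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    p99_idx = min(total - 1, int(0.99 * total)) if total else 0
    return {
        "latency_p99_ms": latencies[p99_idx] if total else 0.0,  # response time
        "error_ratio": failed / total if total else 0.0,          # failed / all requests
        "qps": total / window_seconds,                            # throughput
    }

# Example: two requests observed over a one-second window.
print(compute_slis([Request(42.0, 200), Request(950.0, 503)], window_seconds=1.0))
```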

How to define an SLO?

You can follow the three steps illustrated in the following diagram to define your service level objectives.

dashboard image
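
The exact steps are shown in the diagram above. As a rough illustration of the end result, an SLO can be thought of as an SLI paired with a target and a measurement window; the sketch below is hypothetical and not a standard format:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str          # the indicator being measured
    target: float     # the objective for that indicator
    window_days: int  # the period over which it is evaluated

# Hypothetical objectives for a web service.
availability_slo = SLO(sli="successful requests / total requests",
                       target=0.999, window_days=30)
latency_slo = SLO(sli="share of HTTP requests served in under 300 ms",
                  target=0.95, window_days=30)
```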

Eliminating Toil

While running applications in production, you may have to perform operations that are manual, repetitive, and automatable, such as running a script by hand to carry out a routine task.

Toil refers to this kind of operation, which tends to grow rapidly as your service scales in production, as illustrated in the following diagram.

dashboard image
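
As a trivial illustration of what eliminating toil can look like, here is a minimal sketch that replaces a hypothetical manual, repetitive task (purging old log files) with a script that can be scheduled, for example from cron; the path and retention period are assumptions:

```python
import time
from pathlib import Path

def purge_old_logs(log_dir: str, max_age_days: int = 7) -> int:
    """Delete *.log files older than max_age_days and return how many were removed."""
    cutoff = time.time() - max_age_days * 86_400
    removed = 0
    for path in Path(log_dir).glob("*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed

if __name__ == "__main__":
    # Run by a scheduler instead of a human, so the task no longer generates toil.
    print(f"removed {purge_old_logs('/var/log/myapp')} old log file(s)")
```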

Google's approach is to ensure that SREs spend at least 50% of their time on engineering work that reduces toil, as illustrated in the following diagram:

dashboard image

Time allocated to reducing toil is an investment in the future of your platform operations. Not allocating enough of it lets toil grow until it consumes most of your time, leaving less room for other tasks (and increasing the delay in fixing issues, developing features, etc.).

Monitoring Distributed Systems

The systems to monitor are getting more and more complex. The old way of monitoring, simply getting alerted when something breaks, does not work for large distributed systems.

The evolution of software architecture (from monoliths to microservices) has brought new challenges, and monitoring these large-scale distributed systems efficiently requires a few rules.

There are four crucial metrics, often called the golden signals, that help you track how healthy your system is (a short instrumentation sketch follows this list):

Latency: how long does your system take to respond to a request? It is usually expressed in milliseconds.

Traffic: how many requests is your system handling? For a web application, for instance, this might be the number of HTTP requests per second.

Errors: how many incoming requests fail? How many HTTP errors (5xx, etc.)? This indicator helps you determine how reliable your system is.

Saturation: how full is your system? This indicator tells you whether your system's capacity is overloaded (CPU, memory, or disk at their limits).
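
To make these signals concrete, here is a minimal sketch using the Python prometheus_client library; the metric names and the handle_request wrapper are illustrative assumptions, not a prescribed layout:

```python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Latency: how long requests take to be served.
REQUEST_LATENCY = Histogram("http_request_duration_seconds", "Request latency")
# Traffic: how many requests the system is handling.
REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests")
# Errors: how many requests fail.
REQUEST_ERRORS = Counter("http_request_errors_total", "Failed HTTP requests")
# Saturation: how full the system is (set elsewhere, e.g. from CPU usage).
CPU_SATURATION = Gauge("cpu_usage_ratio", "CPU usage as a 0-1 ratio")

def handle_request(process) -> None:
    """Wrap a request handler and record latency, traffic, and errors."""
    REQUEST_COUNT.inc()
    start = time.time()
    try:
        process()  # the actual request handling (assumed to be provided)
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose the metrics on /metrics for scraping
```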

A few rules are important to keep in mind to maximize the effectiveness of incident response, and to keep SRE teams from being constantly under pressure.

Automate incident response as much as possible: you may receive several alerts for incidents that require the same manual resolution. Make a list of these incidents and check whether they are eligible for automation (a script or any other automated task) to avoid triggering your alerting system unnecessarily.

Keep things simple: there is no reason to track a metric for everything. Minor incidents that do not affect the user experience, or that have little impact on service availability, can be removed from the alerting system. If your team receives too many alerts in a short period, it creates alert fatigue.

Dashboards for what really matters: too many indicators make a dashboard inefficient. Your dashboard should remain as simple and clear as possible, so that an incident, a pattern, or a problem is easy to spot.