The traditional approach is to have a team of sysadmins responsible for deploying services to production and maintaining them.
As illustrated in the following diagram, there are two separate teams: the software team, which develops the applications, and the sysadmin team, which deploys and maintains the applications in production.
The problem with this approach is that as more services are integrated or developed, the sysadmin team needs to grow exponentially, incurring an increase in direct costs.
There is also an indirect cost at the organization level: most of the time, software teams and sysadmin teams have different objectives and priorities, and use different terms to describe the same event.
Over the years, Google provided more and more services and directly experienced the limitations of the traditional approach.
With the implementation of SRE (Site Reliability Engineering) teams, they were able to:
(1) substantially increase the velocity of features/services deployed in production
(2) improve service reliability in production
Four main principles govern the SRE model and describe how organizations can implement it in their processes:
Embracing Risk
Service Level Objectives
Eliminating Toil
Monitoring Distributed Systems
Ideally, your service is available to your users all the time. Unfortunately, you may sometimes experience unplanned downtime (failures, etc.) or need to perform updates (planned downtime).
So you can't really have 100% service availability. What you can aim for is a number close to it, generally expressed in nines: 99.9% (three nines), 99.99% (four nines), or 99.999% (five nines).
Experience has shown that trying to reach the absolute highest service availability costs you in terms of agility (less frequent updates, fewer features, etc.).
The best way to improve your overall service cost is by balancing the risks, as illustrated in the following diagram:
The first step is to measure the service risk by determining the uptime level. The simplest formula to calculate the availability of the service is:
availability = successful requests / total requests
In a second step, you can prioritize the impact of the services on your business (revenue). A service for free users, for instance, will have less impact than a service offered to premium users.
In a third step, you can define the target level of availability. As illustrated in the following diagram, to clearly define the service risks, product owners and software teams can agree on the impact if the service is unavailable.
Below is a table showing availability levels and the corresponding allowed unavailability windows per year and per day, to give you a good understanding.
Availability Level | Allowed unavailability window (per/year) | Allowed unavailability window (per/day) |
---|---|---|
90% | 36.5 days | 2.4 hours |
95% | 18.25 days | 1.2 hours |
99% | 3.65 days | 14.4 minutes |
99.9% | 8.76 hours | 1.44 minutes |
99.99% | 52.6 minutes | 8.64 seconds |
99.999% | 5.26 minutes | 0.87 seconds |
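The table's values follow directly from the availability percentage. A minimal sketch of the arithmetic (function name is illustrative):

```python
def downtime_windows(availability_pct):
    """Return (seconds_per_year, seconds_per_day) of allowed downtime."""
    unavailable_fraction = 1 - availability_pct / 100
    year_seconds = 365 * 24 * 3600
    day_seconds = 24 * 3600
    return unavailable_fraction * year_seconds, unavailable_fraction * day_seconds

# Reproduce the table rows:
for level in (90, 95, 99, 99.9, 99.99, 99.999):
    per_year, per_day = downtime_windows(level)
    print(f"{level}%: {per_year / 3600:.2f} h/year, {per_day / 60:.2f} min/day")
```

For example, 99% availability allows 864 seconds (14.4 minutes) of downtime per day.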
Let's understand it better with real use cases.
Use Case 1: A service that serves 5 million requests/day with a target availability level of 99.9% per day can tolerate up to 5,000 errors (or failed requests) per day and still meet the service level target.
Use Case 2: A service that serves 2 million requests/day with a target availability level of 99.99% per day can tolerate up to 200 errors (or failed requests) per day and still meet the service level target.
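Both use cases boil down to the same error-budget calculation. A small sketch (function name is an assumption):

```python
def error_budget(requests_per_day, availability_target_pct):
    """Maximum failed requests per day while still meeting the target."""
    return round(requests_per_day * (1 - availability_target_pct / 100))

# Use case 1: 5 million requests/day at 99.9%
print(error_budget(5_000_000, 99.9))   # 5000
# Use case 2: 2 million requests/day at 99.99%
print(error_budget(2_000_000, 99.99))  # 200
```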
In addition to the SLA (Service Level Agreement), which is an indicator more appropriate for business requirements, SRE introduces two types of metrics to measure the service:
SLI (Service Level Indicator), a specific metric used to measure the service
SLO (Service Level Objective), a target value or range agreed upon for an SLI
There are two categories of SLI metrics: user-facing systems and storage systems.
User-facing systems
the time it takes for the service to respond to an HTTP request (1 ms, 100 ms, 500 ms)
the proportion of failed requests out of all requests (2% failed HTTP requests per second, 10%, ...)
the number of requests or queries per second (QPS) the system can handle (1 million requests/s, 5 million requests/s)
Storage systems
whether or not the data is accessible
the time it takes to read/write data (often measured in terms of IOPS)
the ability to store information with no disruption, errors, or failures
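The user-facing SLIs above can be computed directly from request logs. A minimal sketch, assuming a hypothetical log of (latency in ms, HTTP status) pairs:

```python
# Hypothetical request log: (latency_ms, http_status) per request.
requests = [
    (12, 200), (250, 200), (40, 500), (95, 200),
    (130, 200), (8, 200), (310, 503), (60, 200),
]

total = len(requests)
failed = sum(1 for _, status in requests if status >= 500)

# SLI: error rate (failed requests / all requests)
error_rate = failed / total

# SLI: latency at a percentile, here the 50th (median)
latencies = sorted(ms for ms, _ in requests)
p50 = latencies[len(latencies) // 2]

# Availability, using the formula from earlier in the article
availability = (total - failed) / total

print(f"error rate: {error_rate:.1%}, p50 latency: {p50} ms, availability: {availability:.1%}")
```

In practice these numbers would come from your monitoring pipeline rather than an in-memory list, but the aggregation logic is the same.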
You can follow these three steps to define your service level objectives.
While running an application in production, you may have to perform operations that are manual, repetitive, and automatable, like running a script to handle some routine task.
Toil basically represents these operations, which tend to grow as your service in production scales, as illustrated in the following diagram.
Google's approach is to allocate 50% of SRE time to engineering work that reduces toil, as illustrated in the following diagram:
Time allocated to reducing toil is an investment in the future of your platform operations. Not allocating enough time to it can significantly increase the time you spend on toil alone, reducing the time available for other tasks (increasing delays in fixing issues, developing features, etc.).
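As a concrete illustration, consider a hypothetical recurring clean-up chore: instead of deleting stale files by hand whenever disk fills up, the task becomes a script run on a schedule. Paths and the age threshold here are illustrative assumptions:

```python
import time
from pathlib import Path

def clean_stale_files(directory, max_age_seconds):
    """Delete files older than max_age_seconds; return how many were removed."""
    removed = 0
    now = time.time()
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed += 1
    return removed

# Run from cron (or a scheduler) instead of doing the clean-up by hand, e.g.:
# clean_stale_files("/var/tmp/myapp", max_age_seconds=7 * 24 * 3600)
```

Once automated, the alert that used to page someone for this chore can be retired entirely.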
Monitoring systems today are getting more and more complex. The old way of conducting monitoring, getting alerted when something breaks, does not apply to large distributed systems.
The evolution of software architecture (from monolithic to microservices) has brought new challenges, and to monitor these large-scale distributed systems efficiently, there are a few rules to adopt.
There are four crucial metrics, known as the golden signals, that help you track how healthy your system is:
Latency: how long your system takes to respond to a request. It's usually expressed in milliseconds.
Traffic: how many requests your system is handling; for a web application, for instance, this might be how many HTTP requests per second.
Errors: how many incoming requests fail, e.g. how many HTTP errors (5xx, etc.). This indicator helps you determine how reliable your system is.
Saturation: this indicator determines whether your system's capacity is overloaded (maxed-out CPU, memory, or disk consumption).
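The four signals above can be sketched as simple aggregations over a window of request records. The sample window, CPU reading, and 80% saturation threshold below are all illustrative assumptions:

```python
# Hypothetical one-second window of (latency_ms, http_status) records,
# plus an assumed CPU utilization reading from the host (0.0 - 1.0).
window = [(15, 200), (220, 200), (45, 502), (90, 200), (30, 200)]
cpu_utilization = 0.82

latency_avg_ms = sum(ms for ms, _ in window) / len(window)  # latency
traffic_rps = len(window)                                   # traffic (requests/s)
error_count = sum(1 for _, s in window if s >= 500)         # errors
saturated = cpu_utilization > 0.80                          # saturation (assumed threshold)

print(latency_avg_ms, traffic_rps, error_count, saturated)
```

A real system would track latency as percentiles rather than an average, but the point is the same: four numbers, computed continuously, give a first picture of system health.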
Some rules are important to keep in mind to maximize the effectiveness of incident response, but also to prevent SRE teams from being constantly under pressure.
Automate incident response as much as possible : You may receive several alerts for incidents that require the same manual resolution. Make a list of these incidents and determine whether they are eligible for automation (a script or any other automated task) to prevent triggering your alerting system unnecessarily.
Keep things simple : There is no reason to track a metric for everything. Minor incidents that do not affect the user experience, or that have a lower impact on service availability, might be removed from the alerting system. If your team receives too many alerts in a short period, this creates alert fatigue.
Dashboards for what really matters : Too many indicators on your dashboard can be inefficient. Your dashboard should remain as simple and clear as possible so you can easily spot an incident, a pattern, or a problem.
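The automation rule above can be sketched as a dispatch table that routes known, repetitive alerts to automated handlers and pages a human only for everything else. Alert names and handlers here are illustrative assumptions:

```python
def restart_worker(alert):
    return f"restarted worker for {alert['service']}"

def expand_disk(alert):
    return f"expanded disk for {alert['service']}"

# Known, repetitive incidents mapped to their automated remediation.
AUTOMATED_RESPONSES = {
    "worker_crashed": restart_worker,
    "disk_nearly_full": expand_disk,
}

def handle_alert(alert):
    handler = AUTOMATED_RESPONSES.get(alert["type"])
    if handler:
        return handler(alert)  # resolved without paging anyone
    return f"paging on-call for {alert['type']}"

print(handle_alert({"type": "worker_crashed", "service": "billing"}))
print(handle_alert({"type": "unknown_error", "service": "billing"}))
```

Every alert type moved into the dispatch table is one less interruption for the on-call engineer, which directly supports the "keep things simple" rule as well.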