Site Reliability Engineering (SRE) is the practice of continuously improving the reliability, efficiency, and agility of complex software systems. Google’s SRE team has made the lessons it has learned available to everyone through the Google SRE book. This document has two parts: Part 1 describes what SRE is, how it’s implemented at Google, and how it differs from traditional engineering practices; Part 2 contains specific recommendations to make your systems more resilient and predictable.
Key components of SRE
Testing, monitoring, and alerting: These three elements underpin a stable service. They’re also relatively easy to implement, since they often rely on existing infrastructure. If you already have tests, monitoring, and alerting in place, that’s a good starting point for your SRE journey. Automating these processes is as important as adopting them in the first place. Your monitoring system is key to detecting problems quickly; your alerting system should turn what monitoring detects, including issues your tests did not catch, into notifications that someone can act on. It may take some time to figure out how best to test your application, but don’t let that stop you from introducing testing early in your SRE process.
Service level objectives
SLOs are an important part of SRE, but they are not something you can tack on at will. Think about your SLOs in three parts: availability, latency, and error rate. What is your availability target? How quickly should the service respond to a request? And how often does a given service or application need to succeed? As with any project (SRE included), it’s essential that everyone involved understands these goals. If a problem occurs, you need concrete metrics for how bad it is and what went wrong. If problems go unaddressed and no one knows why, the service is of little use as far as customers are concerned.
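The three-part view above can be made concrete with a small check. The following is a minimal sketch, not a prescribed implementation: the names `SloTargets`, `Request`, and `evaluate` are hypothetical, availability is assumed to be measured from periodic up/down probes, and latency and error rate are assumed to be measured per request.

```python
from dataclasses import dataclass


@dataclass
class SloTargets:
    """Hypothetical targets for the three SLO parts named in the text."""
    availability: float      # minimum fraction of probes that must find the service up
    latency_ms: float        # latency threshold in milliseconds
    latency_fraction: float  # minimum fraction of requests at or under the threshold
    max_error_rate: float    # maximum fraction of requests allowed to fail


@dataclass
class Request:
    ok: bool           # True if the request succeeded
    latency_ms: float  # observed response time in milliseconds


def evaluate(targets: SloTargets, probes_up: list[bool],
             requests: list[Request]) -> dict[str, bool]:
    """Return, for each SLO part, whether the observed data meets the target."""
    availability = sum(probes_up) / len(probes_up)
    fast = sum(r.latency_ms <= targets.latency_ms for r in requests) / len(requests)
    error_rate = sum(not r.ok for r in requests) / len(requests)
    return {
        "availability": availability >= targets.availability,
        "latency": fast >= targets.latency_fraction,
        "error_rate": error_rate <= targets.max_error_rate,
    }


# Illustrative data only: 100 successful probes, 19 fast successes, 1 slow failure.
targets = SloTargets(availability=0.99, latency_ms=100.0,
                     latency_fraction=0.9, max_error_rate=0.01)
probes = [True] * 100
requests = [Request(ok=True, latency_ms=40.0)] * 19 + [Request(ok=False, latency_ms=900.0)]
report = evaluate(targets, probes, requests)
```

Reporting each part separately makes it clear which goal was missed, which is the kind of concrete metric the paragraph asks for when a problem occurs.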
Quantitative metrics for objective evaluation
A good way to figure out whether your service is working well is to use metrics. After all, what’s a better measure of success than hard data? Many engineers start their projects with instrumentation and monitoring in mind, so it’s generally easier than you might think to collect data about how users interact with your service. The real challenge is interpreting that data. Interpreted well, metrics reveal how your users are interacting with your project and where improvements are needed, including patterns that were previously undiscovered or unarticulated. They also let you determine whether problems identified during site reliability engineering exercises have actually been fixed.
Understanding SLOs from the customer perspective
The metric we monitor most closely at Google is latency. Customers expect their searches and other tasks performed on our platform to return a result quickly, and a target of 100 milliseconds or less holds true for about 90% of all requests. While Google’s infrastructure helps us keep latency this low for most requests, individual events can fluctuate significantly, sometimes by more than 10x from the mean response time. A good example of when SLOs are tested is during major weather events, like Hurricane Sandy in 2012, which caused service interruptions throughout New York City. Our tools helped us identify locations with outages faster than would have been possible otherwise, meaning we could initiate mitigation efforts much sooner than if we had relied solely on human reporting.
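The two latency figures in this paragraph, the share of requests served within 100 milliseconds and the events that exceed 10x the mean, can both be computed from a list of response times. This is a minimal sketch with made-up sample data; `fraction_within` and `far_above_mean` are hypothetical helper names.

```python
from statistics import mean


def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests whose latency is at or under the threshold."""
    return sum(l <= threshold_ms for l in latencies_ms) / len(latencies_ms)


def far_above_mean(latencies_ms: list[float], factor: float = 10.0) -> list[float]:
    """Latencies that exceed `factor` times the mean response time."""
    avg = mean(latencies_ms)
    return [l for l in latencies_ms if l > factor * avg]


# Illustrative sample: 99 requests at 20 ms and one slow request at 5000 ms.
sample = [20.0] * 99 + [5000.0]
share_fast = fraction_within(sample, threshold_ms=100.0)
spikes = far_above_mean(sample)
meets_target = share_fast >= 0.90  # the 90% figure quoted in the text
```

Note that a single slow request barely changes the share of fast requests while pulling the mean up noticeably, which is one reason latency targets are usually stated as a fraction of requests rather than as an average.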
Objectives must be flexible and adaptive to change
A motto of SRE is “everything scales,” but that means different things at different scales. At small scales, everything may need to scale in order for a system or service to be useful; at very large scales, being able to not worry about some components scaling becomes valuable. The goal of SRE is always to maximize an organization’s ability to improve and evolve its software, which means seeking out as much knowledge about users as possible and figuring out ways that developers can respond with greater speed than would otherwise be possible.
Defining SLOs for GCP services
Google’s technical infrastructure can be categorized into four service levels: Tier 0, Tier 1, Tier 2, and Tier 3. Each level is composed of many services, each with its own SLO. SLOs define what constitutes an issue and how it should be handled in a particular environment. The lower levels (Tier 0 and Tier 1) are primarily used for mission-critical infrastructure with stringent availability requirements; all issues must be resolved within 30 minutes. The next tier (Tier 2) involves services that are not quite as sensitive; issues should be resolved within four hours, with up to eight hours allowed if necessary. Finally, Tier 3 has a defined period of 12 hours to resolve issues.
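The resolution windows described above map naturally onto a lookup table. The sketch below uses only the figures stated in the text; the names `RESOLUTION_TARGETS`, `TIER2_HARD_LIMIT`, `resolution_deadline`, and `is_overdue` are hypothetical and do not correspond to any real Google tooling.

```python
from datetime import datetime, timedelta

# Resolution targets per tier, taken from the figures in the text.
RESOLUTION_TARGETS = {
    0: timedelta(minutes=30),
    1: timedelta(minutes=30),
    2: timedelta(hours=4),
    3: timedelta(hours=12),
}

# The text allows Tier 2 issues up to eight hours if necessary.
TIER2_HARD_LIMIT = timedelta(hours=8)


def resolution_deadline(tier: int, opened_at: datetime) -> datetime:
    """Time by which an issue opened at `opened_at` should be resolved."""
    return opened_at + RESOLUTION_TARGETS[tier]


def is_overdue(tier: int, opened_at: datetime, now: datetime) -> bool:
    """True if the issue has passed its tier's resolution target."""
    return now > resolution_deadline(tier, opened_at)


# Illustrative usage: the same five-hour-old issue is overdue in Tier 2 but not Tier 3.
opened = datetime(2024, 1, 1, 12, 0)
later = datetime(2024, 1, 1, 17, 0)
tier2_overdue = is_overdue(2, opened, later)
tier3_overdue = is_overdue(3, opened, later)
```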
The benefit of SLOs is better communication with customers
The ideal service doesn’t mean anything unless it’s measured, managed, and monitored. SLOs are one way of measuring both internal service quality and external client expectations. Whether you’re writing an SLA with a customer or simply setting internal performance targets, they let you define your service goals and make them visible. In addition, SLOs can be used as part of a proactive monitoring strategy by defining an actionable metric that triggers mitigation when it is breached. An example is a database latency SLO set at 50 ms: if latency stays above 50 ms for several consecutive periods (typically 5 minutes each), an incident ticket is generated in your incident tracking system for further investigation.
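The database latency example can be sketched as a streak counter over measurement periods. This is an illustration under stated assumptions: each list entry is the latency measured for one 5-minute period, "several" is taken to mean three consecutive periods, and `should_open_ticket` is a hypothetical name rather than part of any incident tracking system.

```python
def should_open_ticket(period_latencies_ms: list[float],
                       slo_ms: float = 50.0,
                       consecutive: int = 3) -> bool:
    """Return True if latency exceeded the SLO for `consecutive` periods in a row.

    Each entry is assumed to be the latency measured for one 5-minute period.
    The default of three consecutive periods is an assumption for illustration.
    """
    streak = 0
    for latency in period_latencies_ms:
        # Extend the streak on a breach, reset it on a healthy period.
        streak = streak + 1 if latency > slo_ms else 0
        if streak >= consecutive:
            return True
    return False


# Illustrative usage: three breaches in a row trigger a ticket, isolated ones do not.
sustained = should_open_ticket([40.0, 60.0, 70.0, 80.0])
intermittent = should_open_ticket([60.0, 40.0, 60.0, 40.0, 60.0])
```

Requiring consecutive breaches rather than a single one keeps brief spikes from generating tickets, at the cost of reacting a few periods later to a sustained problem.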
The final word on defining SLOs – Balancing quality, cost, performance and the needs of customers
A commonly held belief among site reliability engineers (SREs) is that service level objectives (SLOs) should be designed to balance four key factors: cost, performance, quality, and customer needs. However, a recent study by Healey and colleagues found that while many SREs agreed that SLOs are associated with these four factors, only around half of respondents felt that all four were captured in their own SLOs. To explore which goals fall outside classic SLOs but are still considered important by SRE teams, Healey and colleagues surveyed 200 participants from North America at companies that primarily offer information technology services via cloud computing platforms. The survey asked where classic SLO attributes lie in relation to these other dimensions.