Part 1.1- Understanding & Defining Requirements

Availability Targets of a System

  1. Service Level Indicators (SLI)

The Google SRE Handbook defines an SLI as “a carefully defined quantitative measure of some aspect of the level of service that is provided.” SLIs quantify the quality of a service and give the service provider feedback against its targets. Product-centric SLIs track the actions that have a significant effect on the customer experience.

SRE teams typically define SLIs in two phases when setting them up to measure the reliability of a service.

  • They choose the SLIs that have the most effect on consumers.
  • They establish SLIs that have a direct impact on the service’s availability, latency, or performance.

Formula: SLI = (Successful Events / Valid Events) * 100. As the SRE books state, targeting 100% is impractical, and the target becomes even harder to reach in complex distributed systems with many components. Experienced SRE practitioners go further and argue that targeting 100% is detrimental in many respects.
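The SLI formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name sli_percent and the request counts are assumptions made up for the example.

```python
def sli_percent(successful_events: int, valid_events: int) -> float:
    """Return the SLI as the percentage of valid events that succeeded."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= successful_events <= valid_events:
        raise ValueError("successful_events must be between 0 and valid_events")
    return 100 * successful_events / valid_events


# Hypothetical counts: 100,000 valid requests, 99,950 of them successful.
availability_sli = sli_percent(99_950, 100_000)
print(f"availability SLI: {availability_sli:.2f}%")
```

Only valid events go into the denominator, so requests the service should never count (for example, traffic from a load test) need to be filtered out before calling the function.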

Number of Indicators: It is usually good to start with a few SLIs to test. The SLIs that you choose to measure should directly correspond to the system that is being set up. A customer-facing application should always prioritize the metrics that affect the customer’s experience (availability, latency, ease of use) rather than only the internals of the system (CPU percentage, memory used, disk read errors, etc.).

  2. Service Level Objectives (SLO)

According to the Google SRE Handbook, an SLO specifies a target level for the reliability of your service. A Service Level Objective is a binding target for a collection of SLIs; in other words, it is the objective of holding an SLI at its target value over a longer period of time, such as a month, a quarter, or even a year.

Example: If the SLI of interest is the “availability” of the system and the SLO needs to be calculated for a month, then the availability SLI needs to be calculated for the required time span. A formula suggested in the SRE handbook can be used to derive the availability for a specified duration.

sum(rate(http_requests_total{host="api", status!~"5.."}[30d])) / sum(rate(http_requests_total{host="api"}[30d]))
src: https://sre.google/workbook/implementing-slos/#none

When you first define the SLO of a system, it is good practice to start with a window as short as a day. Once you understand your system’s behavior and complexities better, the window can be extended to a week or a month.
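The same ratio-over-a-window idea can be sketched outside the monitoring system. The Python below is an illustration under stated assumptions: the function names, the per-day (good, total) count layout, and the sample numbers are all hypothetical, and the window length is simply the number of days supplied.

```python
from typing import Iterable, Tuple


def window_availability(daily_counts: Iterable[Tuple[int, int]]) -> float:
    """Availability over the whole window, as a fraction between 0 and 1.

    Each item is (good_requests, total_requests) for one day.
    """
    good = 0
    total = 0
    for good_requests, total_requests in daily_counts:
        good += good_requests
        total += total_requests
    if total == 0:
        raise ValueError("no requests recorded in the window")
    return good / total


def meets_slo(daily_counts: Iterable[Tuple[int, int]], slo_target: float) -> bool:
    """True when the availability over the window reaches the SLO target."""
    return window_availability(daily_counts) >= slo_target


# Hypothetical three-day window with one bad, low-traffic day.
sample_window = [(9_900, 10_000), (500, 1_000), (10_000, 10_000)]
print(f"availability: {window_availability(sample_window):.4f}")
print("meets 97% SLO:", meets_slo(sample_window, 0.97))
```

Summing the counts before dividing weights each day by its traffic, which matches the ratio of sums in the query above. Averaging the daily percentages instead would let a low-traffic day count as much as a busy one and give a different number.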

Note: An SLI, and hence the SLO built on it, is only as accurate as the data behind it, so the system’s observability setup should capture the underlying events as precisely as possible.

3. Service-level agreements (SLA) are contracts on service availability and performance. The Google SRE Handbook defines an SLA as “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”

An SLA states what the service provider owes its users if the service fails to meet the promised SLO, for example a refund or free credits. SLAs help build trust and transparency between the service and its users. They are similar to SLOs, except that they are used externally rather than internally.

4. Error Budgets are “a quantitative measurement shared between the product and SRE teams to balance innovation and stability.”

In simple terms, it is the amount of risk you’re willing to accept in order to add new features. Monitoring measures the reliability your service actually delivers, whereas the SLO specifies the goal you must meet. The difference between the two is the error budget you have left, and it determines how much risk you can still take by pushing new releases. The higher the error budget, the more features you can push. However, a larger budget also means a lower SLO than a highly reliable production-grade system would set.

Formula: Error Budget = 1 - Availability SLO

If the SLO is 97 percent, then the Error Budget = 100% - 97% = 3%

If your service gets 100,000 requests in four weeks and your service level objective (SLO) is 97 percent, your error budget is 3,000 failed requests (3% of 100,000).
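The error-budget arithmetic for 100,000 requests and a 97 percent SLO is 100,000 × (1 - 0.97) = 3,000 failed requests. A minimal Python sketch of that calculation follows; the function names and the figure of 1,200 failures already observed are assumptions for illustration only.

```python
import math
from fractions import Fraction


def error_budget_fraction(slo_target: float) -> Fraction:
    """Error budget as an exact fraction: 1 - SLO."""
    # Going through str() keeps 0.97 as exactly 97/100 and avoids float drift.
    slo = Fraction(str(slo_target))
    if not 0 < slo <= 1:
        raise ValueError("slo_target must be in the range (0, 1]")
    return 1 - slo


def error_budget_requests(total_requests: int, slo_target: float) -> int:
    """Number of requests allowed to fail in the window, rounded down."""
    return math.floor(error_budget_fraction(slo_target) * total_requests)


def remaining_budget(total_requests: int, failed_requests: int, slo_target: float) -> int:
    """Failures still affordable; a negative result means the budget is spent."""
    return error_budget_requests(total_requests, slo_target) - failed_requests


budget = error_budget_requests(100_000, 0.97)
left = remaining_budget(100_000, 1_200, 0.97)  # 1,200 failures is a made-up figure
print(f"error budget: {budget} failed requests, remaining: {left}")
```

A team could compare the remaining budget against zero before a release: while it is positive there is room to take risk, and once it goes negative the SLO for the window has already been missed.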

References

  1. Google SRE Workbook: https://sre.google/workbook/implementing-slos