Part 1.2 - SLI Specification & Implementation Plan

SLI Specification

The SLI specification helps us define the Service Level Indicators that are of interest to us and that we will track closely. As stated in the previous article, indicators that impact the end user’s experience matter more than indicators that are internal to the system’s performance. This does not mean that we will not define SLIs for the system’s performance at the service level; those are handled internally and treated differently from the primary, user-facing ones. For a system with multiple components, the indicator we choose would be a composite value of the same indicator across the individual components that complete the user’s request.
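For example, if a user’s checkout request passes through a web tier that returns a successful response 99.9% of the time and an order service that succeeds 99.5% of the time, a composite availability indicator for that request path would be roughly 0.999 × 0.995 ≈ 0.994, i.e. about 99.4% (illustrative figures only, assuming the two components fail independently).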

The following video from New Relic gives you a perspective on real-world SLIs and SLOs, both internal and external – https://www.youtube.com/watch?v=3Aem8DAGyAk

Definition: The assessment of service outcome that you think matters to users, independent of how it is measured.

Helper Steps in completing the Specification (excerpt from Google’s SRE workbook)

It is always a good practice to use the following helper steps when trying to come up with the SLI specification for a large distributed system.

1. Choose one application for which you want to define SLOs. If your product comprises many applications, you can add those later.

2. Decide clearly who the “users” are in this situation. These are the people whose happiness you are optimizing.

3. Consider the common ways your users interact with your system—common tasks and critical activities.

4. Draw a high-level architecture diagram of your system; show the key components, the request flow, the data flow, and the critical dependencies. Categorizing the components makes it easier to identify the corresponding SLIs, and in the majority of cases this diagram serves as a precursor to the specification.

The system we will build is an e-commerce application that runs in the Azure Public Cloud. It contains components that receive the user’s request and provide a response. The actual processing of the request is internal to the application, and the implementation specifics have been omitted for brevity. This system can be categorized as a “Request-Driven” system.

Helper SLIs for a Request Driven System

Type of service | Type of SLI | Description
Request-driven | Availability | The proportion of requests that resulted in a successful response.
Request-driven | Latency | The proportion of requests that were faster than some threshold (a sample query is sketched after this table).
Request-driven | Quality | If the service degrades gracefully when overloaded or when backends are unavailable, measure the proportion of responses that were served in an undegraded state. For example, if the User Data store is unavailable, the shopping app is usable but not all features would be available.
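For the latency SLI above, the proportion of requests faster than a threshold can be computed from a request-duration histogram. The sketch below uses Prometheus query syntax with an assumed http_request_duration_seconds histogram, a host label and a 300 ms threshold; the metric and label names are placeholders for this illustration, not taken from the e-commerce application described above.

sum(rate(http_request_duration_seconds_bucket{host="api", le="0.3"}[7d])) / sum(rate(http_request_duration_seconds_count{host="api"}[7d]))

The numerator counts requests that completed within the threshold bucket over seven days, and the denominator counts all requests in the same window, giving the proportion directly.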

SLI Implementation Plan

We have categorized the application and identified the indicator that will matter the most: “Availability”. The SLI implementation plan discusses the details of the method we will use to capture the necessary metrics at one or more points in the system so that the SLI can be calculated.

Monitoring Solutions

To determine successful and failed requests in a system that contains a web tier (an app deployed on Virtual Machine Scale Sets) behind a layer-7 load balancer and a SQL database deployed on Virtual Machines, we would need to use the monitoring capabilities of Azure Monitor and Azure Application Insights.

Calculating Availability at the Application Layer

Application Insights provides an in-portal experience for examining the “Requests”, “Failures” and “Dependency Failures” of an application that has been instrumented. The first two metrics should be enough to determine the availability of the application. We can use the following formula (expressed in Prometheus query syntax) to get the ratio of successful requests to total requests.

sum(rate(http_requests_total{host="api", status!~"5.."}[7d])) / sum(rate(http_requests_total{host="api"}[7d]))

Note: For the above-mentioned formula to work as is, the application should be coded so that request failures due to code-level issues, request timeouts or dependency failures surface as errors with a 500 status code. If the app is designed to use the appropriate HTTP error codes (4xx and 5xx), then the formula needs to be modified to consider all the error codes that count as failures, as sketched below.
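As an illustration, here is one possible modification, again in Prometheus query syntax; the metric name http_requests_total, the host label, and the choice to also count 429 responses as failures are assumptions made for this sketch, not part of the original design.

sum(rate(http_requests_total{host="api", status!~"5..", status!="429"}[7d])) / sum(rate(http_requests_total{host="api"}[7d]))

The numerator keeps only requests whose status is neither a 5xx nor a 429, so any code the team classifies as a failure can be excluded the same way.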

Downsides of relying on one source of Monitoring Data

The potential downside of relying only on Application Insights to calculate the availability SLI and SLO is that an HTTP request that starts from the user’s browser can fail at components before it ever reaches the application layer, and those failures need to be considered as well.

  • The Traffic Manager could go down for a brief period.
  • The Azure DNS service could go down.
  • The load balancers at L7 and L4 could fail.

To handle failures in these scenarios, we will have to monitor all the components in the user flow (this concept will be discussed in a later post). Failures at one or more of these components are to be considered while calculating the availability SLI. The following measurement sources are suggested in the SRE workbook; a black-box probe example is sketched after the list.

  • Application server logs
  • Load balancer monitoring
  • Black-box monitoring
  • Client-side instrumentation
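As one illustration of black-box monitoring, a probe-based availability ratio can be computed over the same seven-day window. The sketch below assumes a Prometheus blackbox exporter publishing the probe_success metric and a hypothetical probe target; it is one possible approach, not the method prescribed by the SRE workbook.

avg_over_time(probe_success{job="blackbox", instance="https://shop.example.com/"}[7d])

Because probe_success is 1 for a successful probe and 0 otherwise, its 7-day average is the proportion of successful probes, which can be compared against the application-layer availability computed earlier.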

References

  1. Google SRE Workbook – https://sre.google/workbook/implementing-slos