System Performance Baselining
Now that the application is deployed to the dev environment with the necessary monitoring set up, the next step is to determine the baseline performance of the app with respect to the SLI metrics. Because the system we have considered is a customer-facing web app, we will need to determine the availability of the app using the status of the requests. If we already know the number of customers that will use the app on a regular day, that number can be represented as 1x.
We will start by creating a synthetic load on the application at only 1x. Any load-testing tool can be used for this purpose; Apache JMeter is a widely used option. The steps in creating the synthetic load are:
- Start with all the common and possible user flows. This can be an entire flow, from the user’s login through product checkout and payment, or just a product catalog search
- Perform these user flows in a browser using a sample/test user account, and let JMeter record the sequence of steps that make up each flow
- JMeter helps record each user flow as a separate action that can be replayed
If, for example, the total number of users expected to use our app on a regular day is ~5000/hour, then a combination of critical user flows (all components need to be available, with no fallback option) and non-critical user flows (such as displaying the products from a specific category) can be run from JMeter, simulating the 5000 users over a duration of one hour.
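To make the 1x arithmetic concrete, here is a minimal sketch of how the hourly target could be split across user flows before configuring the load tool. The flow names and the 40/60 split are assumptions chosen for illustration; they do not come from the repo or from JMeter itself.

```python
from dataclasses import dataclass


@dataclass
class FlowShare:
    name: str
    share: float    # fraction of users running this flow (assumed value)
    critical: bool  # True when the flow has no fallback option


def plan_load(users_per_hour: int, flows: list[FlowShare]) -> dict[str, dict[str, float]]:
    """Split an hourly user target across flows and derive per-minute pacing."""
    total_share = sum(flow.share for flow in flows)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("flow shares must add up to 1.0")
    plan = {}
    for flow in flows:
        users = users_per_hour * flow.share
        plan[flow.name] = {
            "users_per_hour": users,
            "users_per_minute": users / 60,
        }
    return plan


# Hypothetical mix for the 1x load of ~5000 users/hour.
flows = [
    FlowShare("login_to_checkout", 0.4, critical=True),
    FlowShare("catalog_search", 0.6, critical=False),
]
for name, numbers in plan_load(5000, flows).items():
    print(name, numbers)
```

The per-minute figures are only a starting point for pacing each recorded flow; the real mix should come from whatever usage data is available for the app.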
I have captured the process and a sample screen in the repo – https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/SystemReliabilityBaseline
The monitoring setup that we have in place should be able to provide us with the platform and infrastructure metrics along with the application performance metrics. We will need these metrics to assess the reliability of the application for the simulated 1x load. At this point you may be wondering about a general SRE suggestion that we discussed in an earlier article: “We would be focused more on the metrics that impact the end-user experience rather than the internal system-level metrics”. However, we should be mindful that when a system component's metrics indicate a probable failure, going beyond that point would ultimately cause the component to fail, and that failure cascades into failed requests, which impacts the Availability SLI.
Example: Assume that we have the following configuration set up for the VM scale sets.
- Initial count of VM scale set instances = 3
- Minimum permissible instance count = 2
- Scale-out action on the condition that "If CPU percentage on 60% of the instances goes beyond 80%, scale out by 1 instance"
- Maximum permissible instance count = 5
Suppose that during the 1x load test, due to the nature of the API operations, 2 of the instances exceed 80% CPU utilization when we have processed only 30% of the total user requests in that hour (~1500). The scale-out can then continue until the total number of instances reaches 5. There are 2 possibilities from this point:
- A small number of requests fail because 4 out of 5 web servers are overwhelmed and the application gateway does not have any more available instances to re-route to
- All 5000 requests are processed (assuming that there were no application-level failures), but we would still have used up all the permissible instances
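The scale-out condition in this scenario can be sketched as a small function. This is only an illustration of the rule as stated in the example configuration, not how the Azure autoscale engine is implemented; the `ScaleRule` and `evaluate_scale_out` names are hypothetical, and real autoscale also evaluates metrics over a time window, which is ignored here.

```python
from dataclasses import dataclass


@dataclass
class ScaleRule:
    min_instances: int = 2
    max_instances: int = 5
    cpu_threshold: float = 80.0  # percent CPU on a single instance
    instance_percent: int = 60   # percent of instances that must exceed the threshold
    step: int = 1                # instances added per scale-out action


def evaluate_scale_out(cpu_by_instance: list[float], rule: ScaleRule) -> tuple[int, bool]:
    """Return (target instance count, saturated flag) for one evaluation.

    saturated is True when the rule fires but the maximum count is already reached,
    which is the point where further load is likely to cause failed requests.
    """
    current = len(cpu_by_instance)
    hot = sum(1 for cpu in cpu_by_instance if cpu > rule.cpu_threshold)
    # Integer arithmetic avoids float rounding when comparing against 60%.
    triggered = hot * 100 >= rule.instance_percent * current
    if not triggered:
        return current, False
    saturated = current >= rule.max_instances
    target = min(current + rule.step, rule.max_instances)
    return target, saturated


rule = ScaleRule()
# 2 of 3 instances above 80% CPU: the rule fires and a fourth instance is added.
print(evaluate_scale_out([85.0, 90.0, 40.0], rule))
# 4 of 5 instances above 80% CPU: the rule fires but there is no headroom left.
print(evaluate_scale_out([85.0, 90.0, 88.0, 92.0, 50.0], rule))
```

Tracking the saturated flag alongside the CPU metrics during the load test makes it easy to see when the run hit the scaling ceiling.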
Takeaways from the scenario
- The CPU utilization metrics of the web servers are needed to understand the real-time behavior of the web server component and its resource needs
- Now that we have reached the permissible limits of scaling, any load beyond 5000 requests/hour would very likely lead to failures
- Before we perform load testing for an increased load, we need to address the limitation specified in point # 2. This can be done either by increasing the maximum permissible instance count or by allocating servers with more CPU resources while retaining the scaling limit
- Measure the baseline metrics of all the system components that are in the line of critical user flows
Measure Critical SLIs
This is an important step in understanding whether the system was able to reach the targeted SLI goals. At the end of the load test run, use Application Insights to capture the key SLI metrics. These include:
- HTTP requests – this corresponds directly to the availability SLI
- Other user experience factors, such as request latency
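As a rough sketch, the availability SLI can be computed from the HTTP status codes of the requests recorded during the run. The rule that any status below 500 counts as a good request is an assumption made for this example; the article does not define which statuses count as failures, and the sample data is made up.

```python
def availability_sli(status_codes: list[int]) -> float:
    """Percentage of requests that did not end in a server-side error.

    Assumption: a request is "good" when its HTTP status code is below 500.
    """
    if not status_codes:
        raise ValueError("no requests recorded")
    good = sum(1 for code in status_codes if code < 500)
    return 100.0 * good / len(status_codes)


# Hypothetical result of a one-hour 1x run: 4990 successes and 10 server errors.
sample = [200] * 4990 + [503] * 10
print(f"Availability SLI: {availability_sli(sample):.2f}%")
```

The same calculation applies to request data exported from the monitoring tool, as long as each row carries the result code of the request.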
Assess and Improve
The infrastructure, platform, and application logs and metrics give us a picture of how the system performs at a 1x load. If the system does not perform as expected (the metrics indicate that the system may fall short of the SLO), then the next step is to assess and improve. This involves identifying the specific SLIs that have fallen short of their targets, identifying the necessary fixes or changes in the system, and bringing those indicators up. Assessment factors include:
- Application design
- Reliability measures including fallbacks and graceful degradation
- Avoiding cascading failures caused by retry storms
- Latency due to long-running operations
- SKUs – Choice of SKUs of the cloud components
- Sizing – Sizing of the services and the ability to scale elastically to accommodate spikes in user traffic
- Complex interaction between the components
- Removing dependencies that are not critical and hence reducing the number of failure points
- Improving observability & monitoring to understand the system better if there were gray areas identified during the first run
Load Testing for SLOs
We have so far determined the necessary SLIs of the system with the 1x load test. The “Assess and Improve” step addresses the issues in the system and sets it up for a scenario of prolonged use, long enough to calculate the Service Level Objectives (SLOs). This is indicated in the flow diagram (excerpt) shown above. The differences from the previous step are:
- Tests would be run for 3 weeks (or the duration of the rolling window), simulating 120,000 requests/day (5000 requests/hour over 24 hours)
- Simulation should cover all the possible user flows
If running the system for the duration of a rolling window is not desirable for reasons of resource usage, cost incurred, or the lack of dynamic variables in the synthetic load, then another way of determining approximate SLOs is to use “data extrapolation”. With the data from a test that has been run for “N” hours, we can extrapolate to the duration of the rolling window.
e.g. From the “Requests” data exported from Application Insights for a day (120,000 requests), we can extrapolate the data to a 28-day representation (3,360,000 requests). Calculating the availability and latency SLIs from this data would give us the corresponding SLOs. Note: This is simply a projection of the system’s behavior and performance based on a small dataset and should be used just to forecast the necessary SLOs. The actual SLOs would be known when the system is deployed to production and we receive the real-time user data.
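The extrapolation can be sketched as a simple linear projection. The failed-request count of 120 per day is a made-up figure used only to show the arithmetic, and the function name is hypothetical.

```python
def extrapolate_requests(daily_total: int, daily_failed: int, window_days: int = 28) -> dict:
    """Linearly project one day of request data onto a rolling window."""
    if daily_total <= 0:
        raise ValueError("daily_total must be positive")
    total = daily_total * window_days
    failed = daily_failed * window_days
    return {
        "total_requests": total,
        "failed_requests": failed,
        "availability_percent": 100.0 * (total - failed) / total,
    }


# 120,000 requests/day from the load test, with an assumed 120 failures/day.
projection = extrapolate_requests(120_000, 120, window_days=28)
print(projection)
```

Because the projection is linear, the availability percentage for the window is the same as for the single day; only the absolute request and failure counts grow. This is one reason the result should be read as a forecast rather than a measured SLO.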