Part 1.7 – Composite SLO & SLA

Definition (from the documentation)

A composite SLA captures the end-to-end SLA across all application components and dependencies. It is calculated using the individual SLAs of Azure services housing application components and provides an important indicator of designed availability in relation to customer expectations and targets.

The composite SLA of a system in a single region is calculated using the basic formula

Composite SLA = SLA of Component-1 * SLA of Component-2 * … * SLA of Component-N
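As a minimal sketch of the single-region formula, the Python function below multiplies the individual SLAs together. The function name `composite_sla` and the two sample values are illustrative assumptions, not part of any Azure tooling.

```python
import math

def composite_sla(component_slas):
    """Multiply individual SLAs given as fractions (0.9995 means 99.95%)."""
    return math.prod(component_slas)

# Two hypothetical components at 99.95% and 99.99%
example = composite_sla([0.9995, 0.9999])
print(f"{example:.4%}")
```

Because every factor is at most 1, the composite value can never be higher than the weakest component in the chain.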

In cases where the application components have been deployed to multiple regions, the following formula needs to be used

Composite SLA across R regions = 1 - POWER(1 - N, R), that is, 1 - (1 - N)^R
N is the composite SLA for the application deployed in one region, expressed as a fraction (for example, 0.999 for 99.9%).
R is the number of regions where the application is deployed.
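The multi-region formula can be sketched in the same way. The function name `multi_region_sla` is an assumption made for illustration. Note that the formula treats regional failures as independent: the system counts as down only when all R regions are down at the same time.

```python
def multi_region_sla(single_region_sla, regions):
    """Return 1 - (1 - N)^R for a single-region SLA N (as a fraction) and R regions."""
    if regions < 1:
        raise ValueError("regions must be at least 1")
    return 1 - (1 - single_region_sla) ** regions

# A hypothetical application with 99.9% in one region, deployed to two regions
print(f"{multi_region_sla(0.999, 2):.4%}")
```

With R = 1 the function returns the single-region value unchanged, which is a quick sanity check on the formula.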

Microsoft-Provided Tool to Calculate the SLA

Microsoft provides a tool called the Service Level Agreement Estimator that can be used to calculate the SLA of your architecture. Its interface and experience are quite similar to those of the Azure Pricing Calculator. As mentioned in the previous article, architects have to be careful to choose the right SKU and performance tier for each service, because the tool is only as accurate as the information that you feed it. The tool reads the SLA data of the Azure services from an underlying JSON file. If you want to remove the dependency on a static file, you can get creative and check whether the SLAs of the Azure services can be queried from “Azure Charts” (https://azurecharts.com/sla), if there is an API 🙂

Calculating the Composite SLA for our E-Commerce App

Revisiting the Architecture Diagram

One important piece of information was not covered when we discussed Dependency Modeling using the RMA approach in the previous article. The Component Interaction Diagram (CID) can be considered complete only after we label each component with its SLA value. This is done using small sticky notes in the RMA-CID diagram shown below. Labeling the diagram matters because it keeps us from missing a component when we transfer the components and their SLA values to a table.

User Flow1: Read Products for the Catalog Page

| Dependency Name | Traffic Manager | Azure DNS | Azure Application Gateway | Virtual Machine Scale Sets (Web-Tier) | Internal Load Balancer (L4) | Virtual Machines for SQL Server (DB-Tier) | Domain Controllers |
|---|---|---|---|---|---|---|---|
| Dependency Criticality | Critical | Critical | Critical | Critical | Critical | Critical | Critical |
| Standalone SLO/SLA | 99.99% | 100% | 99.95% | 99.99% | 99.99% | 99.99% | 99.99% |
| SPOF? | Yes | No | No | No | No | No | No |
| Design Considerations | Azure Managed Cluster of DNS LBs | Azure Managed | Highly-Available v2 gateway SKU that is zone-redundant | Minimum of 2 instances deployed across Availability Zones | Highly available Zone-Redundant LB from Standard SKU | Windows Server Failover Cluster Instances with SQL Always-On Availability Group | Highly-Available Zone redundant Domain controllers hosted on Azure VMs |

Composite SLA (Single Region Deployments): 99.90%
Composite SLA (Multi-Region Deployments): 99.99990%
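The two composite values for User Flow 1 can be reproduced with a few lines of Python. This is a sketch that takes the standalone SLAs from the table as fractions. R = 2 is an assumption here: it is the region count that reproduces the multi-region figure.

```python
import math

# Standalone SLAs for User Flow 1, expressed as fractions
slas = {
    "Traffic Manager": 0.9999,
    "Azure DNS": 1.0,
    "Azure Application Gateway": 0.9995,
    "Virtual Machine Scale Sets (Web-Tier)": 0.9999,
    "Internal Load Balancer (L4)": 0.9999,
    "Virtual Machines for SQL Server (DB-Tier)": 0.9999,
    "Domain Controllers": 0.9999,
}

# Single region: product of all component SLAs
single_region = math.prod(slas.values())

# Multi-region: 1 - (1 - N)^R, with R = 2 assumed
two_regions = 1 - (1 - single_region) ** 2

print(f"Single region: {single_region:.2%}")  # 99.90%
print(f"Two regions:   {two_regions:.5%}")    # 99.99990%
```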

I have attached a template of this table in the GitHub Repo (https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/Design/DependencyAnalysis). This SLA calculation datasheet has to be created for all the major user flows. Doing this for every flow may look tedious, but it is the best way to capture the reliability characteristics of our architecture.

Explanation of the Fields

  • Dependency Name: The first row lists the names of all the components involved in the specific flow.
  • Dependency Criticality: According to the Azure Well-Architected practices and general SRE guidelines, it is important to classify our dependencies as “Critical” or “Non-Critical”. This helps in designing and understanding the behavior of the application if the dependency goes down.
  • Single Point of Failure (SPOF): Is this dependency a single point of failure? The answer is YES if the Azure services or the customer-managed IaaS components have been deployed without regional redundancy. If the component is a SPOF, its failure will bring the entire system down.
  • Design Considerations: If design decisions had to be made to make a component highly available (so that it is not a SPOF), this row captures that information.
  • Composite SLA (Single Region): The result of the formula that calculates the composite SLA of the system when deployed to a single region.
  • Composite SLA (Multi-Region): The result of the formula that calculates the composite SLA of the system when deployed to multiple regions.
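If the datasheet is kept in code rather than in a spreadsheet, each column of the table maps naturally to one record. The sketch below is only one possible shape: the `Dependency` class and its field names are hypothetical, and just three of the seven components of User Flow 1 are shown.

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    name: str
    criticality: str            # "Critical" or "Non-Critical"
    sla: float                  # standalone SLO/SLA as a fraction
    spof: bool                  # single point of failure?
    design_considerations: str

flow = [
    Dependency("Traffic Manager", "Critical", 0.9999, True,
               "Azure Managed Cluster of DNS LBs"),
    Dependency("Azure DNS", "Critical", 1.0, False, "Azure Managed"),
    Dependency("Azure Application Gateway", "Critical", 0.9995, False,
               "Highly-Available v2 gateway SKU that is zone-redundant"),
]

# Pull out the entries that need attention during design reviews
spofs = [d.name for d in flow if d.spof]
critical = [d.name for d in flow if d.criticality == "Critical"]
print("SPOFs:", spofs)
print("Critical dependencies:", critical)
```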

Why Are Active-Standby Deployments Needed?

For the application that we are trying to build in an HA state, even with all the services configured in HA mode with the maximum possible SLO, the single-region composite SLA comes to only 99.90%. The actual requirement was to make the system 99.99% available. The exercise that we just performed during the Design phase helped us understand that a single-region deployment is not going to be sufficient to achieve the expected SLO. We will need a minimum of 2 regions to achieve it.
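The minimum of 2 regions follows from solving 1 - (1 - N)^R >= target for R, which gives R >= log(1 - target) / log(1 - N). The sketch below applies that inequality. The function name `regions_needed` is hypothetical, and the result inherits the formula's assumption that regions fail independently.

```python
import math

def regions_needed(single_region_sla, target_sla):
    """Smallest R such that 1 - (1 - N)^R meets the target (both as fractions)."""
    if not (0 < single_region_sla < 1 and 0 < target_sla < 1):
        raise ValueError("SLAs must be fractions strictly between 0 and 1")
    ratio = math.log(1 - target_sla) / math.log(1 - single_region_sla)
    return max(1, math.ceil(ratio))

# 99.90% in a single region against a 99.99% target
print(regions_needed(0.9990, 0.9999))
```

For these inputs the ratio is about 1.33, so one region falls short and two regions are enough.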

The information captured in this sheet can be one of the most credible ways to help the customer/business understand that multi-region deployments are mandatory for a system with a 99.99% availability target, which allows a downtime of only 4 minutes and 22 seconds per month, or ~52 minutes per year.
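The downtime figures follow directly from the availability target. The sketch below assumes a 365-day year and an average month of one twelfth of that year (730 hours). Other month definitions shift the monthly figure by a few seconds.

```python
MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600 minutes; leap years ignored
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12   # 43,800 minutes in an average month

def allowed_downtime_minutes(sla, period_minutes):
    """Downtime budget in minutes for an SLA given as a fraction."""
    return (1 - sla) * period_minutes

monthly = allowed_downtime_minutes(0.9999, MINUTES_PER_MONTH)
yearly = allowed_downtime_minutes(0.9999, MINUTES_PER_YEAR)

print(f"Per month: {int(monthly)} min {int(monthly % 1 * 60)} s")  # 4 min 22 s
print(f"Per year:  {yearly:.2f} min")                              # 52.56 min
```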

References

  1. Composite SLAs – https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/business-metrics#composite-slas