Part 2.1 – Deployment Strategies

Deployment in Increments (Learn & Improve)

In Part 1 of this series, we discussed the steps needed to design and architect a reliable distributed system according to the well-architected design principles. With the artifacts from the design phase in hand, we are ready to deploy the system to a development or test environment. An SRE best practice is to deploy the system to the DEV/TEST environment, test its reliability there, and make the necessary changes before moving to the Production environment.

If your team consists of junior SRE members and product developers who have not paid much attention to the balance between feature rollout and system reliability, then achieving the target SLOs in the first deployment is going to be close to impossible. The process becomes all the more difficult when the target SLOs are high (e.g. 99.99% uptime, as in our e-commerce app) and the error budget is correspondingly small. In this case, the approach should be:

  • Deploy the system in the sand-box or the development environment
  • Configure Observability at all required points
  • Subject the system to different kinds of tests and experiments that can help validate the SLO requirements
  • Measure all the necessary SLIs and check them against the SLO targets. If the measurements fall short of the targets, go back to the system’s design and architecture and look for room for improvement
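
The last step above, together with the error-budget concern raised earlier, comes down to simple arithmetic. The following is a minimal Python sketch and not part of the original series; the function names, the 30-day window, and the request counts are assumptions chosen purely for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Request-based availability SLI, expressed as a percentage."""
    if total_requests == 0:
        raise ValueError("no requests observed in the measurement window")
    return 100 * good_requests / total_requests


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    slo = 99.99  # target from the e-commerce example
    # Hypothetical test-run numbers: 150 failed requests out of 1,000,000.
    sli = availability_sli(good_requests=999_850, total_requests=1_000_000)
    print(f"Error budget: {error_budget_minutes(slo):.2f} minutes per 30 days")
    print(f"Measured SLI: {sli:.3f}%  meets SLO: {meets_slo(sli, slo)}")
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime per 30 days, and the hypothetical run at 99.985% misses the target, which is the signal to revisit the design.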

We will look at the above-mentioned steps in greater detail in the next article in this series. Paying attention to this feedback loop early in the product development cycle helps in assessing the design decisions and shows whether the SLOs are achievable with the current design.

src: Author’s diagram. An excerpt from the entire process flow diagram in building reliable systems

Gated Deployments (Reliability Gates)

The second important measure is to use gated deployments. Before the infrastructure is deployed to the cloud, it needs to be checked for the presence of the appropriate reliability configurations. This is accomplished with Azure Policy. The example below shows one such policy definition:

{
  "mode": "Indexed",
  "policyRule": {
    "if": {
      "allOf": [
        {
          "field": "type",
          "equals": "Microsoft.Network/applicationGateways"
        },
        {
        "anyOf": [{
          "field": "zones",
          "exists": "false"
        },
        {
         "field": "zones",
         "notEquals": "[parameters('allowedzoneids')]"
        }]
        }
      ]
    },
    "then": {
      "effect": "[parameters('effect')]"
    }
  },
  "parameters": {
    "effect": {
      "type": "String",
      "metadata": {
        "displayName": "Effect",
        "description": "Enable or disable the execution of the policy"
      },
      "allowedValues": [
        "Audit",
        "Deny",
        "Disabled"
      ],
      "defaultValue": "Deny"
    },
    "allowedzoneids": {
      "type": "Array",
      "metadata": {
        "description": "Allowed zone ids for the app gateway instances",
        "displayName": "Allowed App-Gw Zone Ids"
      },
      "defaultValue": ["1","2","3"]
    }
  }
}

The policy above is a simple Azure enforcement policy that requires the application gateway to be deployed across all 3 availability zones. If the infrastructure code (be it an ARM template or Terraform) does not configure zones at all, or configures a zonal deployment of the gateway, then the deployment fails. This is a simple yet powerful mechanism for catching issues at the pre-deployment stage.
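
To make the intent described in the paragraph above concrete, here is a small Python sketch of the same check applied to plain dictionaries. It is an illustration only and not how the Azure Policy engine evaluates rules; the function name `violates_zone_policy` and the sample resources are hypothetical.

```python
APP_GATEWAY_TYPE = "Microsoft.Network/applicationGateways"
ALLOWED_ZONE_IDS = ["1", "2", "3"]  # mirrors the policy's default 'allowedzoneids'


def violates_zone_policy(resource: dict, allowed_zone_ids=ALLOWED_ZONE_IDS) -> bool:
    """Return True when the resource should be flagged by the zone rule.

    Simplified stand-in for the policy's 'if' block: the resource must be an
    application gateway AND (have no zones OR not span every allowed zone).
    """
    if resource.get("type") != APP_GATEWAY_TYPE:
        return False  # the rule only targets application gateways
    zones = resource.get("zones")
    if zones is None:
        return True  # zones not configured at all
    # Compare as sorted lists so the order of zone ids does not matter.
    return sorted(zones) != sorted(allowed_zone_ids)


if __name__ == "__main__":
    samples = {
        "no zones": {"type": APP_GATEWAY_TYPE},
        "zonal": {"type": APP_GATEWAY_TYPE, "zones": ["1"]},
        "zone-redundant": {"type": APP_GATEWAY_TYPE, "zones": ["1", "2", "3"]},
    }
    for name, resource in samples.items():
        verdict = "deny" if violates_zone_policy(resource) else "allow"
        print(f"{name}: {verdict}")
```

A check like this can serve as a quick local sanity test of infrastructure definitions, while the policy assignment itself remains the actual gate at deployment time.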

The SRE and engineering teams should invest time in building an exhaustive collection of policies that check the system’s reliability configuration.

I have created 4 sample policies, available in the GitHub repo: https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/ReliabilityPolicies