Part 1.4 – System Architecture (driven by Well-Architected Reliability Principles)

In part 1.3 of this series, we discussed the questions that customers need to answer about the reliability requirements of the system. That is also the step where the other functional and non-functional requirements would have been captured. The first step in determining the availability SLI and SLO of the system is to understand the user flows, the components in each critical user flow, the detailed dependency graphs, and the possible failures.
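To make the link between a dependency graph and an availability target concrete, here is a minimal sketch. It assumes that every component in a critical user flow must be up for the flow to succeed and that failures are independent. The flow name, component names, and availability figures are hypothetical placeholders, not values from this series.

```python
from math import prod


def serial_availability(components):
    """Availability of a flow that needs every listed component to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(components.values())


def redundant_availability(single, copies):
    """Availability when any one of `copies` independent deployments suffices.

    The flow is down only if all copies are down at the same time.
    """
    return 1 - (1 - single) ** copies


# Hypothetical components of one critical user flow (illustrative figures only).
checkout_flow = {
    "gateway": 0.9995,
    "web_app": 0.9995,
    "database": 0.9999,
}

single_region = serial_availability(checkout_flow)
two_regions = redundant_availability(single_region, copies=2)

print(f"single region: {single_region:.4%}")
print(f"two regions:   {two_regions:.6%}")
```

The two-region figure is an upper bound under these assumptions. It ignores the reliability of the failover mechanism itself and any failures that are correlated across regions.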

Microsoft has helped us a great deal with the Well-Architected reliability principles and design checklist. These are meant to be used by CSPs when creating the design and architecture of the system. The checklist prepared from Microsoft's documentation is available in my GitHub repo. Please note that the checklist contains questions and suggestions from the documentation and from the SRE workbook. It can be enhanced, improved, or modified with the knowledge that CSPs have gathered over the years of assisting customers.

https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/Design/ReliabilityChecklist

The following are the major categories that the design is based on:

  1. Availability & Recovery Targets
  2. Application Requirements
  3. Data Requirements
  4. Networking Requirements
  5. Infrastructure Requirements
  6. Monitoring Design
  7. Testing & Experimentation

System Architecture Diagram

Well-Architected-Reliability - Reliability-Reference Architecture

The system shown in the diagram is a mix of two HA+DR reference architectures.

By the time this architecture diagram is complete, the CSP or the partner should have answered all the questions in the checklist. Each answer should be a simple Yes or No. If a particular principle is not going to be considered, there should be a corresponding justification for that decision.
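The Yes/No-plus-justification rule can be checked mechanically once the checklist is captured as data. The sketch below is one possible shape for that, not the format used in the repo. The `ChecklistEntry` type, its field names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChecklistEntry:
    category: str
    question: str
    considered: bool          # True = "Yes", False = "No"
    justification: str = ""


def missing_justifications(entries):
    """Return the entries answered 'No' that carry no justification."""
    return [
        entry for entry in entries
        if not entry.considered and not entry.justification.strip()
    ]


# Hypothetical entries, for illustration only.
entries = [
    ChecklistEntry("Monitoring Design", "Are health probes defined?", True),
    ChecklistEntry("Networking Requirements", "Is a second ingress path needed?",
                   False, "Single ingress accepted for this workload."),
    ChecklistEntry("Testing & Experimentation", "Is fault injection planned?", False),
]

for entry in missing_justifications(entries):
    print(f"Needs justification: [{entry.category}] {entry.question}")
```

A check like this could run as part of a design review so that no principle is dropped without a recorded reason.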

Example

Category: Data Requirements
Reliability Question: Do we have stringent data consistency requirements that will be affected by availability (C or A in distributed systems)?
Considered in Design?: Moderate requirements to balance consistency and availability.
Justification: The commit strategy uses synchronous commit to 2 out of 3 replicas for a write operation. Consistency is favored among DB instances within the same region. Between cross-region instances, availability is favored over consistency (using asynchronous replication).
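The "2 out of 3 replicas" commit strategy in the example is a quorum write. The following is a simplified, in-memory sketch of that idea only. The `InMemoryReplica` class and `quorum_write` function are hypothetical and do not represent how any particular database engine implements synchronous commit.

```python
class InMemoryReplica:
    """Stand-in for a database replica (hypothetical, for illustration)."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.log = []

    def write(self, record):
        if not self.healthy:
            raise ConnectionError("replica unavailable")
        self.log.append(record)


def quorum_write(replicas, record, quorum=2):
    """Attempt the write on every replica and report whether a quorum acknowledged it."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(record)
            acks += 1
        except ConnectionError:
            continue  # an unavailable replica simply does not acknowledge
    return acks >= quorum


# One replica down: the write still reaches 2 of 3 replicas and is acknowledged.
replicas = [InMemoryReplica(), InMemoryReplica(healthy=False), InMemoryReplica()]
print(quorum_write(replicas, {"order_id": 1}))
```

With one replica down the write is still acknowledged, which is how this strategy trades a little availability for consistency within a region. A real system would also have to handle the case where fewer than `quorum` replicas accept the write, for example by rolling back the partial write. This sketch leaves that out.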

Other Artifacts

At the end of this step, we should have one more artifact in addition to the architecture diagrams: two versions of the cost estimate (with and without all the HA-DR measures), as detailed in the previous article. The difference between the two estimates helps customers decide on the “High Availability (Reliability) vs Cost Savings” trade-off.
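The comparison between the two estimates comes down to simple arithmetic, sketched below. The function name and the monthly figures are placeholders chosen for illustration. They are not taken from the estimation sheets.

```python
def ha_dr_premium(with_ha_dr, without_ha_dr):
    """Return the absolute and relative extra cost of the HA-DR measures."""
    extra = with_ha_dr - without_ha_dr
    return extra, extra / without_ha_dr


# Placeholder monthly estimates in an arbitrary currency (not real figures).
extra, ratio = ha_dr_premium(with_ha_dr=1500.0, without_ha_dr=1000.0)
print(f"HA-DR adds {extra:.2f} per month ({ratio:.0%} over the single-region estimate)")
```

Presenting the premium as both an absolute amount and a percentage makes it easier for the customer to weigh it against the cost of downtime.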

For the architecture shown above, the following folder has 2 estimation sheets. The second sheet contains estimates based only on a single region. This can be modified to suit the kind of setup that you plan on implementing in your environment. The use-case architecture uses an “Active-Hot Standby” type of HA-DR setup. If the design decision is to use an “Active-Passive” or “Active-Warm Standby” setup, then the cloud costs would vary accordingly.

https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/Design/CostEstimations

References

  1. Design Checklist: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/design-checklist
  2. Testing Checklist: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/test-checklist
  3. Monitoring Checklist: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/monitor-checklist