Part 1.6 – Modeling Dependencies

After defining the critical user flows and a reasonably accurate version of the “Service or Components Graph”, the next step is to identify the dependencies of these components, both internal and external.

Microsoft Resilience Modeling and Analysis (RMA)

Definition from the White Paper:

This paper briefly frames the motivation and benefits of incorporating robust resilience design into the development cycle. It describes Resilience Modeling and Analysis (RMA), a methodology for improving resiliency adapted from the industry-standard technique known as Failure Mode and Effects Analysis (FMEA), and provides guidance for implementation.

Source: Resilience by Design for Cloud Services white paper (reference 1)

RMA is carried out in four phases:
  • Pre-work – Creates a diagram to capture resources, dependencies, and component interactions.
  • Discover – Identifies failures and resilience gaps.
  • Rate – Performs impact analysis.
  • Act – Produces work items to improve resilience.

The Pre-work phase calls for a Component Interaction Diagram (CID) that captures all the components, their dependencies, and the way they interact with each other. Architects should spend ample time on this step, because the diagram is the basis for determining how reliable the system can be given the current architecture and design decisions.
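To make the idea concrete, a CID can be thought of as a directed graph of components and their dependencies. The following is only a minimal sketch in Python. The component names, the `Component` class, and the `transitive_dependencies` helper are illustrative assumptions, not something the white paper prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One node of a (hypothetical) Component Interaction Diagram."""
    name: str
    external: bool = False  # True for third-party or platform-managed services
    depends_on: list[str] = field(default_factory=list)


def transitive_dependencies(cid: dict[str, Component], start: str) -> set[str]:
    """Collect every component that `start` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(cid[start].depends_on)
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited; also protects against cycles
        seen.add(name)
        stack.extend(cid[name].depends_on)
    return seen


# Illustrative components only; a real diagram comes from your own architecture.
cid = {c.name: c for c in [
    Component("web-app", depends_on=["product-api"]),
    Component("product-api", depends_on=["search-index", "sql-db"]),
    Component("search-index", external=True),
    Component("sql-db", external=True),
]}

deps_of_web_app = transitive_dependencies(cid, "web-app")
```

Holding the diagram as data like this makes it easy to ask which components sit behind a given entry point, which is the question the later RMA phases keep returning to.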

The “Discover” and “Rate” phases of the process help in determining the Composite SLO of the system. We will discuss that concept in greater detail in the next article. The following passages from Google’s SRE workbook are worth reading:

For example, if a single component is a critical dependency for a particularly high-value interaction, its reliability guarantee should be at least as high as the reliability guarantee of the dependent action. The team that runs that particular component needs to own and manage its service’s SLO in the same way as the overarching product SLO.

If a particular component has inherent reliability limitations, the SLO can communicate that limitation. If the user journey that depends upon it needs a higher level of availability than that component can reasonably provide, you need to engineer around that condition. You can either use a different component or add sufficient defenses (caching, offline store-and-forward processing, graceful degradation, etc.) to handle failures in that component.

Unless each of these dependencies and failure patterns is carefully enumerated and accounted for, any such calculations will be deceptive.
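The two rules of thumb in these passages can be sketched in a few lines of Python. This is a rough illustration, not a method taken from the SRE workbook: the SLO figures are made up, the function names are hypothetical, and the fallback formula assumes the primary component and its defense fail independently, which rarely holds exactly in practice.

```python
def below_target(flow_slo, dependency_slos):
    """Return the critical dependencies whose SLO is lower than the flow's SLO."""
    return sorted(name for name, slo in dependency_slos.items() if slo < flow_slo)


def with_fallback(primary, fallback):
    """Availability when a fallback (e.g. a cache) serves requests while the primary is down.

    Assumes independent failures: the pair is unavailable only when both are down.
    """
    return 1 - (1 - primary) * (1 - fallback)


# Made-up numbers for illustration only.
flow_slo = 0.999
dependency_slos = {
    "product-api": 0.9995,
    "search-index": 0.999,
    "recommendations": 0.99,
}

gaps = below_target(flow_slo, dependency_slos)
defended = with_fallback(dependency_slos["recommendations"], 0.99)
print(gaps, round(defended, 6))
```

With these numbers only the hypothetical `recommendations` component falls short of the flow's target, and putting a 99% available fallback in front of it lifts the pair to roughly 99.99%, which is the "engineer around it" option the passage describes.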

Component Interaction Diagram

The following image shows the CID for the E-Commerce App that we have taken as the use case. To arrive at this diagram, we would have completed the following steps:

  1. Identified one or more critical user flows (flows defined in product- or app-specific terms, each of which translates to one common request flow in the cloud architecture diagram).
  2. Captured all the internal and external dependencies of the components in these flows. For example, the base architecture diagram would have Azure Active Directory (AAD) as a dependency of the web application if AAD is used as the identity store. For the product search or add-to-cart flows, however, we have taken AAD out of the diagram, on the assumption that the user’s session is valid (authenticated) and the user carries a valid access token.
  3. Marked the SLO of each component. Note that the SLOs vary with the SKU of the services you choose and with the distribution of your IaaS instances across Availability Zones (AZs). Verify and mark the exact numbers, because they are very important in determining the Composite SLA of the system.
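The three steps above can be sketched as a small Python script. Everything in it is an assumption made for illustration: the component names, the `architecture` map, and above all the SLO values, which are placeholders rather than published figures for any Azure SKU. Multiplying the values is only the usual first approximation for components called in series, and it assumes independent failures.

```python
import math

# Hypothetical base architecture: component -> direct dependencies.
architecture = {
    "web-app": ["aad", "product-api"],
    "product-api": ["search-index", "sql-db"],
    "aad": [],
    "search-index": [],
    "sql-db": [],
}

# Placeholder SLOs; look up the real values for your SKUs and zone layout.
slo = {"web-app": 0.9995, "product-api": 0.9995, "search-index": 0.999, "sql-db": 0.9999}


def components_in_flow(entry, excluded):
    """Steps 1 and 2: walk everything reachable from the flow's entry point,
    skipping components deliberately scoped out of this flow."""
    found, stack = [], [entry]
    while stack:
        name = stack.pop()
        if name in excluded or name in found:
            continue
        found.append(name)
        stack.extend(architecture[name])
    return found


def missing_slos(components):
    """Step 3 check: every component in the flow must carry an SLO."""
    return [c for c in components if c not in slo]


# Product search flow; AAD is excluded because the user is assumed to be authenticated.
flow = components_in_flow("web-app", excluded={"aad"})
unmarked = missing_slos(flow)

# Naive serial estimate: the flow works only if every component in it works.
serial_estimate = math.prod(slo[c] for c in flow)
```

With these placeholder values the estimate comes out at roughly 99.79%, lower than any single component's figure, which is why each number on the diagram needs to be verified rather than guessed.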

References

  1. Resilience by Design for Cloud Services, White Paper – https://download.microsoft.com/download/D/8/C/D8C599A4-4E8A-49BF-80EE-FE35F49B914D/Resilience_by_Design_for_Cloud_Services_White_Paper.pdf
  2. Dependency modeling, from “Implementing SLOs” in Google’s SRE workbook – https://sre.google/workbook/implementing-slos/