Part 1.8 – Failure Mode Effect Analysis

If you have listened to even one SRE expert talk about reliability in the cloud era, you will have heard them insist on a change of mindset:

  1. Embrace failures, because failures will happen, and
  2. Design systems in the cloud to reduce time to recover (TTR), rather than following the older approach of extending time between failures (TBF).

With this change in mindset, a process that helps the engineering and product teams prepare for failure is Failure Mode Effect Analysis (FMEA). The idea is that once the system architecture has been prepared and the user flows derived, members of every team that has some responsibility for the system take part in this collaborative activity. Together they identify the possible failure points (individual components, cloud networks, external dependencies) and the different modes in which each of them can fail.

FMEA is a flexible framework for performing qualitative failure analysis in industrial and computing systems. Potential failures are identified and the consequences of those failures are analyzed to assess risk. By completing the RMA process, an engineering team will have thought through many of the reliability issues in depth and be better equipped to ensure that when failures occur, the impacts to customers are minimized.

src: Resilience_by_Design_for_Cloud_Services_White_Paper

Failure Categories

If you are doing RMA (Resilience Modeling and Analysis, the name the whitepaper gives its FMEA-based process) for the first time, you will likely have questions about the failure categories and which category applies to which set of cloud services. The following categories, excerpted from the paper, are a reasonable starting point:

  1. Non-Existence
  2. Authentication & Authorization Issues
  3. Latency
  4. Failed requests due to incorrectness of information
  5. Degradation of the resources
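
For teams that keep the worksheet in code rather than in a spreadsheet, the categories above could be captured as an enumeration and used as a review checklist for each component. This is only a minimal sketch: the names `FailureCategory` and `categories_not_yet_reviewed`, and the member identifiers, are illustrative and do not come from the whitepaper.

```python
from enum import Enum
from typing import Iterable, List


class FailureCategory(Enum):
    """Starter failure categories from the list above (identifiers are illustrative)."""
    NON_EXISTENCE = "Non-Existence"
    AUTH = "Authentication & Authorization Issues"
    LATENCY = "Latency"
    INCORRECTNESS = "Failed requests due to incorrectness of information"
    DEGRADATION = "Degradation of the resources"


def categories_not_yet_reviewed(reviewed: Iterable[FailureCategory]) -> List[FailureCategory]:
    """Return the categories a team has not yet considered for a component.

    The result keeps the order in which the categories are declared.
    """
    done = set(reviewed)
    return [category for category in FailureCategory if category not in done]


if __name__ == "__main__":
    # Example: the team has only discussed latency and auth failures so far.
    for category in categories_not_yet_reviewed(
        [FailureCategory.LATENCY, FailureCategory.AUTH]
    ):
        print("Still to review:", category.value)
```

Walking through every category for every failure point is one simple way to make sure the first session does not stop at the obvious "component is down" case.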

Applying FMEA/RMA to the E-Commerce Application

Using the RMA diagram that we came up with in the previous step, we can now build the FMEA worksheet. The following table is an excerpt from the detailed sheet that is available on GitHub (https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/Design/FMEA).

| Failure Point | Failure Modes | Failure Short Name | Failure Short Description | Impact/Effect | System Response | Possible Remediation |
| --- | --- | --- | --- | --- | --- | --- |
| Azure App Gateway | Application Gateway is unavailable | Existence: Public App-GW | Primary Site is not accessible: Gateway is down | Entire Customer Website becomes inaccessible | System attempts to failover to the secondary site | Failover to the secondary site after the configured “N” retries and “M” failed responses |
| Azure App Gateway | Application Gateway Degradation | Degradation: Resource Limitation observed | Application Gateway CPU/Memory Metrics indicate that it is unable to handle the volume of traffic | No Impact | App-Gw autoscales to handle the request rate. Note: The scaleout limit should not have been reached already | |
| Azure App Gateway | Application Gateway Degradation | Degradation: Scale-out failure | Application Gateway is unable to autoscale: autoscale limit has been reached | User Requests can get dropped (unwanted load shedding can happen) | System would start returning error responses | Manual intervention to either a) increase the threshold of autoscale limit or b) analyze the request volume pattern |
| Azure App Gateway | Application Gateway Connection Failure | Bad Auth: expired TLS certificate | TLS Handshake failure due to expired TLS certificate of the site hosted at the app gateway | Users may be shown the warning of accessing a risky site | | The TLS handshake failure if configured for alerting, can then be acted on accordingly |
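
If the worksheet needs to be queried or fed into other tooling, each row can be modelled as a small record. The sketch below is one possible shape, not the format used in the GitHub sheet: the class name `FmeaEntry`, its field names, and the helper `customer_impacting` are assumptions made for illustration. The three sample rows are copied from the excerpt above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FmeaEntry:
    """One worksheet row; field names mirror the column headers (illustrative)."""
    failure_point: str
    failure_mode: str
    short_name: str
    description: str
    impact: str
    system_response: str = ""
    remediation: str = ""


WORKSHEET = [
    FmeaEntry(
        failure_point="Azure App Gateway",
        failure_mode="Application Gateway is unavailable",
        short_name="Existence: Public App-GW",
        description="Primary Site is not accessible: Gateway is down",
        impact="Entire Customer Website becomes inaccessible",
        system_response="System attempts to failover to the secondary site",
        remediation='Failover to the secondary site after the configured "N" '
                    'retries and "M" failed responses',
    ),
    FmeaEntry(
        failure_point="Azure App Gateway",
        failure_mode="Application Gateway Degradation",
        short_name="Degradation: Resource Limitation observed",
        description="CPU/Memory metrics indicate the gateway cannot handle the traffic volume",
        impact="No Impact",
        system_response="App-Gw autoscales to handle the request rate",
    ),
    FmeaEntry(
        failure_point="Azure App Gateway",
        failure_mode="Application Gateway Degradation",
        short_name="Degradation: Scale-out failure",
        description="Application Gateway is unable to autoscale: autoscale limit has been reached",
        impact="User Requests can get dropped (unwanted load shedding can happen)",
        system_response="System would start returning error responses",
        remediation="Manual intervention to increase the autoscale limit "
                    "or analyze the request volume pattern",
    ),
]


def customer_impacting(entries: Iterable[FmeaEntry]) -> List[FmeaEntry]:
    """Keep only the rows whose Impact/Effect column is not 'No Impact'."""
    return [e for e in entries if e.impact.strip().lower() != "no impact"]


if __name__ == "__main__":
    for entry in customer_impacting(WORKSHEET):
        print(f"{entry.short_name}: {entry.impact}")
```

A filter like `customer_impacting` is a simple way to pull out the rows that deserve the first round of discussion when the full sheet grows to dozens of entries.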

FMEA as a preparation for Chaos Engineering

Chaos Engineering (sometimes known as Fault Injection Testing, or FIT) is an SRE practice for experimentally determining the reliability of a system by introducing faults of various kinds at various points in it. We will cover this topic in greater detail later in this series. The FMEA worksheet, which captures the failure points and the failure modes, can be used as an input when preparing the chaos scenarios. One caveat is that not all failures can be manually introduced into the system; this is especially true when the system runs in the public cloud and the majority of the selected services are PaaS. There are, however, ways to work around these limitations.
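
One way to picture the hand-off from worksheet to chaos scenarios is sketched below. Everything here is hypothetical: the `injectable` flag, the class names, and the helper `to_scenarios` are illustrative, and the True/False values are placeholders rather than statements about which faults Azure actually lets you inject. No real chaos tooling API is used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    """A trimmed-down worksheet row used as chaos-planning input."""
    failure_point: str
    short_name: str
    expected_response: str
    injectable: bool  # hypothetical flag: can the team trigger this fault itself?


@dataclass(frozen=True)
class ChaosScenario:
    """A planned experiment: what to break, where, and what should happen."""
    target: str
    fault: str
    hypothesis: str


def to_scenarios(modes: Iterable[FailureMode]) -> List[ChaosScenario]:
    """Turn every injectable failure mode into a chaos scenario.

    Non-injectable modes are skipped here; they need a workaround
    (for example, simulating the fault at a layer the team controls).
    """
    return [
        ChaosScenario(
            target=mode.failure_point,
            fault=mode.short_name,
            hypothesis=mode.expected_response,
        )
        for mode in modes
        if mode.injectable
    ]


# Placeholder values for illustration only.
EXAMPLE_MODES = [
    FailureMode(
        failure_point="Azure App Gateway",
        short_name="Existence: Public App-GW",
        expected_response="System attempts to failover to the secondary site",
        injectable=False,
    ),
    FailureMode(
        failure_point="Azure App Gateway",
        short_name="Degradation: Scale-out failure",
        expected_response="System would start returning error responses",
        injectable=True,
    ),
]

if __name__ == "__main__":
    for scenario in to_scenarios(EXAMPLE_MODES):
        print(f"Inject '{scenario.fault}' on {scenario.target}; "
              f"expect: {scenario.hypothesis}")
```

Carrying the worksheet's "System Response" column over as the experiment hypothesis keeps the chaos scenario tied to what the design review already predicted.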

Another important concept, which we will also discuss later, is that the FMEA worksheet created during the design phase and the chaos experiment results obtained during the testing phase are meant to complement each other.
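
As a rough illustration of how the two artifacts can complement each other, the sketch below compares the system response predicted in the worksheet with what an experiment observed. The function name `find_gaps` and the dictionary layout are assumptions, and the `OBSERVED` values are made-up sample data, not results from a real experiment.

```python
from typing import Dict

# Expected responses, keyed by "Failure Short Name" from the worksheet excerpt.
EXPECTED: Dict[str, str] = {
    "Existence: Public App-GW": "System attempts to failover to the secondary site",
    "Degradation: Resource Limitation observed": "App-Gw autoscales to handle the request rate",
    "Degradation: Scale-out failure": "System would start returning error responses",
}

# Hypothetical observations, invented purely to exercise the comparison.
OBSERVED: Dict[str, str] = {
    "Existence: Public App-GW": "System attempts to failover to the secondary site",
    "Degradation: Resource Limitation observed": "Requests queued before scale-out completed",
}


def find_gaps(expected: Dict[str, str], observed: Dict[str, str]) -> Dict[str, str]:
    """Report failure modes that are untested or behaved differently than predicted."""
    gaps: Dict[str, str] = {}
    for name, predicted in expected.items():
        actual = observed.get(name)
        if actual is None:
            gaps[name] = "not yet tested"
        elif actual != predicted:
            gaps[name] = f"expected '{predicted}', observed '{actual}'"
    return gaps


if __name__ == "__main__":
    for name, gap in find_gaps(EXPECTED, OBSERVED).items():
        print(f"{name}: {gap}")
```

Entries reported as gaps point either to a worksheet row that needs correcting or to a scenario that still has to be run, which is the feedback loop between design and experimentation.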