Part 1.8 – Failure Mode and Effects Analysis
If you have listened to even one SRE expert talk about reliability in the cloud, you will have heard them insist on a change of mindset:
- Embrace failures, because failures happen, and
- Design systems in the cloud to reduce time to recover (TTR), rather than following the older approach of extending time between failures (TBF).
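To see why the emphasis shifts to recovery time, the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), can be worked through with a few numbers. The sketch below is only an illustration: the function name and every figure in it are invented for the example and do not come from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 500 hours, 2 hours to recover.
baseline = availability(500, 2)

# Option A (old approach): double the time between failures.
longer_tbf = availability(1000, 2)

# Option B (cloud approach): same failure rate, but recover in 6 minutes.
faster_ttr = availability(500, 0.1)

for label, value in [
    ("baseline", baseline),
    ("longer TBF", longer_tbf),
    ("faster TTR", faster_ttr),
]:
    print(f"{label:>10}: {value:.5f}")
```

With these made-up numbers, cutting recovery time improves availability more than doubling the time between failures does. In the cloud, recovery time is also usually the variable a team has more control over.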
With this change in mindset, a process that helps the engineering and product teams prepare for failure is Failure Mode and Effects Analysis (FMEA). The idea is that, once the system architecture has been prepared and the user flows derived, members of every team that owns a part of the system take part in a collaborative exercise to identify the possible failure points (individual components, cloud networks, external dependencies) and the different modes in which those points can fail.
FMEA is a flexible framework for performing qualitative failure analysis in industrial and computing systems. Potential failures are identified and the consequences of those failures are analyzed to assess risk. By completing the RMA process, an engineering team will have thought through many of the reliability issues in depth and be better equipped to ensure that when failures occur, the impacts to customers are minimized.
src: Resilience_by_Design_for_Cloud_Services_White_Paper
Failure Categories
If you are doing RMA (Resilience Modeling and Analysis, the white paper's adaptation of FMEA) for the first time, you will probably have questions about the failure categories and which category applies to which set of cloud services. The following categories, excerpted from the paper, are a good starting point:
- Non-Existence
- Authentication & Authorization Issues
- Latency
- Failed requests due to incorrectness of information
- Degradation of the resources
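One way to keep these categories consistent across a worksheet is to encode them once and derive the category from each failure's short name. The sketch below assumes short names follow a "Prefix: detail" pattern such as "Degradation: Scale-out failure". The prefix-to-category mapping and the `categorize` helper are hypothetical conventions for this example and are not taken from the paper.

```python
from enum import Enum


class FailureCategory(Enum):
    NON_EXISTENCE = "Non-Existence"
    AUTH = "Authentication & Authorization Issues"
    LATENCY = "Latency"
    INCORRECTNESS = "Failed requests due to incorrectness of information"
    DEGRADATION = "Degradation of the resources"


# Assumed naming convention: the text before the first colon in a short
# name identifies the category. These prefixes are illustrative only.
PREFIX_TO_CATEGORY = {
    "existence": FailureCategory.NON_EXISTENCE,
    "bad auth": FailureCategory.AUTH,
    "latency": FailureCategory.LATENCY,
    "incorrectness": FailureCategory.INCORRECTNESS,
    "degradation": FailureCategory.DEGRADATION,
}


def categorize(short_name: str) -> FailureCategory:
    """Map a 'Prefix: detail' short name to its failure category."""
    prefix = short_name.split(":", 1)[0].strip().lower()
    if prefix not in PREFIX_TO_CATEGORY:
        raise ValueError(f"Unknown failure category prefix: {prefix!r}")
    return PREFIX_TO_CATEGORY[prefix]


print(categorize("Degradation: Scale-out failure").value)
```

Rejecting unknown prefixes, rather than silently defaulting, makes it obvious during review when a new failure mode does not fit any of the starter categories.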
Applying FMEA/RMA to the E-Commerce Application
Using the RMA diagram that we came up with in the previous step, we can go ahead and build the FMEA worksheet. The following table is an excerpt from the detailed sheet that is available on GitHub (https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/Design/FMEA).
Failure Point | Failure Modes | Failure Short Name | Failure Short Description | Impact/Effect | System Response | Possible Remediation |
Azure App Gateway | Application Gateway is unavailable | Existence: Public App-GW | Primary Site is not accessible: Gateway is down | Entire Customer Website becomes inaccessible | System attempts to failover to the secondary site | Failover to the secondary site after the configured “N” retries and “M” failed responses |
Azure App Gateway | Application Gateway Degradation | Degradation: Resource Limitation observed | Application Gateway CPU/Memory Metrics indicate that it is unable to handle the volume of traffic | No Impact | App-Gw autoscales to handle the request rate (Note: the scale-out limit should not have been reached already) | |
Azure App Gateway | Application Gateway Degradation | Degradation: Scale-out failure | Application Gateway is unable to autoscale: autoscale limit has been reached | User Requests can get dropped (unwanted load shedding can happen) | System would start returning error responses | Manual intervention to either a) increase the threshold of autoscale limit or b) analyze the request volume pattern |
Azure App Gateway | Application Gateway Connection Failure | Bad Auth: expired TLS certificate | TLS Handshake failure due to expired TLS certificate of the site hosted at the app gateway | Users may be shown the warning of accessing a risky site | The TLS handshake failure, if configured for alerting, can then be acted on accordingly | |
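Once the worksheet grows, it helps to treat it as data so that gaps can be found mechanically. The sketch below is one possible representation, not the format used in the linked repository. The `FmeaEntry` class and the `needs_remediation` helper are hypothetical. The rows copy a subset of the columns from the excerpt above, with cells mapped by position and the failure point for the last three rows assumed to be the same Azure App Gateway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FmeaEntry:
    failure_point: str
    failure_mode: str
    short_name: str
    impact: str
    system_response: str
    remediation: str = ""  # empty when the worksheet lists none


GATEWAY = "Azure App Gateway"

worksheet = [
    FmeaEntry(GATEWAY, "Application Gateway is unavailable",
              "Existence: Public App-GW",
              "Entire Customer Website becomes inaccessible",
              "System attempts to failover to the secondary site",
              "Failover to the secondary site after N retries and M failed responses"),
    FmeaEntry(GATEWAY, "Application Gateway Degradation",
              "Degradation: Resource Limitation observed",
              "No Impact",
              "App-Gw autoscales to handle the request rate"),
    FmeaEntry(GATEWAY, "Application Gateway Degradation",
              "Degradation: Scale-out failure",
              "User Requests can get dropped",
              "System would start returning error responses",
              "Increase the autoscale limit or analyze the request volume pattern"),
    FmeaEntry(GATEWAY, "Application Gateway Connection Failure",
              "Bad Auth: expired TLS certificate",
              "Users may be shown the warning of accessing a risky site",
              "TLS handshake failure, if configured for alerting, can be acted on"),
]


def needs_remediation(entries):
    """Return entries that affect users but have no remediation recorded."""
    return [e for e in entries if e.impact != "No Impact" and not e.remediation]


for entry in needs_remediation(worksheet):
    print(f"Missing remediation: {entry.short_name}")
```

A check like this can run whenever the worksheet changes, so that a failure mode with customer impact never ships without an agreed remediation.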
FMEA as a preparation for Chaos Engineering
Chaos Engineering (sometimes known as Fault Injection Testing, or FIT) is an SRE practice for experimenting with a system and determining how reliable it is. This is done by introducing faults of various kinds at various points in the system. We will cover this topic in greater detail later in this series. The FMEA worksheet, which captures the failure points and the failure modes, can be used as an input when preparing the chaos scenarios. One caveat is that not every failure can be manually introduced into the system; this is even more true when the system runs in the public cloud and the majority of the selected services are PaaS. There are, however, ways to work around these limitations.
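The hand-off from worksheet to chaos scenarios can be sketched as a simple filter: each failure mode that the team is able to inject becomes a scenario whose hypothesis is the expected system response, and the rest are set aside for a workaround. Everything in this sketch is hypothetical, including the class names and the `injectable` flags. The flag values are placeholders chosen for the example and say nothing about what Azure actually allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    short_name: str
    expected_response: str
    injectable: bool  # assumption: decided by the team for its own platform


@dataclass(frozen=True)
class ChaosScenario:
    name: str
    hypothesis: str  # what the FMEA worksheet says the system should do


def plan_scenarios(modes):
    """Split failure modes into runnable scenarios and skipped short names."""
    scenarios, skipped = [], []
    for mode in modes:
        if mode.injectable:
            scenarios.append(
                ChaosScenario(name=f"Inject: {mode.short_name}",
                              hypothesis=mode.expected_response))
        else:
            skipped.append(mode.short_name)
    return scenarios, skipped


# Placeholder flags, for illustration only.
modes = [
    FailureMode("Existence: Public App-GW",
                "System attempts to failover to the secondary site", True),
    FailureMode("Degradation: Scale-out failure",
                "System would start returning error responses", False),
]

scenarios, skipped = plan_scenarios(modes)
print([s.name for s in scenarios], skipped)
```

Keeping the skipped list visible, instead of dropping those modes, preserves a record of the failures that still need another way of being exercised.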
Another important point (which we will also discuss later) is that the FMEA worksheet created during the design phase and the chaos experiment results obtained during the testing/experimentation phase are meant to complement each other.