Part 2.2 – Configuring Observability

Observability & Monitoring

Observability refers to the instrumentation surrounding, or built into, a system that provides the information needed to determine the system’s state (or health). A system that is not observable cannot provide the data required to measure the SLIs and SLOs. There are two schools of thought around monitoring and observability; this article does a good job of discussing the key points of the comparison – https://thenewstack.io/monitoring-vs-observability-whats-the-difference/. In general, the two concepts are meant to complement each other.

You can’t monitor what you don’t understand or know, and as a result you won’t be able to deliver the level of service availability promised to the business.

– MS Cloud Adoption Framework (referenced at the end of the article)

Observability Setup

We will be discussing three important features of Azure Monitor that need to be set up once the system is deployed to the cloud.

Log Analytics Agent

The Log Analytics agent needs to be installed on the IaaS resources, specifically the virtual machines and the scale sets. The agent performs a wide range of operations, from collecting monitoring data from the guest operating system and workloads of virtual machines in Azure and hybrid environments to supporting the operations of Azure Security Center and Azure Sentinel. Refer to the following article for details on all the scenarios that require the Log Analytics agent – https://docs.microsoft.com/en-us/azure/azure-monitor/agents/agents-overview#log-analytics-agent
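The agent can be rolled out manually, through Azure Policy, or as a VM extension. The following is a minimal sketch, not from the original setup, of installing it as a VM extension with the Azure SDK for Python; the subscription, resource names, workspace credentials, extension version, and the exact SDK attribute names are assumptions to verify against your environment and SDK version.

```python
# Hedged sketch: install the Log Analytics (OMS/MMA) agent as a VM extension.
# All identifiers below are placeholders; attribute names follow recent
# track-2 azure-mgmt-compute releases (older releases use
# `virtual_machine_extension_type` instead of `type_properties_type`).
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import VirtualMachineExtension

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-web-tier"                   # placeholder
VM_NAME = "web-vm-01"                            # placeholder
WORKSPACE_ID = "<log-analytics-workspace-id>"    # placeholder
WORKSPACE_KEY = "<log-analytics-workspace-key>"  # placeholder

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# For a Linux VM the extension type is "OmsAgentForLinux";
# on Windows it would be "MicrosoftMonitoringAgent".
extension = VirtualMachineExtension(
    location="eastus",
    publisher="Microsoft.EnterpriseCloud.Monitoring",
    type_properties_type="OmsAgentForLinux",
    type_handler_version="1.13",
    auto_upgrade_minor_version=True,
    settings={"workspaceId": WORKSPACE_ID},
    protected_settings={"workspaceKey": WORKSPACE_KEY},
)

poller = compute.virtual_machine_extensions.begin_create_or_update(
    RESOURCE_GROUP, VM_NAME, "OmsAgentForLinux", extension
)
print(poller.result().provisioning_state)
```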

Application in Reliability Monitoring

The logs and metrics that the agent collects are sent to the Log Analytics workspace, where they can be analyzed to set up the necessary alerting and, where possible, automatic remediation. A typical example you may have come across is the scale-out and scale-in of a VMSS in response to a spike or dip in CPU percentage. How does this relate to the reliability of the system, though?

Example

Rule: If the CPU percentage in 60% of the VMSS instances crosses 80%, then scale out the set by 2 instances.

This rule represents one of the cases in the FMEA analysis of the VMSS:

Failure Mode: Web-Tier VM(s) Degradation
Failure Short Name: Degradation – VMs CPU or memory utilization is high
Failure Short Description: A selected set of VMs receives a high volume of traffic due to a) sticky sessions b) load-test scenarios from the same host (srcIP and srcPort)
Impact/Effect: Some of the customer requests might be affected: slow responses, errors, etc.
System Response: The system raises a diagnostics alert if configured, e.g. alert and email if instances have a CPU percentage above 70% for more than 15 minutes
Possible Remediation: Auto-scale the VMSS based on the resource metrics

If the metrics of the VM instances in the scale set have not been captured in the first place, then alerting and reacting to the threshold conditions with auto-remediation would not be possible. Taking the example specified in the Rule statement, if the scale-out of the VMSS had not been configured, we would be dealing with a potential outage.
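To make the rule concrete, here is a small, self-contained sketch (an illustration only; the actual rule lives in an Azure autoscale setting) that evaluates the scale-out condition against a set of per-instance CPU readings. The instance names, readings, and the print placeholder are hypothetical.

```python
# Hypothetical sketch: evaluate the example rule
# "if CPU > 80% on at least 60% of the VMSS instances, scale out by 2".
# The readings below stand in for metrics collected in the Log Analytics workspace.
from typing import Dict


def should_scale_out(cpu_by_instance: Dict[str, float],
                     cpu_threshold: float = 80.0,
                     instance_fraction: float = 0.6) -> bool:
    """Return True when enough instances are above the CPU threshold."""
    if not cpu_by_instance:
        return False
    hot = sum(1 for cpu in cpu_by_instance.values() if cpu > cpu_threshold)
    return hot / len(cpu_by_instance) >= instance_fraction


# Example readings: percent CPU averaged over the evaluation window (made up).
readings = {"vmss_0": 91.0, "vmss_1": 85.5, "vmss_2": 42.0,
            "vmss_3": 88.2, "vmss_4": 35.1}

if should_scale_out(readings):
    print("Scale out the VMSS by 2 instances")   # placeholder for the scale action
else:
    print("No scaling action required")
```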

It is not just the virtual machines and the scale sets that need to be monitored. As discussed in the FMEA article, every single component is a point of failure, even if it is not a single point of failure. This means we have to set up monitoring for every component in the system. For Azure managed services, including the Application Gateway, the data plane collects the metrics (time series of events) by default; we just have to configure the diagnostic settings to send the logs and metrics to a common Log Analytics workspace.
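As an illustration (not from the original setup), the sketch below creates such a diagnostic setting on an Application Gateway and routes its logs and metrics to a shared Log Analytics workspace using the Azure SDK for Python; the resource IDs, the setting name, and the log categories are assumptions about the environment.

```python
# Hedged sketch: send an Application Gateway's logs and metrics to a common
# Log Analytics workspace by creating a diagnostic setting.
# Resource IDs are placeholders; the log categories listed are the usual
# Application Gateway categories and may differ by SKU.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    DiagnosticSettingsResource,
    LogSettings,
    MetricSettings,
)

SUBSCRIPTION_ID = "<subscription-id>"
APPGW_ID = ("/subscriptions/<subscription-id>/resourceGroups/rg-web-tier/"
            "providers/Microsoft.Network/applicationGateways/agw-web")            # placeholder
WORKSPACE_ID = ("/subscriptions/<subscription-id>/resourceGroups/rg-monitoring/"
                "providers/Microsoft.OperationalInsights/workspaces/law-shared")  # placeholder

monitor = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

setting = DiagnosticSettingsResource(
    workspace_id=WORKSPACE_ID,
    logs=[
        LogSettings(category="ApplicationGatewayAccessLog", enabled=True),
        LogSettings(category="ApplicationGatewayPerformanceLog", enabled=True),
        LogSettings(category="ApplicationGatewayFirewallLog", enabled=True),
    ],
    metrics=[MetricSettings(category="AllMetrics", enabled=True)],
)

monitor.diagnostic_settings.create_or_update(
    resource_uri=APPGW_ID, name="send-to-law-shared", parameters=setting
)
```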

Azure Monitor Dependency Agent

The dependency agent collects information on the processes running on a virtual machine and on their dependencies. Azure Monitor uses the dependency information collected by the agent to construct the “Service Map”. We can use the service map to assess and validate the dependency information gathered during the design phase; any dependencies that were not accounted for need to be added to the dependency graph. Note: an increase in the number of dependencies of a particular service also increases the number of failure modes.

Following is a service map from the web server’s perspective
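The same connection data that drives the visual service map also lands in the workspace (the VM insights VMConnection table), so the observed dependencies can be listed with a query as well. Below is a minimal sketch using the azure-monitor-query package; the workspace ID, computer name, and column names are assumptions worth verifying against your schema.

```python
# Hedged sketch: list the outbound dependencies observed by the dependency agent
# by querying the VM insights connection data in the Log Analytics workspace.
# The workspace ID, computer name, and exact column names are assumptions.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-guid>"   # placeholder

QUERY = """
VMConnection
| where Computer == 'web-vm-01' and Direction == 'outbound'
| summarize Connections = count() by ProcessName, DestinationIp, DestinationPort
| order by Connections desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))

for table in response.tables:
    for row in table.rows:
        process, dest_ip, dest_port, connections = row[0], row[1], row[2], row[3]
        print(f"{process} -> {dest_ip}:{dest_port} ({connections} connections)")
```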

Azure Monitor Application Insights

Azure Application Insights is an application monitoring solution within the Azure Monitor suite. The service lets us understand the behavior and performance of web applications. Appropriately instrumented web applications send information on requests, dependency calls, custom events, page load times, etc. To run reliable systems in the cloud, we will be using the HTTP requests, dependency failures, user flows, usage impact, and availability features.
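For illustration only, here is a sketch of sending request and custom-event telemetry by hand with the legacy applicationinsights Python package; the instrumentation key and names are placeholders, and in practice most of this telemetry comes from framework auto-instrumentation (or the JavaScript snippet for page load times) rather than manual calls.

```python
# Hedged sketch: manually sending request and custom-event telemetry to
# Application Insights with the legacy `applicationinsights` package.
# The instrumentation key and event/request names are placeholders.
from applicationinsights import TelemetryClient

tc = TelemetryClient("<instrumentation-key>")   # placeholder key

# A custom event, e.g. a business action we want to follow in Usage / User Flows.
tc.track_event("order_submitted", properties={"tier": "web"})

# A server-side request record: name, url, success flag, then optional details.
tc.track_request("GET /api/orders", "https://contoso.example/api/orders",
                 True, duration=120, response_code=200, http_method="GET")

tc.flush()   # telemetry is batched; flush before the process exits
```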

HTTP Request Status

The key metric to be monitored in a REST-based web application is the number of failed HTTP requests. This corresponds directly to the availability formula that we have used to calculate the uptime of the system (number of successful requests / total number of requests). The data collected from the request metrics (we are interested only in the request status in this case) is used to calculate the availability SLI. Once the application has been running for 3–4 weeks, the data collected is usually sufficient to evaluate the availability SLO. We wouldn’t let the application run for the full rolling-window duration (an SRE concept) if the SLI doesn’t look promising.
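The following is a small sketch of turning that request status data into the availability SLI; the table and column names (AppRequests, Success) assume a workspace-based Application Insights resource, and the workspace ID and SLO target are placeholders.

```python
# Hedged sketch: compute the availability SLI (successful requests / total requests)
# from Application Insights request telemetry stored in a Log Analytics workspace.
# Table/column names and the workspace ID are assumptions about the environment.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-guid>"   # placeholder
AVAILABILITY_SLO = 0.999                          # example target, not from the article

QUERY = """
AppRequests
| summarize Total = count(), Successful = countif(Success == true)
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=28))

row = result.tables[0].rows[0]
total, successful = row[0], row[1]
sli = successful / total if total else 1.0

print(f"Availability SLI over the window: {sli:.5f} (target SLO {AVAILABILITY_SLO})")
print("SLO met" if sli >= AVAILABILITY_SLO
      else "SLO at risk - investigate before waiting out the full rolling window")
```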

Dependency Failures

Just as we monitored the resource metrics and set up alerting and remediation mechanisms, we will monitor the application’s dependency failures. To be specific, we will create a rule as follows:

Rule: If the number of failures on the dependency SQL Always-On AG exceeds 5 in a time span of 5 minutes, then create an alert of severity-2.

This rule creates sev-2 alerts so that the SRE team can react quickly and analyze the issue as early as possible. If the system does not self-heal and the issue takes time to resolve, we will be using up the error budget.
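For illustration, the sketch below mirrors the evaluation behind that rule; the dependency target name, the AppDependencies schema, and the raise_alert stub are assumptions, and in the actual deployment this would be a scheduled log search alert rule wired to an action group rather than hand-rolled code.

```python
# Hedged sketch: the logic behind the example sev-2 rule:
# "more than 5 failures on the SQL Always-On AG dependency within 5 minutes".
# The query, the target name, and the raise_alert stub are illustrative assumptions.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-guid>"   # placeholder
FAILURE_THRESHOLD = 5
WINDOW = timedelta(minutes=5)

QUERY = """
AppDependencies
| where Target == 'sql-alwayson-ag'   // assumed dependency target name
| where Success == false
| summarize Failures = count()
"""


def raise_alert(severity: int, message: str) -> None:
    """Placeholder for the paging / ITSM integration (e.g. an action group)."""
    print(f"SEV-{severity}: {message}")


client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=WINDOW)
failures = result.tables[0].rows[0][0]

if failures > FAILURE_THRESHOLD:
    raise_alert(2, f"{failures} SQL AG dependency failures in the last 5 minutes")
```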

Note:

  • The dependency failures of the application would have already caused the corresponding HTTP requests to fail, and those failures would have been captured in the “Request Metrics” as explained in the previous section. Monitoring the dependency failures is therefore a way to improve system reliability by addressing issues as soon as they happen, rather than a direct metric for calculating availability.
  • Since known dependency failures can be deliberately introduced into the system, they can also be used as part of chaos engineering practices.

Note: This is one of the many monitoring configurations that we will use in the real deployment.

References

  1. Overview of Azure Monitor Agents – https://docs.microsoft.com/en-us/azure/azure-monitor/agents/agents-overview
  2. Dependency Failures tracking in App Insights – https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-dependencies#failed-requests
  3. Exception Tracking in App Insights – https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-exceptions
  4. Observability as explained in Cloud Adoption Framework- https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/manage/monitor/observability
  5. Rolling Window for SLO Calculation – https://sre.google/workbook/implementing-slos/#none