Part 1.3- Customer Questionnaire

Requirements gathering is a key step before any organization sets out to build a large system in the cloud. When enterprises work with Cloud Solution Providers (CSPs), the cloud solution architects and/or consultants spend a fair amount of time gathering the requirements end to end. The clearer the requirements, the easier the implementation, and the lower the risk of major rework caused by gaps in understanding or misinterpreted requirements.

In the Well-Architected Framework, Microsoft provides template questions that help assess a customer's reliability requirements. This post discusses some of these questions.

1. What is the acceptable downtime of the system, or what are the exact availability requirements? – Customers can begin by defining an acceptable downtime per month. Until the complexity of the system is known, it is unrealistic to determine what availability the system can achieve. The answer to this question should spell out the specifics of availability (SLI, SLO and error budget) and the recovery targets (RPO and RTO). Some sample numbers could be as follows. Note: This table does not provide an exhaustive list of all the requirement metrics that need to be known upfront.

| Reliability Metric | Requirement |
|---|---|
| SLO - Availability of the system | 99.9% |
| SLO - User-serving website response times | <500 ms in all the critical user flows |
| RPO (Recovery Point Objective) | At most 10 minutes of data lost from the primary DB |
| RTO (Recovery Time Objective) | At most 15 minutes until the system is up and ready to serve customer requests |
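
To make the availability SLO concrete, it helps to translate the percentage into an error budget, that is, the downtime the target still allows. The snippet below is a minimal sketch of that arithmetic; the function name and the 30-day month are assumptions made for illustration and do not come from the Well-Architected Framework.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the error budget in minutes for an availability SLO.

    Assumes the SLO is measured over a period of `period_days` days
    (30 by default, an illustrative choice for "per month").
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


# Compare a few common availability targets over a 30-day period.
for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% availability allows about {minutes:.1f} minutes of downtime")
```

With the sample SLO of 99.9%, the budget works out to roughly 43 minutes per 30 days, which is a useful figure to put in front of the customer when asking whether the target is strict enough.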

2. How much does a potential downtime cost the business? – There is no one-size-fits-all strategy for determining the loss a business would incur if its system(s) go down. Only the business can derive this value, based on many factors.

The most difficult part of accurately calculating your cost of IT downtime is deciding on the percentage impact and estimating what is referred to as intangible costs, such as the cost of a damaged reputation

src: https://www.datafoundry.com/blog/how-to-calculate-the-true-cost-of-downtime

Additional Costs for IT Downtime

These aren’t the only costs to consider when calculating the total cost of downtime. There may be recovery costs, such as the cost of employees working overtime, the cost of repairing devices or systems, and data recovery costs. For some businesses, downtime may negatively affect their supply chain, causing delays and fees

src: https://www.datafoundry.com/blog/how-to-calculate-the-true-cost-of-downtime

The following formulas can be used to obtain a ballpark estimate for labor costs and revenue loss per hour of downtime:

Productivity cost = E x % x C x H

  • E = number of employees affected
  • % = percentage they are affected
  • C = average cost of employees per hour
  • H = number of downtime hours

Revenue loss = (GR/TH) x I x H

  • GR = gross annual revenue
  • TH = total annual business hours
  • I = percentage impact
  • H = hours of downtime
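
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, percentages are passed as fractions (0.5 for 50%), and every input number is made up for the example rather than taken from a real business.

```python
def productivity_cost(employees: int, pct_affected: float,
                      hourly_cost: float, hours: float) -> float:
    """Productivity cost = E x % x C x H."""
    return employees * pct_affected * hourly_cost * hours


def revenue_loss(gross_annual_revenue: float, annual_business_hours: float,
                 pct_impact: float, hours: float) -> float:
    """Revenue loss = (GR / TH) x I x H."""
    return (gross_annual_revenue / annual_business_hours) * pct_impact * hours


# Hypothetical inputs for a 2-hour outage.
labor = productivity_cost(employees=200, pct_affected=0.5,
                          hourly_cost=40.0, hours=2)
revenue = revenue_loss(gross_annual_revenue=10_000_000,
                       annual_business_hours=2_000,
                       pct_impact=0.8, hours=2)

print(f"Productivity cost: {labor:,.2f}")
print(f"Revenue loss:      {revenue:,.2f}")
print(f"Ballpark total:    {labor + revenue:,.2f}")
```

The total is only a ballpark figure. It leaves out the recovery costs and intangible costs mentioned in the quotes above, which the business has to estimate separately.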

3. How much do you invest in making your application highly available

If this is a green-field environment, or if the customer needs help understanding the investment required to make their systems highly available, this figure has to be derived from the design choices. One approach is as follows:

Use the Azure pricing calculator to create two sets of estimates:
A) One with all the services in a highly available mode (for high availability) and with geo-redundant backups and geo-replicated data (for cross-region recovery).
B) Another with cost-constrained decisions in places where there is some amount of tolerance.

Examples:

  1. If the RPO and RTO targets are fairly relaxed, an active-passive deployment model can be used to bring the system up after a disaster in the primary region, and the databases can be restored from backups rather than kept in sync through constant data replication.
  2. If specific applications in a big system can tolerate a longer downtime, then these apps need not be made candidates for a multi-region deployment.

(A - B) gives a rough estimate of the additional cost the organization would need to invest to make its application(s) highly available.

4. What is the risk versus the cost

This is one of the most important questions for helping a business make decisions. A simple and straightforward way to calculate it is:

(Loss from a downtime - Cost to make the system highly available) i.e. (2)-(3)
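
As a rough sketch, questions 3 and 4 can be combined into a single calculation. The code below assumes that both figures are expressed over the same period (one year here), which the comparison needs in order to be meaningful. The function names and all numbers are hypothetical and only illustrate the arithmetic.

```python
def ha_investment(estimate_a: float, estimate_b: float) -> float:
    """Extra cost of high availability: estimate A minus estimate B."""
    return estimate_a - estimate_b


def risk_versus_cost(downtime_loss: float, ha_cost: float) -> float:
    """Loss from downtime minus the cost of high availability, i.e. (2) - (3)."""
    return downtime_loss - ha_cost


# Hypothetical monthly pricing-calculator estimates, annualized.
annual_a = 12 * 12_000   # fully highly available, geo-redundant design
annual_b = 12 * 7_000    # cost-constrained design
ha_cost = ha_investment(annual_a, annual_b)

# Hypothetical loss: 16,000 per hour of downtime, 10 hours expected per year.
annual_loss = 16_000 * 10

net = risk_versus_cost(annual_loss, ha_cost)
print(f"HA investment per year:  {ha_cost:,}")
print(f"Downtime loss per year:  {annual_loss:,}")
print(f"Risk versus cost:        {net:,}")
```

A positive result suggests that the expected loss from downtime outweighs the investment in high availability. A negative result suggests the cost-constrained design may be acceptable for that workload.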

References

  1. Azure Well-Architected Framework – Target and Non-functional Requirements – https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/design-requirements