Responding to Operator Errors

Posted on: 15 May '17
Ananthanatarajan Muthusamy
Senior Technical Architect
Ananthanatarajan.Muthusamy@mindtree.com

Last March, Amazon Web Services' S3 storage service in the US-EAST-1 region was disrupted by an operator error[1]. A typo by an authorized operator, entered as input to one of their tools, was identified as the root cause of the outage. The outage lasted about four hours and brought down a number of large services[2]. Interestingly, their own status dashboard, which depends on that storage service, was also impacted (AWS Status Update).

We recognize Amazon as one of the top-notch, highly efficient cloud engineering organizations. They have published multiple case studies and best practices, refined through the way they operate the AWS Cloud. For this specific outage, Amazon published a detailed report and corrective actions.

We want to touch upon two aspects in this note:

  1. Operator-caused outages can be reduced (eliminating them entirely is a lofty, north-star goal)
  2. The cultural aspects of responding to operator-caused outages

Reducing Operator-Caused Outages

We have been providing production support for large cloud services for our enterprise and service provider customers. Critical operator tasks are always involved: mitigating an outage, performing routine maintenance, and deploying software to production. We have had outages caused by operator errors in these tasks, and using those learnings we developed a high-fidelity risk scoring method. With this score card, we identify high-risk SOPs and eliminate or automate them. The remaining high-risk SOP tasks that must be executed manually are carried out by experts under peer review.

With this approach, we have been able to reduce production outages caused by operator errors.
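As a rough illustration of the score card idea, the sketch below scores each SOP on a few risk factors and buckets it accordingly. The factors, weights and threshold are hypothetical examples chosen for illustration, not our actual scoring model.

# Hypothetical sketch of an SOP risk score card; factors, weights and the
# threshold are illustrative only.
from dataclasses import dataclass

@dataclass
class SOP:
    name: str
    blast_radius: int     # 1 = single host ... 5 = entire region
    reversibility: int    # 1 = instant rollback ... 5 = irreversible
    frequency: int        # 1 = rare ... 5 = executed daily
    can_automate: bool    # is end-to-end automation feasible today?

def risk_score(sop: SOP) -> int:
    # Weighted sum: wide blast radius and poor reversibility dominate.
    return 2 * sop.blast_radius + 2 * sop.reversibility + sop.frequency

def triage(sop: SOP, high_risk_threshold: int = 12) -> str:
    score = risk_score(sop)
    if score < high_risk_threshold:
        return "routine: standard execution"
    if sop.can_automate:
        return "high risk: eliminate or automate"
    return "high risk: manual execution by experts under peer review"

catalog = [
    SOP("drain-single-host", blast_radius=1, reversibility=1, frequency=5, can_automate=True),
    SOP("resize-storage-fleet", blast_radius=5, reversibility=4, frequency=2, can_automate=True),
    SOP("emergency-config-override", blast_radius=4, reversibility=3, frequency=1, can_automate=False),
]
for sop in catalog:
    print(f"{sop.name}: score={risk_score(sop)} -> {triage(sop)}")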

Responding to an Operator-Caused Outage

Despite our best efforts, we still need to be prepared to respond to an operator-caused outage. The immediate focus is to mitigate the outage and restore the service; once the service is restored, the focus shifts to the post-mortem exercise. We have seen cases of severe penalties being levied on the individuals or teams responsible for an outage. This punitive approach creates a sense of fear, resulting in failures being covered up.

We have adopted industry best practices and implemented blameless post-mortems. Instead of zooming in on the individual at fault, we shift the focus to the timeline of actual events and their impact, systemic vulnerabilities, and process gaps. Developing solutions based on these learnings and repair items has helped us reduce operator-caused outages substantially.
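As an illustration, a blameless post-mortem record can be modelled as a simple data structure. The sketch below uses hypothetical field names and example values; the point is that the timeline, impact, systemic vulnerabilities and repair items are first-class, and there is deliberately no field for naming a culprit.

# Hypothetical sketch of a blameless post-mortem record. There is no
# "who caused it" field: the structure captures what happened and when,
# the impact, systemic vulnerabilities, process gaps and repair items.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class TimelineEvent:
    timestamp: datetime
    observation: str        # an observable fact, e.g. "error rate crossed 5%"

@dataclass
class RepairItem:
    action: str
    owner_team: str         # owned by a team, not pinned on an individual
    due: datetime

@dataclass
class PostMortem:
    title: str
    customer_impact: str
    timeline: List[TimelineEvent] = field(default_factory=list)
    systemic_vulnerabilities: List[str] = field(default_factory=list)
    process_gaps: List[str] = field(default_factory=list)
    repair_items: List[RepairItem] = field(default_factory=list)

# Example with made-up values.
pm = PostMortem(
    title="Example: capacity removed by a mistyped command",
    customer_impact="Service degraded for several hours in one region",
)
pm.timeline.append(TimelineEvent(datetime(2017, 5, 1, 10, 5),
                                 "Operator command removed more capacity than intended"))
pm.systemic_vulnerabilities.append("Tool accepted an input outside safe operating limits")
pm.process_gaps.append("No peer review required for capacity-changing commands")
pm.repair_items.append(RepairItem("Add input validation to the capacity tool",
                                  "Tooling team", datetime(2017, 6, 1)))
print(pm.title, "-", len(pm.repair_items), "repair item(s)")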

One of our success stories is presented below. As part of our production support for a large cloud service, we are subject to an SLA penalty for outages caused by operators.

We used to have an outage almost every other month. By adopting the best practices detailed in this note, we have eliminated operator-caused outages for the last two years.

These are exciting times for operators, as more and more enterprise and government workloads move to the cloud. As operators, we remain prepared for outages and continue to work towards reducing the time to mitigate (TTM).

[1] https://aws.amazon.com/message/41926/
[2] http://blog.catchpoint.com/2017/03/01/aws-s3-outage-impact/