The post-mortem summary on Microsoft’s most recent Azure outage is available online here.  The content makes for interesting reading.  To summarise, the outage was caused by an unanticipated code error, but was exacerbated by an employee who accelerated the rollout without adequately assessing the risks of pushing the change out so quickly.  I imagine that poor chap probably no longer works for Microsoft.

So what conclusions can we draw from this disclosure?

  1. Human intervention (and error) will always be a factor. Just look at the airline industry for direction there; a large percentage of aircraft crashes are attributed to pilot error.  However, pilots are heavily trained and go through simulation practice to reduce the risk that they will make mistakes.  I wonder if Microsoft and the other cloud providers can say the same?
  2. More control is being taken away from humans.  The Azure environment is being changed to prevent this kind of rollout in the future and to enforce more conservative policies for deployments.  On the one hand this seems like a positive, but what happens when humans need to override this level of control?  If another scenario occurs that requires the accelerated deployment of a change, could that be done easily?

The Architect’s View

It’s good to see that Microsoft does appear to have sensible pre-testing of changes before they are deployed into the wider environment.  However, human error will always be a factor, and even more so when the risk exposure is so large.  This means cloud providers need to employ the best people (not the cheapest), who understand environments in detail and see the big picture, not just their small subsection.  As features are added to cloud services, finding the right people is going to get increasingly hard.

Copyright (c) 2009-2018 – Post #DC07 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.

Written by Chris Evans

With 30+ years in IT, Chris has worked on everything from mainframe to open platforms, Windows and more. During that time, he has focused on storage, developed software and even co-founded a music company in the late 1990s. These days it's all about analysis, advice and consultancy.