The post-mortem summary of Microsoft’s most recent Azure outage is now available online (see Related Links below) and makes for interesting reading. To summarise, the outage was caused by an unanticipated code error, but it was exacerbated by an accelerated rollout performed by an employee who hadn’t adequately assessed the risks of pushing the change out so quickly. I imagine that poor chap probably no longer works for Microsoft.
So what conclusions can we draw from this disclosure?
- Human intervention (and error) will always be a factor. Just look at the airline industry for direction here; a large percentage of aircraft crashes are attributed to pilot error. However, pilots are heavily trained and go through simulation practice to reduce the risk that they will make mistakes. I wonder if Microsoft and the other cloud providers can say the same of their staff?
- More control is being taken away from humans. The Azure environment is being changed to prevent this kind of rollout in the future and to enforce more conservative deployment policies. On the one hand this seems like a positive, but what happens when humans need to override this level of control? If another scenario occurs that requires the accelerated deployment of a change, could that be done easily?
The Architect’s View
It’s good to see that Microsoft does appear to have sensible pre-testing of changes before they are deployed into the wider environment. However, human error will always be a factor, and even more so when the risk exposure is so large. This requires cloud providers to employ the best people (not the cheapest), who understand environments in detail and get the big picture, not just their small subsection. As features are added to cloud services, finding the right people is going to get increasingly hard.
Related Links
- Final Root Cause and Analysis Areas: Nov 18 Azure Storage Service Interruption (Microsoft Azure Blog, 17 December 2014)
Copyright (c) 2009-2018 – Post #DC07 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.