News has emerged about a potential upgrade issue with the latest version of O/S code for the EMC XtremIO all-flash platform.  As discussed in this blog post, the upgrade to XIOS 3.0 is not simply disruptive but destructive.  According to Chris Mellor’s post, the destructive upgrade is needed to change the on-disk de-duplication block size from 4K to 8K so that metadata can be managed efficiently.
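To see why block size matters for metadata, here is a rough back-of-the-envelope sketch.  It assumes one metadata entry per fixed-size block, which is my simplification rather than anything EMC has published about XIOS internals, and the 10 TiB capacity is a made-up figure for illustration only.

```python
def metadata_entries(capacity_bytes: int, block_size_bytes: int) -> int:
    """Count the fixed-size blocks (and so per-block metadata entries)
    needed to map a given capacity. Assumes one entry per block."""
    return -(-capacity_bytes // block_size_bytes)  # ceiling division


KIB = 1024
TIB = 1024 ** 4

# Hypothetical usable capacity; not an XtremIO specification.
capacity = 10 * TIB

entries_4k = metadata_entries(capacity, 4 * KIB)
entries_8k = metadata_entries(capacity, 8 * KIB)

print(f"4K blocks: {entries_4k:,} entries")
print(f"8K blocks: {entries_8k:,} entries")
print(f"ratio: {entries_4k / entries_8k:.1f}x")
```

Under that assumption, doubling the block size halves the number of entries to track.  The catch is that every block already written at 4K has to be re-laid out at 8K, which is consistent with an upgrade that cannot be done in place.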

This news, which appears to be confirmed through a blog post by Chad Sakac, is of course a PR disaster for EMC.  Chris’ post points to a video that highlights the XtremIO platform supporting non-disruptive upgrades.  Clearly that is now no longer the case.  Chad makes a valiant attempt to divert attention away from the core issue and points out that other platforms in their time have required disruptive and potentially destructive upgrades too.

Chad is of course correct to highlight NetApp’s Data ONTAP platform “upgrade” from 7-mode to c-mode, which would be destructive.  However, as I’ve argued many times, this is a move from one product to a completely different one (albeit supported on the same hardware), so it’s no surprise there’s disruption.  NetApp chose to market two separate products as the same platform, which, for good or bad, isn’t the same argument as the issue with XtremIO.

Nor is the upgrade from VNX to VNX2 a fair comparison.  Chad himself calls this out as a hardware & software upgrade, which is a totally different scenario.  VNX2 was, by EMC’s definition, “rewritten” to cater for multi-core processor hardware and as a result was never designed to be backward compatible.  Customers were never intended to do upgrades in place, but to look at transition to VNX2 during hardware refresh.  In the case of XIOS 3.0, we are talking about an in-place upgrade on the same hardware.

Think of The Customer

Before discussing the technicalities, we should perhaps look at this issue from a customer perspective.  Why have customers invested in all-flash arrays?  Clearly they want to benefit from the high performance and low latency, which means they will have placed mission-critical applications on these platforms.  The $/GB metric is less valid in these scenarios as a measure of cost benefit; instead, all-flash will have been justified through the increased business benefits of faster application/transaction processing.  As a result, these platforms will be expected to run 24/7.

Now the IT department is faced with having to explain that an upgrade to the latest O/S version will be disruptive to the application, potentially incurring downtime.  Whether or not tools like Storage vMotion can be utilised, the storage team have work on their hands to move the data elsewhere and back again, all of which incurs time, risk and cost.  EMC is apparently offering assistance with “loaner” kit to help in the process.  Let’s face it, most customers don’t have a spare high-performance all-flash array sitting around specifically for these kinds of scenarios, and this data won’t fly on anything but an all-flash array.  Despite this, ultimately the customer is still paying for this botched XIOS upgrade process, because the cost of the loaner kit and any free consultancy will be loaded into customers’ purchases somewhere, making XtremIO more expensive than it need be.

Architecture Matters

Getting storage array architecture right from day 1 is not just important, it’s essential.  This means having a vision of what (future) features a platform will provide and designing with that in mind, even if the future implementations are two, three or five years ahead.  So far, there’s little to suggest that the XtremIO team have either the completeness of vision or the ability to execute that the recently announced Gartner MQ claims they have.  XtremIO’s position on the MQ is even more untenable when we consider the debacle over snapshot management discussed at Storage Field Day 5 earlier this year (watch the videos) and the fact that even now at XIOS 3.0, it appears there is no possibility of in-place hardware upgrades without disruption.

Going back to Chad Sakac’s comments on architectural changes that create disruption, we have to acknowledge that in non-scale-out architectures, at some stage a redesign is required and an in-place product upgrade isn’t possible.  However, these are generational changes that occur once every 5-10 years, not something that should occur when new features are implemented.  Chad advises that customers are welcome to stay on the 2.4 GA code, but of course that’s not a real option.  EMC won’t fork the 2.4 code into two versions with a 4K and 8K block size; customers will eventually have to upgrade to receive new features and, more critically, bug fixes.  So like it or not, XtremIO customers will have to take the pain at some stage.

Speaking of upgrades, I looked back at the version releases for XIOS.  Version 1.0 was GA on 14 November 2013.  Version 2.4 was GA in May 2014, with 3.0 GA in Q3 2014 (any time now).  In less than 12 months there have been three major releases of the O/S, and less than a year after launch (when you would expect some future vision was still possible), customers are being told they cannot upgrade in place, essentially because a major O/S rewrite has occurred.  This is hardly the release cadence that came with flagship products such as the original Symmetrix arrays.

Competition

What is the competition doing?  Pure Storage have been quick to respond; Vaughn Stewart has a good post with reference points towards the end of his article on why architecture matters, specifically with reference to Pure’s design.  Dave Wright, CEO of SolidFire, provided this quote and explained how 24/7 availability is a key feature of the SolidFire platform:

SolidFire has proven the advantages of a shared nothing architecture for enabling non-disruptive upgrades by offering 100% non-disruptive software AND hardware upgrades from Day 1 of GA over two years ago. No downtime, data migration, or planned outages have been required to upgrade software, firmware, or hardware across hundreds of upgrades in the field.

Read Dave’s full blog post here.

Vish Mulchand, Senior Director, Product Management at HP said of the 3PAR platform:

Storage customers have always demanded non-disruptive online upgrades. No one has time for downtime which is why modern Tier-1 resiliency requires that data access and service levels be maintained not only during software upgrades but also during failure recovery and maintenance. HP 3PAR StoreServ Storage is designed to be non-disruptively upgradeable and scalable. The system is architected for non-disruptive operations and has what we refer to as Persistent Technologies to prevent unnecessary downtime and to maintain service levels during planned as well as unplanned outage events. We have never required a data destructive upgrade in the last 12 years including when we introduced an All Flash Array.

So it seems architecturally, non-disruptive processes CAN be designed into modern storage array architectures.  A vision not shared by XtremIO.

The Architect’s View

The ultimate losers in this scenario are EMC’s customers, who will have to endure the pain and risk of a disruptive upgrade.  No customers, least of all enterprise customers, who are a conservative lot, want to put themselves into a situation that risks data loss, never mind the cost or time wasted.  I’ve been negative on the XtremIO architecture from the beginning, partly because of EMC’s wildly exaggerated claims for the platform, but mostly because there are other companies out there simply doing a better job.  As always, choosing the right storage platform comes back to customer requirements.  Sadly it’s only now that many current customers will be wishing EMC had been a bit more visionary and had the ability to execute on the promises that were offered when XtremIO was first sold to them.

Related Links

Comments are always welcome; please indicate if you work for a vendor as it’s only fair.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2017 – Post #A290 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.

Written by Chris Evans

With 30+ years in IT, Chris has worked on everything from mainframe to open platforms, Windows and more. During that time, he has focused on storage, developed software and even co-founded a music company in the late 1990s. These days it's all about analysis, advice and consultancy.