Following hot on the heels of NetApp and their SPC-1 benchmark figures for the EF560 all-flash array, HDS have announced that the VSP G1000 has delivered the highest SPC-1 benchmark result to date, as highlighted in the graph shown here (borrowed from Hu Yoshida’s blog post on the news).  Links to the various SPC-1 results are shown at the end of this article.

At a shade over $2m and almost twice the $/SPC-1 IOPS (price-performance) of the NetApp solution, this test (and others like it) may seem out of reach for many customers and unreflective of real-world situations.  In fact, I believe the benefits of these kinds of tests are much more subtle than that and are widely misunderstood, especially by people new to the industry.  Here’s why.
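For anyone unfamiliar with the metric, SPC-1 price-performance is simply the total tested system price divided by the SPC-1 IOPS achieved.  Here’s a rough sketch in Python, using the rounded figures quoted in this article rather than the audited numbers from the full disclosure reports:

    # Rough sketch of the SPC-1 price-performance metric ($/SPC-1 IOPS).
    # Figures are the rounded values quoted in this article, not the audited
    # numbers from the full disclosure reports.
    def price_performance(total_price_usd, spc1_iops):
        """Return dollars per SPC-1 IOPS."""
        return total_price_usd / spc1_iops

    # HDS VSP G1000: "a shade over $2m" for roughly 2,000,000 SPC-1 IOPS
    print(f"${price_performance(2_000_000, 2_000_000):.2f} per SPC-1 IOPS")  # ~$1.00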

Operational Efficiency

Pretty much all technology is designed to be operated below its maximum rated tolerances.  That extra “gas in the tank” ensures performance or throughput is there when we need it, whereas the stress of running at 100% all the time would quickly break many mechanical devices.  There are plenty of cases in point: power supplies and fans in PCs; the daily commute (how many of us, me included, own a car capable of going far faster than we ever drive it?); or the twin-engine aeroplane, which must be able to stay airborne on a single engine.  So why do we buy and pay for this over-capacity?

Similar logic applies to our storage devices.  If we run an array at 100% of its rated IOPS, what happens when a disk drive fails and data has to be rebuilt?  Quite simply, host-level I/O degrades and application performance suffers as a result.  Why are Storage Area Networks (SANs) designed to run at no more than 50% of available bandwidth?  Because the failure of one path then has no impact on throughput (you could argue for 4 or 8 paths, each needing less headroom, but of course you’d be deploying extra hardware anyway).
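As a rough illustration of the path maths (a simple model, not a vendor sizing rule): to keep full throughput after losing one path, each path can only be loaded to (n-1)/n of its bandwidth, which is where the 50% figure for dual-pathed designs comes from.

    # Simple model: the highest per-path utilisation that still carries the
    # full workload after a given number of path failures.
    def max_safe_utilisation(paths, failures_tolerated=1):
        return (paths - failures_tolerated) / paths

    for n in (2, 4, 8):
        print(f"{n} paths: load each to no more than {max_safe_utilisation(n):.0%}")
    # 2 paths: 50%, 4 paths: 75%, 8 paths: 88% (87.5% before rounding)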

So that extra capacity, or to put it another way, running at less than 100%, is a design feature that builds reliability into our technology.

Viewing VSP G1000 Results In Context

So where does that leave us in terms of how to interpret the HDS VSP G1000 test results?  From the graph we can see that the G1000 data scales pretty much linearly as IOPS increase (at least up to 1,000,000).  From there the curve rises gently, with only a slight up-tick at 2,000,000.  Compare this to the “hockey stick” shape of the other results shown.  In addition, the HDS figures show consistently low latency, a key point I raised in the previous post.

What the graph indicates is that HDS’s solution is more likely to deliver consistent latency, even if 20% of the IOPS capacity had to be dedicated to recovering from a disk (or FMD) failure.  That’s an unlikely scenario, but it has to be built into an architecture design, especially if you’re a bank, credit card provider, online retailer, online casino, real-time trading platform or any other business where an increase in latency translates directly into lost revenue.
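To put a number on that, here’s a hypothetical sizing sketch using the 20% rebuild figure above and the 2,000,000 IOPS headline result (my own illustration, not anything taken from the SPC-1 report):

    # Hypothetical sizing sketch: host I/O headroom remaining if a slice of the
    # benchmarked IOPS is reserved for rebuild activity after a drive failure.
    def usable_iops(rated_iops, rebuild_fraction):
        return rated_iops * (1 - rebuild_fraction)

    print(f"{usable_iops(2_000_000, 0.20):,.0f} IOPS left for host I/O during a rebuild")
    # 1,600,000 IOPS left for host I/O during a rebuild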

Two Million VSAN IOPS

Here’s where we get to the point about understanding in the industry.  This month VMware launched vSphere 6.0, which included a demonstration of VMware Virtual SAN achieving a 2 million IOPS benchmark.  You can find details of the configuration and results here.  Reaching the 2m IOPS mark took 32 server nodes in a cluster, with a total of 512 Xeon cores, 4TB of DRAM, 246TB of disk capacity and 12.8TB of flash.  The test ran only 32 VMs and achieved the 2m IOPS with 100% read I/O.  No latency figures were quoted for this test.
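It’s worth breaking that configuration down per node and per VM.  This is simple division of the published figures, assuming the resources were spread evenly across the cluster:

    # Per-node and per-VM breakdown of the published VSAN test configuration,
    # assuming resources were distributed evenly across the 32-node cluster.
    nodes, vms, total_iops = 32, 32, 2_000_000
    cores, dram_tb, disk_tb, flash_tb = 512, 4, 246, 12.8

    print(f"Per node: {cores // nodes} cores, {dram_tb * 1024 // nodes} GB DRAM, "
          f"{disk_tb / nodes:.1f} TB disk, {flash_tb / nodes:.1f} TB flash")
    print(f"Per VM:   {total_iops // vms:,} read IOPS")
    # Per node: 16 cores, 128 GB DRAM, 7.7 TB disk, 0.4 TB flash
    # Per VM:   62,500 read IOPS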

When the test was repeated with a 70%/30% read/write mixed workload, the results only scaled to 640,000 IOPS.  This time latency figures were quoted, showing an average of 3ms (far higher than most of the SPC-1 results).  What neither test shows is a realistic workload on the VSAN cluster (32 VMs is hardly representative), performance in failure mode (i.e. disk rebuilds, SSD failures) or whether latency scaled in a similar way to IOPS.  There’s also no mention of the cost of the configuration.

I consistently see comments claiming that external storage adds too much latency for today’s applications and that low latency can only be achieved with converged solutions like VSAN.  The SPC-1 and VMware tests show this is simply not true.  That’s not to take anything away from the VSAN testing; achieving 2m IOPS is commendable, even with the caveats in how it was achieved (the same applies to the SPC-1 tests, by the way).

The Architect’s View

The idea of SPC-1 is to provide some consistency in storage performance testing.  Although the total IOPS figure is the headline grabber, the more significant detail is how the hardware scales up to that maximum, especially in failure scenarios.  Storage solutions based on commodity servers can easily reach the 1m or 2m “magic” IOPS mark, but the question remains whether they can deliver that level of performance with consistently low latency – if that is a requirement of your application – including when rebuilding lost data or rebalancing a cluster.

How systems work in failure mode is vitally important, because at that point your business is directly affected.  Designing for failure isn’t just about adding resilient nodes; it’s about designing in headroom for failed capacity too.  If you are latency- and performance-sensitive rather than cost-sensitive, then solutions like the VSP will be a better choice for your business.

Related Links

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2018 – Post #9117 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.

Written by Chris Evans

With 30+ years in IT, Chris has worked on everything from mainframe to open platforms, Windows and more. During that time, he has focused on storage, developed software and even co-founded a music company in the late 1990s. These days it's all about analysis, advice and consultancy.