Update: The day after this post was released, VMware issued a clarification post on the Virtual Blocks website.  The post pretty much summarises what is written here, with a little more detail, although it doesn’t look like erasure coding in VSAN will be extended past the 3+1 and 4+2 schemes.  You can find the post here or listed in the Further Reading section at the bottom of this post.

Update 2: I’ve also added in a reference provided by Rob Peglar, which describes an earlier paper by Peter Anvin on RAID-6 calculations.

In case you didn’t notice, VMware has just released version 6.2 (the 4th version) of Virtual SAN, also known as VSAN.  One of the new features is data protection through erasure coding rather than mirroring of data.  Rather confusingly, the regurgitated press releases and many of the official blogs seem to be describing the new data protection regime as both RAID-5/6 and erasure coding.  Many of the product screenshots show this too.  So what is VMware offering here: is it RAID or erasure coding?


For the sake of completeness, let’s qualify what the two technologies are.  RAID-5 (and by extension RAID-6) is a protection method that distributes data and parity across multiple HDDs or SSDs.  Historically we’ve seen both hardware and software implementations in storage systems.

Both data and parity are stored in stripes, which could be anything from 64KB upwards (per disk) in size.  Data components are simply that: the actual data being stored.  The parity (for RAID-5) is calculated using the XOR logical function, and allows any single failed data component to be recreated from the remaining data components and the parity.  RAID-5 requires a minimum of three drives (two data and one parity), although typical configurations use 3+1 (three data, one parity) or 7+1 (seven data, one parity).
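To make the XOR mechanism concrete, here’s a minimal Python sketch (using hypothetical two-byte strips for illustration; real stripes are 64KB or more per disk).  The parity strip is the byte-wise XOR of the data strips, and rebuilding a lost strip is simply a matter of XOR-ing the parity with the survivors:

```python
from functools import reduce

def xor_parity(strips):
    # The parity strip is the byte-wise XOR of all data strips.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

def rebuild(survivors, parity_strip):
    # XOR-ing the parity with the surviving strips cancels them out,
    # leaving exactly the missing strip.
    return xor_parity(survivors + [parity_strip])
```

With three data strips, losing any one of them leaves enough information (two survivors plus parity) to recreate it exactly.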

RAID-6 calculations are slightly more complex than simple XOR instructions.  This is because a single XOR parity provides only one equation, which isn’t enough to recover two unknown (failed) components; a second, independent parity calculation is needed.  There are various solutions to this, including using Reed-Solomon encoding, or in the case of NetApp, implementing Diagonal Parity (hence RAID-DP).
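For illustration, here’s a sketch of the classic RAID-6 “P+Q” scheme in Python, assuming the common choice of generator g=2 over GF(2^8) with the 0x11d reduction polynomial (the scheme used, for example, by the Linux md driver).  P is the familiar RAID-5 XOR parity; Q weights each strip by a power of g, giving the second, independent equation:

```python
from functools import reduce

def gf_mul(a, b):
    # Carry-less "Russian peasant" multiply in GF(2^8), reduced mod 0x11d.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    # a^254 = a^-1, since the multiplicative group has order 255.
    return gf_pow(a, 254)

def pq_parity(data):
    # P is plain XOR (as in RAID-5); Q weights strip i by g^i.
    p = reduce(lambda x, y: x ^ y, data)
    q = reduce(lambda x, y: x ^ y,
               (gf_mul(gf_pow(2, i), d) for i, d in enumerate(data)))
    return p, q

def rebuild_from_q(data, q, j):
    # Recover strip j from Q and the surviving strips:
    # D_j = (Q + sum_{i != j} g^i * D_i) * g^-j  (addition in GF(2^8) is XOR)
    acc = q
    for i, d in enumerate(data):
        if i != j:
            acc ^= gf_mul(gf_pow(2, i), d)
    return gf_mul(acc, gf_inv(gf_pow(2, j)))
```

Two independent equations (P and Q) mean two simultaneous failures can be solved for, which a single XOR parity cannot do.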

With RAID, read performance is good; you simply read the data directly, with no I/O penalty.  However, when writing to disk, parity information has to be updated; RAID-5 requires reading the existing data (if the data to be updated is smaller than the stripe size), reading the parity, writing the new data and writing the new parity.  So, four physical I/Os are required for each logical write I/O.  RAID-6 has an overhead of six physical I/Os for each logical write I/O, although the overhead is implementation dependent; for example, RAID-DP has a lower overhead because data is always written to a new location (using WAFL) rather than needing to be pre-read.
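The read-modify-write sequence can be sketched as follows (with hypothetical single-element “disks” for illustration).  The key point is that the new parity is derived from the old data and old parity, so both must be read before anything is written:

```python
def raid5_small_write(data_disk, parity_disk, addr, new_data):
    # Sketch of the RAID-5 small-write penalty: one logical write
    # costs four physical I/Os.
    ios = []
    old_data = data_disk[addr];     ios.append("read old data")     # 1
    old_parity = parity_disk[addr]; ios.append("read old parity")   # 2
    data_disk[addr] = new_data;     ios.append("write new data")    # 3
    # XOR the old data out of the parity, XOR the new data in.
    parity_disk[addr] = old_parity ^ old_data ^ new_data
    ios.append("write new parity")                                  # 4
    return ios
```

The XOR trick (old parity ^ old data ^ new data) is what avoids having to re-read every other strip in the stripe, but the four-I/O cost per small write remains.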

So in summary, RAID-5/6 results in a capacity overhead based on the amount of parity data (3+1 protection = 33% overhead, 7+1 protection = 14.3%).  Read I/O sees no I/O overhead and uses all available drives.  Write I/O sees significant I/O overhead, depending on the implementation.

Erasure Coding

Erasure coding is also a process of creating redundant or parity data from the original source information, in order to facilitate the restore of any missing components.  However, the process differs: a mathematical algorithm transforms the original data into a larger set of data with the redundancy built in.  This is typically expressed as dividing the data to be encoded into k components, from which n pieces are generated (n>k), with the property that any k of the n pieces can be used to reconstitute the original data.
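As an illustration of the “any k of n” property, here’s a toy, systematic Reed-Solomon-style code over the prime field GF(257) (chosen purely for readability; real implementations work over GF(2^8), often with SIMD acceleration).  The k data pieces are treated as the values of a polynomial of degree less than k, the parity pieces are further evaluations of that polynomial, and any k surviving pieces recover the rest by Lagrange interpolation:

```python
P = 257  # a prime, so arithmetic mod P forms a field

def interpolate(points, x):
    # Evaluate the unique degree < k polynomial through `points`
    # at position x, working modulo the prime P.
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, n):
    # Systematic encoding: data pieces sit at x = 1..k unchanged,
    # parity pieces are polynomial evaluations at x = k+1..n.
    k = len(data)
    pieces = list(enumerate(data, start=1))
    for x in range(k + 1, n + 1):
        pieces.append((x, interpolate(pieces[:k], x)))
    return pieces

def recover(surviving_k_pieces, x):
    # Any k surviving (position, value) pieces determine the polynomial,
    # so the piece at any position x can be reconstructed.
    return interpolate(surviving_k_pieces, x)
```

With k=3 and n=5 this is a 3+2 scheme: any two of the five pieces can be lost and the data still reconstituted, which is the general property the RAID-5/6 special cases inherit.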

Erasure coding is more computationally expensive than simple RAID and therefore has a potential impact on system performance.  Depending on the coding scheme and the specific pieces of data read, both read and write I/O can incur a compute penalty compared to traditional RAID-5/6.  However, there are some special cases.  When n-k=1 (i.e. a single parity disk, as in RAID-5), the transformation can be achieved using the simple XOR instructions available in today’s Intel processors.  The same applies for n-k=2, or two parity disks (aka RAID-6 by VMware); the additional calculations required here can also be catered for with Intel instruction set extensions like SSE and AVX.  For more background information on this, check out James Plank’s paper listed at the end of this post in Further Reading.

VSAN Implementation

So where does that leave us?  Well, my assumption is that VMware are implementing the two special simplified cases of erasure coding just discussed.  This allows them to overcome most of the performance penalties that would be associated with extended erasure coding.  However, calling them RAID-5 & RAID-6 may cause confusion for traditional storage administrators, while offering a degree of clarity for those not 100% familiar with array-based data protection.  Remember also that these implementations are network RAID (the data is distributed across nodes for protection) and that they are only available with all-flash configurations, presumably for performance reasons.

The Architect’s View

Note also that erasure coding as a protection mechanism is only available in this release of VSAN for FTT=1 & FTT=2 (the n-k=1,2 cases); above that you’re back to mirroring (see Chad’s blog for confirmation).  This raises the question of how RAID-5/6 will be implemented with uneven node counts (e.g. 5, 7 and upwards).  Will the data be cycled round or evenly distributed (partially) across those nodes?  Will n-k>2 be supported in the future?  If so, what impacts on performance will there be?  The benefit of erasure coding is the ability to create protection against many failure scenarios and to allow protection with many varied configurations (e.g. both disk and node).  However, if there’s a performance issue going above two parity drives and this won’t ever be supported, is there any point in calling it erasure coding?  Once the dust settles on the announcements, it will be interesting to get into the detail of exactly how RAID-5/6 (erasure coding) has been implemented by VMware, to see if the restrictions I’ve highlighted are, in fact, true, and, of course, to understand exactly how the protection scheme will be extended in the future.

Further Reading

Comments are always welcome; please read our Comments Policy first.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2016 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.


Written by Chris Evans

  • John Martin

    I’m going to be pedantic and say that anything based on Reed-Solomon is unarguably erasure coding – to paraphrase http://web.eecs.utk.edu/~mbeck/classes/cs560/560/notes/Erasure/2004-ICL.pdf “Reed Solomon is the canonical erasure code”. More arguably, any parity-based RAID is also technically erasure coding. I think this is what you were getting at when you said “the two special cases”. Technicalities aside, I don’t think it’s helpful for the industry to label traditional parity RAID as erasure coding. As Justin Warren (@jpwarren) tweeted recently, “Erasure coding to me means I can dial in my % of data loss risk *precisely*. Not the gross risk of dual parity raid.” Having said that, given the realities of IT marketing, it seems inevitable that many things will get called erasure coding. I liked what Howard Marks and Justin came up with, where they split erasure coding into two different categories – high level erasure coding for N+3 and simplified erasure coding for RAID-5/6. That way we can avoid abstruse theoretical arguments about whether something qualifies as erasure coding using math 99.9% don’t understand, and still talk about the benefits of the more advanced forms of erasure coding without being too bothered by the unfortunate realities of IT marketing.

    • Nothing wrong with being pedantic, John. I think we’re all saying the same thing. Probably more interesting, though, is how VSAN implements the simplified cases. Can the implementation efficiency be extended to the “dial-in” scenario for erasure coding? If VSAN is expected to be a scale-out product then it needs to be; however, it isn’t clear what impact that will have on performance.


      • John Martin

        I’ve also got a lot of questions around the performance impact of the new storage efficiency features in VSAN, but given that I work for a NetApp (should have disclosed that in my original comment, sorry), it would be hard to avoid a perception of vendor bias in my line of questioning.

        • John, thanks for providing that background on your employer! As for comments on storage efficiency/performance, I would go ahead and raise them; we could always turn them into a new post for further discussion.


          • John Martin

            The questions come in a few forms, but the main one comes down to CPU and, to a lesser extent, memory consumption. In the past VMware claimed:

            “Farronato says that vSAN will consume less than 10 percent of aggregate compute on the host nodes on which it runs, which works out to somewhere less than two virtual CPUs on a machine with two eight-core Xeon processors with threads turned off. The alternatives above consume something on the order of two to four times as much compute because they are running inside of virtual machine guests and not closer to the bare metal down in the hypervisor”

            Personally I was never really convinced the 2x to 4x difference was simply caused by running storage functions through a hypervisor, but based on feedback I’ve had, that number was pretty much right for the steady-state workloads which were well suited to vSAN. I suspect it would have been a lot higher during storage intensive tasks like VDI boot storms, or a heavy DSS workload, but without actual benchmark data that’s just theory crafting.

            For the new updates (which are currently reserved for all-flash configs) we’ve got some new items which are going to be more CPU hungry, and it would be nice to see that quantified, possibly using some 4-corners testing. That might seem unfair because those tests aren’t representative of any customer workload, but many people invest in all flash because they’re finding use cases where pushing the limits of storage performance more than justifies the investment. I’ve seen it a few times: all-flash storage significantly changes the way many people consume IT.

            I’ll grant that it’s entirely possible that with more CPU cores the new features still consume less than 10% of the aggregate CPU. There are some great implementations of distributed high speed erasure coding – in fact, the more I look at VSAN, the more I see Isilon 2.0 – and likewise compression can be very efficient. But when you start throwing in other stuff like SHA1-based dedupe calculations, hot block identification for movement between your cache and persistent tiers (it might be an all-flash config, but it’s still hybrid in its architecture), and QoS admission policies, the list of things that consume CPU and cause context switches all starts to add up.

            Many of these techniques are also found in dedicated storage architectures which perform really well with relatively little CPU under normal workloads, but I’ve also seen those features contribute to driving large highly engineered multi-core systems to their knees, and that’s without having to drive the majority of the back-end I/O over a network stack.

            There are good reasons why the ratio of compute and RAM to media in comparable scale-out shared-nothing storage architectures like Isilon and SolidFire is relatively high, with usually two or more dedicated multi-core CPUs per node. There’s also a good reason why they both recommend mirroring rather than erasure coding for random workloads.

            Maybe it’s just the storage engineer / vendor bias talking, but I know how CPU intensive advanced storage and data management features can be under heavy load, and it seems like VMware (and the entire HCI industry) has been busily sweeping that under the carpet with suggestions that it’s a non-issue because commodity CPU cores are cheap.

            On the other hand, maybe I’m just like those old cable driven steam excavator guys dissing hydraulics for being too lightweight to do any serious digging.

          • John

            John, to be clear, you can pick and choose between mirroring and RAID-5/RAID-6 on a per-object basis (so as granular as a virtual machine or even an individual VMDK). This is all managed through SPBM, so you can automate away the decisions, as well as change it on the fly if needed.

          • John Martin

            I know that, and it’s really cool to have that level of granularity and the ability to switch on the fly, but at this point the recommendation appears to be “suck it and see”. That approach is fine if you have an “agile” infrastructure methodology, or a small enough shop where the infrastructure dude and the application lady have lunch together every day and change control is something nobody has ever heard of, or if you really are running at web scale and have some really sophisticated closed-loop automation systems that automagically adjust the protection level based on available capacity and desired performance SLOs. But the old-school dinosaur part of me feels a lot more comfortable if I’ve got some reasonably solid engineering data to work with, and that doesn’t seem to be readily available.

          • John

            @John we’ve got a Whitepaper out on general guidance (which is based on work that the PE and engineering teams did).


            I understand an engineering led paper is on the way at some point here.

            Christo’s links to some of Plank’s papers on the CPU offload that is used are a good read on why R5/R6 just doesn’t use that much compute these days.

            Going back to your earlier post, VDI boot storms have not been a “thing” since VMware released CBRC, which takes most of the sting out of those using a host-local, memory-based deduped read cache (I think we did that back in 2012?).

            As far as overhead with the new features, the general guidance I’m seeing is ~5% more than before. Keep in mind that while we said “up to 10%”, in the field most people saw less (I was a customer and my clusters rarely hit 3% overhead).

            I had an interesting discussion with a very large service provider using your AFF product. They made the comment that they turned all the data reduction on, as it was cheaper to just buy an all-flash product (and take the marginal hit for data reduction overheads) than try to manage tiering and post-process data reduction by policy to not impact the performance of 10K drives (which ended up limiting scaling on the controllers).

            The economics of buying 10Ks just don’t make sense anymore, and a lot of customers buying all flash are not people who need 100K IOPS per host, but rather people who would be happy with 50K (or hell, 25K) but get better pricing and performance consistency than they got with their old 10K drives or tiered storage systems. Still, for the people who want to run TPC benchmarks and HANA, nothing stops you from turning the features off, and having granular control on the EC is a nice touch (vs the old days of LUN migrations and rebuilding RAID groups).

      • John

        Disclosure (VMware SABU employee)

        This post might clear some things up in regard to the use of XOR and Reed-Solomon (and gets into performance offload for them with instructions etc.). He also discusses the possibility of stripe size changes. Also read the comments.


  • Terry Tortoise

    There is a good article on erasure coding and high availability by Bill Highleyman at http://www.availabilitydigest.com. James (Jim) Plank maintains that erasure coding is often used as a generic term, covering FEC and other methods of recovery. References: Availability Digest, plus the extract below:

    ‘The next link is a presentation by James Plank entitled All About Erasure Codes. The link following it is a detailed paper on the same topic and represents a useful adjunct to the presentation.’