
Choosing Between Monolithic and Modular Architectures – Part II


This is part of a series of posts discussing storage array architectures; see Part I for the previous instalment.

In the first post, I discussed the shared storage model architectures typified by what we sometimes think of as Enterprise arrays, but which I've called monolithic. This term harks back to the mainframe days of large single computers (see the Wikipedia definition), hence its use to describe storage arrays with a large single cache. In the last 10 years we have seen a move away from the single shared cache to a distributed cache architecture built from multiple storage engines or nodes, each with independent processing capability but sharing a fast network interconnect. Probably the best-known implementations of this technology have come from 3Par (InServ), IBM (XIV) and EMC (VMAX). Let's have a look at these architectures in more detail.

EMC VMAX

The VMAX architecture consists of one to eight VMAX engines (storage nodes) connected together by what is described as the Virtual Matrix Architecture. Each engine acts as a storage array in its own right, with front-end host port connectivity, back-end disk directors, cache (which presumably is mirrored internally) and processors. The VMAX engines connect together using the Matrix Interface Board Enclosure (MIBE), which is duplicated for redundancy. The virtual matrix enables inter-engine memory access, which is required to provide connectivity when the host access port isn't on the same engine as the data. There are two diagrams in the gallery at the end of this post, one showing the logical view of the interconnected engines and the second showing how back-end disk enclosures are dedicated to each engine.
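To make that data path a little more concrete, here's a minimal Python sketch of the idea as I understand it: each engine owns the LUNs behind its dedicated disk enclosures, and a request arriving on a front-end port is served locally or forwarded across the interconnect when the data sits behind another engine. The class and method names (Engine, VirtualMatrix, read) are purely illustrative and nothing like EMC's actual implementation.

```python
# Toy model of the engine/interconnect layout described above.
# Names and structure are illustrative assumptions, not EMC's design.

class Engine:
    """One storage engine: front-end ports, local cache and dedicated back-end disks."""
    def __init__(self, engine_id, owned_luns):
        self.engine_id = engine_id
        self.owned_luns = set(owned_luns)   # LUNs whose disks sit behind this engine
        self.cache = {}                     # local (mirrored) cache, modelled as a dict

class VirtualMatrix:
    """Fabric joining the engines; used when host port and data are on different engines."""
    def __init__(self, engines):
        self.engines = {e.engine_id: e for e in engines}

    def read(self, host_port_engine, lun, block):
        local = self.engines[host_port_engine]
        if lun in local.owned_luns:
            return f"engine {host_port_engine}: local read of {lun}/{block}"
        # Data lives behind another engine: the request crosses the interconnect.
        owner = next(e for e in self.engines.values() if lun in e.owned_luns)
        return f"engine {host_port_engine}: remote read via matrix from engine {owner.engine_id}"

engines = [Engine(0, ["lun0", "lun1"]), Engine(1, ["lun2", "lun3"])]
matrix = VirtualMatrix(engines)
print(matrix.read(0, "lun2", 42))   # host attached to engine 0, data behind engine 1
```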

What's not clear from the documentation is how the virtual matrix architecture operates, other than being based on RapidIO. I'm not sure whether VMAX engines have direct access to the cache in other engines or whether the processors of the connected engines are required to mediate the transfer. In addition, can an engine access cache in another engine purely to manage throughput of its local host and disk connections? I'm not entirely sure.

3Par InServ

3Par storage arrays consist of multiple storage nodes joined through a high-speed interconnect, which they describe as their InSpire architecture. From 2 to 8 nodes are connected (in pairs) to a passive backplane with up to 1.6Gb/s of bandwidth between each node. 3Par use the diagram shown here to demonstrate their architecture, and with 8 nodes the number of connections can easily be seen. I've also shown how connectivity increases in 2, 4, 6 and 8 node implementations. InServ arrays write cache data in pairs, so each node has a partner. Should one node of a pair fail, the cache of the surviving partner is immediately written to another node (if one is present), so protecting the cache data.
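As a rough illustration of that cache protection behaviour (and emphatically not 3Par's actual code), the sketch below mirrors each cached write to a partner node and, on a node failure, re-mirrors the surviving copy to another healthy node. All names are hypothetical.

```python
# Illustrative sketch of paired write-cache protection with re-mirroring on failure.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.cache = {}        # dirty write-cache entries held by this node
        self.alive = True

def mirrored_write(key, data, node, partner):
    """Stage a write into cache on both nodes of the pair before acknowledging the host."""
    node.cache[key] = data
    partner.cache[key] = data
    return "ack"

def handle_node_failure(failed, partner, spare):
    """Re-protect the surviving cache copy by mirroring it to another node, if one exists."""
    failed.alive = False
    if spare is not None and spare.alive:
        spare.cache.update(partner.cache)

n0, n1, n2 = Node(0), Node(1), Node(2)
mirrored_write("blk42", b"data", n0, n1)   # n0/n1 form a cache pair
handle_node_failure(n0, n1, n2)            # n0 fails; n1's copy is re-mirrored to n2
print("blk42" in n2.cache)                 # True: the cache data is protected again
```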

The InServ and VMAX architectures are very similar but differ from each other in one subtle but important way. 3Par InServ LUNs are divided into chunklets (256MB slices of disk) that are spread across all disks within the complex, so as an array is deployed and populated, all of the nodes in the array are involved in serving data. VMAX uses the Symmetrix architecture of hypers – large slices of disk – to create LUNs, with four hypers used to create a 3+1 RAID-5 LUN, for example. As new engines are added to a VMAX array, the data is not redistributed across the new physical spindles, so data access is unbalanced across the VMAX engines and physical disks. In this way, InServ has better opportunities to optimise the use of nodes, although within VMAX the use of Virtual Provisioning can help to spread load across disks in a more even fashion. In addition, a fully configured VMAX array has up to 128Gb/s of bandwidth across the Virtual Matrix, exceeding InServ's capacity.
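The difference in layout can be sketched in a few lines of Python: a chunklet-style LUN ends up touching every disk in the complex, while a hyper-based 3+1 RAID-5 LUN touches only its four member disks. The disk counts, LUN size and function names below are illustrative assumptions only.

```python
# A rough sketch contrasting chunklet wide-striping with hyper-based RAID groups.

def chunklet_layout(lun_size_mb, chunklet_mb, all_disks):
    """Spread fixed-size chunklets round-robin across every disk in the complex."""
    n_chunklets = lun_size_mb // chunklet_mb
    return {all_disks[i % len(all_disks)] for i in range(n_chunklets)}

def hyper_layout(raid_group_disks):
    """A traditional 3+1 RAID-5 LUN touches only the four disks of its RAID group."""
    return set(raid_group_disks)

all_disks = [f"disk{i}" for i in range(64)]          # e.g. the disks behind all nodes
print(len(chunklet_layout(16384, 256, all_disks)))   # 16GB LUN in 256MB chunklets -> 64 disks
print(len(hyper_layout(all_disks[:4])))              # hyper-based LUN -> 4 disks
```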

In my opinion the tradeoff here comes down to increased scalability with dedicated nodes versus the latency introduced when data isn't located on the local node. In the 3Par model, data is always being accessed across nodes. In the EMC model, nodes only exchange data when the LUN's physical disks aren't located on the local node. This leads to two problems. Firstly, as more nodes are added, the number of node-to-node connections grows quadratically. For an 8-node array, there are at least 28 node-to-node connections (not including additional connections for redundancy). This increases to 120 for 16 nodes (more than a four-fold increase in connectivity for double the nodes) and nearly 500 connections for 32 nodes, to which VMAX can theoretically scale. The second issue is that of diminishing returns: as more nodes are added, more overhead is required to service data not found on the local node, leading to a situation where the benefit of adding further nodes is too small to be worth it.
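Those connection counts follow directly from the full-mesh formula n(n-1)/2, as this quick check shows:

```python
# n nodes in a full mesh need n*(n-1)/2 point-to-point links (before redundant links).

def mesh_links(n_nodes):
    return n_nodes * (n_nodes - 1) // 2

for n in (8, 16, 32):
    print(n, "nodes ->", mesh_links(n), "links")   # 8 -> 28, 16 -> 120, 32 -> 496
```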

IBM XIV

The IBM XIV array takes a different approach, with a node configuration that is tied directly to the underlying data protection mechanism of the hardware. XIV uses only RAID-1 style protection, based on 1MB chunks of data known as partitions. Data is dispersed across nodes in an even and pseudo-random fashion, ensuring that for any LUN, data is written across all nodes. The architecture is shown in the XIV picture in the gallery at the end of this post. Nodes (known in XIV as modules) are divided into interface and data types. Interface modules have cache, processors, data disks and host interfaces. Data modules have no host interfaces but still have cache, processors and disk. Each module has twelve (12) 1TB SATA drives. As data is written to the array, the 1MB partitions are written across all drives and modules, ensuring that the two mirrored copies of any single partition do not reside on the same module. Sequential partitions for a LUN are also spread across modules. The net effect is that all modules are involved in servicing all volumes and the loss of any single module does not cause data loss.
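Here's a simplified sketch of that placement rule, assuming a 15-module system and a plain pseudo-random generator; IBM's actual distribution algorithm is certainly more sophisticated, so treat the function below as illustrative only.

```python
# Scatter 1MB partitions pseudo-randomly across modules, never mirroring
# a partition within the same module. Purely illustrative of the constraint.

import random

def place_partitions(n_partitions, modules, seed=0):
    rng = random.Random(seed)
    placement = []
    for _ in range(n_partitions):
        primary = rng.choice(modules)
        secondary = rng.choice([m for m in modules if m != primary])
        placement.append((primary, secondary))
    return placement

modules = [f"module{i}" for i in range(1, 16)]        # e.g. a 15-module system
layout = place_partitions(1000, modules)
assert all(p != s for p, s in layout)                 # no partition mirrored within a module
print(len({m for pair in layout for m in pair}))      # effectively all modules take part
```

Losing one module therefore costs only one copy of the partitions it held; the surviving copies are spread across all the other modules.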

Whilst XIV might be tuned for performance, there is still the inherent risk (however small) that a double disk failure results in significant data loss, as all LUNs are spread across all disks. Additionally, the XIV architecture requires that every write operation go through the Ethernet switches, as data is written to cache on both the primary and secondary modules before being confirmed to the host. As a consequence, the overall bandwidth of a single module is limited by the available network capacity, which is 6Gb/s for interface nodes and 4Gb/s for data nodes. This value halves if either of the Ethernet switches fails.
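The effect on per-module write bandwidth is simple arithmetic, using only the figures quoted above:

```python
# Per-module write bandwidth is bounded by the interconnect and halves
# if one of the two Ethernet switches is lost (figures as quoted above).

INTERFACE_MODULE_GBPS = 6
DATA_MODULE_GBPS = 4

def effective_bandwidth(module_gbps, switches_healthy=2):
    return module_gbps * switches_healthy / 2

print(effective_bandwidth(INTERFACE_MODULE_GBPS))      # 6.0 Gb/s with both switches
print(effective_bandwidth(INTERFACE_MODULE_GBPS, 1))   # 3.0 Gb/s after a switch failure
print(effective_bandwidth(DATA_MODULE_GBPS, 1))        # 2.0 Gb/s after a switch failure
```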

Summary

The multi-node storage arrays on the market today are all implemented in slightly different ways. Each has positive and negative points that contribute to the overall decision on which platform to choose for your data. Whether any of them are suitable for "Enterprise" class data is an open question that continues to be the subject of much debate. From my perspective, I would want a "tier 1" storage array to provide high levels of availability and performance, something each of these devices is capable of achieving.

Next I’ll discuss modular arrays and the benefits of dual controller architecture.

About Chris M Evans

Chris M Evans has worked in the technology industry since 1987, starting as a systems programmer on the IBM mainframe platform, while retaining an interest in storage. After working abroad, he co-founded an Internet-based music distribution company during the .com era, returning to consultancy in the new millennium. In 2009 Chris co-founded Langton Blue Ltd (www.langtonblue.com), a boutique consultancy firm focused on delivering business benefit through efficient technology deployments. Chris writes a popular blog at http://blog.architecting.it, attends many conferences and invitation-only events and can be found providing regular industry contributions through Twitter (@chrismevans) and other social media outlets.

  • Jim Kelly

    Most XIV systems installed are 15 nodes. With VMAX and 3PAR would I be right in guessing that most systems installed are 2 nodes?

    • Chris Evans

      Jim

      I would have to defer to someone from EMC and/or 3Par to answer that question. I guess there's an underlying design difference here. Each VMAX array (for example) can scale to over 2000 drives, meaning each node has to support a large number of drives, whereas XIV modules/nodes support a mere 12 drives, making the system a maximum of 180 drives. This is much lower than the VMAX/InServ systems, which can support between 1000 and 2000 drives, so with XIV there have to be many nodes to get the drive capacity up.

      As you work for IBM, could you explain why the drive density per module is so low? Would it not be possible to scale the drive count in order to create larger arrays?

      Chris

  • Tony Pearson

    Chris,
    To correct one of your sentences, a double drive failure on an XIV, most of the time, results in no data loss, and in a few (rare) cases, will result in at most the loss of a few GB which are easily identified and recovered in less time than a traditional RAID-5 double drive failure would require. Here is my blog post that goes into full detail:

    https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ddf-debunked-xiv-two-years-later?lang=en

    Tony Pearson (IBM)

  • Nigel Poulton

    Chris,

    Good post.

    Just a couple of points that I think are needed for clarity.

    1. I think it's fair to say that when wide back-end striping (which all three of your featured arrays do) is employed, there is an inherent risk that a double disk failure will cause "significant data loss". The only way to protect data on a VMAX or InServ from data loss during a double disk failure would be to employ a double-parity scheme such as RAID-6, which has its own trade-offs.

    2. VMAX also distributes "LUNs" over multiple back-end disks based on 768K chunks (assuming you are using Virtual Provisioning (VP), and I think it's fair to say most people are using VP today). LUNs are no longer tied to hypers unless you are deploying legacy style and not VP. The RAID architecture that is still restricted to, for example, 4 disks in a RAID-5 (3+1) is purely a unit of protection and no longer tied to the LUN, making VMAX look a lot more like InServ and XIV than your readers might otherwise infer.

    Nigel

    • Chris Evans

      Nigel

      Thanks for the comments. I agree with your points, however I'd add one statement to point 2: in InServ and XIV, wide striping is there by default, whereas with VMAX the default is a "legacy" mode that still uses LUNs and traditional RAID groups, and it has to be a conscious decision to use VP. I agree though that if you choose to deploy everything as VP across a VMAX array then it would look similar to the others. It's just not an original native feature.

      Cheers
      Chris

  • Damian

    Hi,
    thanks for the article, it helped me to get a good overview.
    Currently I'm trying to understand the redundancy techniques of these systems, and I wonder whether the VMAX is able to create RAID-1, RAID-5 or RAID-6 groups which span multiple storage bays? There is no explicit information in this regard. SRDF is an option for achieving geographical redundancy for physical storage, but it needs separate Fibre Channel connections between distinct systems. The obvious solution would be to use the Virtual Matrix instead, wouldn't it?

    Regards
    Damian

    • Chris Evans

      Damian

      If you mean can you spread a RAID group across multiple VMAX nodes, then I don't think so. Across multiple VMAX storage bays, yes, due to the way they are configured, but that doesn't buy you much, as the loss of a single bay would have a massive impact on your data availability. The recommended design for VMAX, I believe, is to place LUNs into Virtual Provisioning groups and use those to stripe I/O across many disks. Now, if you lose a node, then that's potentially catastrophic, as it would lose all of the LUNs that the node participates in. Although not all the data is lost, with 768KB blocks as your stripe size you'd pretty much find it impossible to retrieve anything meaningful from the data. With XIV, each node holds only one copy of any mirrored data, so the loss of a node isn't potentially catastrophic. However, the loss of two disks that are not in the same node could cause an issue if a data rebuild is still occurring.

      Regards
      Chris

  • Damian

    Hi Chris,
    thanks for the answer.

    It's a pity that a VMAX cluster wouldn't survive an engine failure. The XIV can survive a complete node failure, as can the HP/LeftHand P4000 series. That kind of architecture is well suited to supporting multi-site replication without additional software. However, it might be that EMC didn't implement it this way because the RapidIO infrastructure does not support large distances. Do you have any information concerning the maximum physical extent of a VMAX cluster?

    Regards
    Damian

    • http://www.brookend.com Chris Evans

      Damian

      I think you answered your own question, however I would point out that the VMAX nodes are built redundantly, so the chance of failure is much lower, although the impact is higher. XIV works the other way: the chance of complete node failure is higher but the impact is much lower. This becomes an interesting discussion; is it better to be more highly redundant in a multi-node array, or to have a lower impact from any single failure? As for RapidIO, the distance limitation does restrict both the size of the cluster and the ability to cluster over distance. Perhaps that's why EMC brought out VPLEX.

      Chris

  • Damian

    Hi Chris,
    it's me again. I found a statement on another blog that RapidIO can cope with distances of up to 80cm, so it's not appropriate for multi-site replication anyway.

    Best regards
    Damian
