Intel Ruler SSD and Chassis

The Intel SSD DC P4500 Series in the “ruler” form factor was designed to optimize rack efficiency and will be available by end of 2017. (Credit: Intel Corporation)

Chatting with good friend Enrico Signoretti earlier today on the subject of 1PB of flash in 1U, I was reminded of the new Intel Ruler form factor.  In case you missed the news, in August Intel debuted a long, thin form-factor SSD dubbed the “ruler” that stacks front to back and vertically in a server, potentially allowing up to 1PB per 1U of rack space.  Product details are scarce; however, from the images shown, the ruler SSD should be easier to hot-swap in a server and should have better heat dissipation, as I imagine the whole length of the body acts as a passive heatsink.  The images also suggest a single server could hold 32 ruler SSDs of 32TB each, based on a chip size of 1TB.  I’m guessing 32 active media slots and 4 for over-provisioning.
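
To sanity-check those numbers (which, remember, are my guesses from the images rather than anything Intel has confirmed), the capacity maths works out like this:

```python
# Back-of-the-envelope capacity maths for the ruler form factor.
# All of these figures are guesses from the published images,
# not specifications confirmed by Intel.

CHIP_SIZE_TB = 1        # assumed capacity per NAND package
CHIPS_PER_RULER = 32    # assumed active packages per ruler
RULERS_PER_1U = 32      # assumed ruler slots across a 1U chassis

ruler_capacity_tb = CHIP_SIZE_TB * CHIPS_PER_RULER
chassis_capacity_pb = ruler_capacity_tb * RULERS_PER_1U / 1000

print(f"Per-ruler capacity: {ruler_capacity_tb} TB")       # 32 TB
print(f"Per-1U capacity:    {chassis_capacity_pb:.2f} PB")  # 1.02 PB
```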

Getting back to the discussion with Enrico, we were talking about failure domains.  In this server form factor, the failure domain is either the ruler blade or the server, as the chassis design shown by Intel implies dual controllers.  The ruler is hot-swappable, which reduces the risk somewhat.  I would also imagine that the blades themselves sit in a redundant configuration to add an extra level of resiliency.

What happens when we get to 32TB and larger?  With QLC and a 1.5TB chip size, we could easily see 1.5PB in a chassis.  How much failure can a single ruler tolerate before the whole device fails?  Ultimately this is Enrico’s issue (I think): with larger and larger devices we are at risk of huge rebuilds but, more importantly, to make device pricing viable, Intel needs to be able to repair failed devices easily.
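
To put the rebuild exposure in perspective, here’s a rough sketch; the device capacities follow my guesses above, and the sustained rebuild rates are purely illustrative assumptions, not published figures:

```python
# Rough rebuild-time estimates for ever-larger flash devices.
# Capacities assume 32 chips per ruler at 1TB or 1.5TB (QLC);
# the sustained rebuild rates are illustrative only.

def rebuild_hours(capacity_tb: float, rate_gb_per_s: float) -> float:
    """Hours to reconstruct a failed device at a sustained rebuild rate."""
    return capacity_tb * 1000 / rate_gb_per_s / 3600

for capacity_tb in (32, 48):        # 1TB vs 1.5TB chips, 32 per ruler
    for rate in (1, 5):             # assumed sustained GB/s during rebuild
        print(f"{capacity_tb}TB ruler @ {rate} GB/s: "
              f"{rebuild_hours(capacity_tb, rate):4.1f} hours")
# 32TB ruler @ 1 GB/s:  8.9 hours
# 32TB ruler @ 5 GB/s:  1.8 hours
# 48TB ruler @ 1 GB/s: 13.3 hours
# 48TB ruler @ 5 GB/s:  2.7 hours
```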


SSD reliability is currently as good as that of hard drives when measured by MTBF or AFR (annual failure rate).  As we scale up, it will be interesting to see whether this level of reliability can be maintained.  It also raises the question of whether the NAND or the controller is the more likely failure point, and how many chips each controller channel will drive.
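
As a minimal sketch of how MTBF translates into failures on the ground (assuming a 2-million-hour MTBF, a typical enterprise SSD datasheet figure rather than anything published for the ruler):

```python
import math

# Converting MTBF to AFR, and what that implies for a fully populated
# chassis.  The 2-million-hour MTBF is an assumed datasheet-style figure,
# not anything Intel has quoted for the ruler.

MTBF_HOURS = 2_000_000
HOURS_PER_YEAR = 8760
RULERS_PER_CHASSIS = 32

afr = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)  # annualised failure rate per device
expected_failures = afr * RULERS_PER_CHASSIS      # expected device failures per chassis-year

print(f"AFR per ruler:                 {afr:.2%}")                          # ~0.44%
print(f"Expected failures per chassis: {expected_failures:.2f} per year")   # ~0.14
```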

When Hitachi introduced their FMD, a lot of additional intelligence went into the controller, but RAID wasn’t included; device failure was still managed across multiple devices.  With 32 rulers in a server, and data potentially spread across multiple rulers, erasure coding would probably make sense; however, it isn’t efficient with small block writes.  Implementing the right level of data protection could be an issue for these devices.
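
To illustrate why small block writes sit awkwardly with erasure coding: a partial-stripe update forces a read-modify-write of the parity fragments. The 8+2 layout below is just an assumed example, not a scheme Intel or Hitachi have described:

```python
# A small in-place update to one data fragment of an erasure-coded stripe
# forces a read-modify-write of every parity fragment: read the old data
# and old parity, recompute, then write new data and new parity.

def small_write_ios(parity_fragments: int) -> int:
    """Back-end I/Os needed to update a single data fragment in place."""
    reads = 1 + parity_fragments     # old data fragment + old parity fragments
    writes = 1 + parity_fragments    # new data fragment + new parity fragments
    return reads + writes

k, m = 8, 2  # assumed 8 data + 2 parity fragments spread across rulers
print(f"{k}+{m} erasure coding: {small_write_ios(m)} back-end I/Os per small write")
print("Simple mirroring:      2 back-end I/Os per small write")
```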

The Architect’s View

How comfortable would you feel about storing 1PB of active data in a single chassis?  In the quest for ever higher densities, this level of concentration could be an issue.  Object and file-based storage will probably be fine with erasure coding, but block-based storage less so.  This exposure is one of the reasons, I think, that Pure Storage went for active blades and a passive backplane in FlashBlade – with less to go wrong in the chassis.  Funny that even as we move forward, the old issues of storage still exist.

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2017 – Post #437F – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.


Written by Chris Evans

  • Enrico Signoretti

    Thank you for expanding this discussion into a blog post!

    Just to recap my thoughts/points about this 1PB/1U thing (Intel is an example, but Samsung with its 128TB drive could do the same, as could WD):

    I’m sure this is a design that is focused on capacity, to narrow the gap with HDDs in terms of $/GB. In fact, if I look at the performance, there are several issues that come to mind (i.e. performance consistency during garbage collection for such a large device, the total bandwidth needed and so on). Secondly, the two-controller design would mean that this is not a standard two-socket server, leading to scalability problems and so on (this is why I thought about a standard 1RU x86 server… and a scale-out approach).

    I could be wrong, but I think all these vendors want to find a solution for very high capacity, which makes sense, but we are talking again about the same issues we had in the past with large hard drives… just at a different order of magnitude.

    The question is, are we ready for that? Do we have the right data protection mechanisms for such devices and capacities? Will the system be balanced? (Even 100Gb/s Ethernet doesn’t sound like a lot to manage a 1PB device.) I have a lot of questions… unfortunately no answers yet. 😉

    ciao,
    E

    • Anthony Preston

      Enrico,

      I think when you look at these large all-flash solutions, Violin seem to be thinking about all these things. It is a shame no one wants to pick them up – or is it a matter of the sharks waiting for the company to collapse and then picking up their designs/patents?

      • Violin Memory is now Violin Systems and back in business. Look out for a post after this Friday when I meet/interview the new CEO.

  • There is a different angle to consider. Today, most companies regard storage as a scarce/limited resource and consider 500TB–1PB to be a large amount of data, so when a single unit holds 1PB, the entire estate can end up inside one failure domain.

    Consider an organisation that has 10, 20 or 50PB of data to manage. The failure domain becomes ‘normal’, and multiple 1RU chassis will handle the use case nicely: less space, less power.

    In my view, we have to check and re-check our thinking that IT infrastructure is a scarce and precious resource in 2017. It’s cheap, small and low power, and we need to start ‘wasting’ resources to get bigger.

  • The density-of-data consideration has applied to tape for many years. As LTO capacities increase, the consequences of a physical media problem increase. There’s nearly a kilometre of tape in an LTO-7 cartridge, so tape rotation and read/write refreshes become essential.
    As for 1PB/1U of NAND, somebody has to do the testing to find the weak spots and extrapolate the MTBF. When something breaks, the ability to do a partial NAND recovery (tape splicing, anyone?) may become an in-demand skill set.

  • John Martin

    Looking at the configuration above, I think that if you wanted to do HA and some form of RAID to address the availability/failure domain issue, you’d be hard pressed to put enough CPU into the 1U enclosure to exploit more than about 1 million sub-millisecond IOPS (I’m basing this on the EF570, which seems to use a similar amount of real estate to house the CPU+memory complex, and which I consider to be state of the art for small form-factor flash appliances). Even then, that doesn’t allow much in the way of additional power/thermals for multiple 100Gbit Ethernet ports or the additional CPU required to do things like compression, encryption or dedupe. If that figure holds true, then for a petabyte it works out to about 1 IO per GB, which is about as fast as a 400GB 10K magnetic SAS drive. If you then push that up to 15TB behind those same controllers, you’re down to current SATA IO densities; although most of the IOs will be at less than 1ms, it still looks more like a cool/cold storage option.

    The other possibility is that this is designed to be presented as a JBOF without RAID etc. that can be accessed directly by other servers via NVMe-oF, which would improve the IO density somewhat – but would it be enough for what most people I’ve talked to think of as primary storage I/O density (around 10+ IOPS/GB)? Part of me would love to find out, but personally I don’t have those kinds of R&D budgets 🙂

    Maybe this form factor/packaging is the precursor to something way more interesting using next-gen technologies like OmniPath and 3D XPoint. Until then, though, this looks like a pointer to things to come rather than something you’ll see deployed in enterprise datacenters, though I could imagine hyperscale datacenter designers like the folks at Google and Facebook might get a woody for that kind of density at flash power-consumption figures.

    Having said that, I’m glad I’m not working for a vendor that’s trying to use hardware as a primary differentiator when Intel and Samsung have lots of goodies like this up their sleeves, because all of that will find its way into the mainstream eventually.

    • John, I agree, the actual logistics of deployment don’t entirely add up. When Intel release a concrete product, either themselves or with a partner, we should get an idea. It’s fascinating to see the I/O boundary being challenged, though.