Reinventing The Storage Array and Learning From Backblaze



A recent post from the Backblaze blog has received wide coverage across the blogosphere and news outlets. Entitled “How long do disk drives last?“, it attempts to put some science behind hard drive lifespans, comparing observed failure rates against the manufacturers’ quoted figures. It’s no surprise that the results show drives failing more often than the vendors would have us believe. Of course, the drives in question were never designed for a 24×7 duty cycle, which could be a contributing factor. I’m sure Backblaze did a cost/benefit analysis when choosing the drive types for their bulk storage; although the drives may be powered on 24×7, many may not be performing active I/O at any given time.

Reading through this and a number of related posts (listed below), what’s interesting is the work Backblaze perform to test drives before putting them into production, capturing SMART statistics before and after a general load test. This is good practice and is something we expect our storage array vendors to do for us before shipping (and it is one way to justify the 3x markup on commodity hard drives). Of course, storage arrays have many additional processes to ensure drives are well looked after, or have extra error recovery built in. For example, NetApp’s Data ONTAP uses block checksums on SATA drives, which consumes an additional 11% of capacity but provides better resiliency; Hitachi performs a read-after-write process to protect data on SATA drives on their hardware platforms. As for failing disks, most enterprise vendors implement predictive failure and spare out a drive before it actually fails. This means data can be copied off rather than rebuilt from parity, which is much quicker and has less impact on performance.
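Backblaze haven’t published their exact test harness, but the general shape of a before/after SMART comparison is easy to sketch. The Python fragment below is a minimal illustration using smartctl from smartmontools; the watched attributes and the “any growth fails the drive” rule are my assumptions, not Backblaze’s published criteria.

```python
"""Sketch of a before/after SMART comparison for drive burn-in.
Assumes smartmontools is installed and the script runs with root
privileges; attribute choices and pass/fail rule are illustrative."""
import subprocess

# SMART counters commonly treated as early failure indicators.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "UDMA_CRC_Error_Count"}

def smart_attributes(device):
    """Return the raw values of watched SMART attributes for a device."""
    # Note: smartctl's exit status is a bitmask and can be non-zero even
    # on success, so we don't pass check=True here.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    values = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values

def burn_in(device, load_test):
    """Snapshot SMART stats, run a load test, and return any counters
    that grew; a non-empty result would fail the drive."""
    before = smart_attributes(device)
    load_test(device)  # e.g. a long sequential write/read pass
    after = smart_attributes(device)
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] > before.get(name, 0)}
```

In practice, a drive showing any increase in reallocated or pending sectors after the burn-in load would be returned rather than racked.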

So, the question is, what do Backblaze do? Do they perform additional disk scrubbing or error checking? Their Storage Pods run Debian Linux with the ext4 filesystem (they previously used JFS), and I’m not aware of anything specific that platform offers in this area. The pod access protocol is HTTPS; they don’t run iSCSI or anything else to store or retrieve data. I could ask the converse question: does it matter whether Backblaze do additional error checking and correction, if they have enough redundancy in the rest of their system? For Backblaze’s usage, as a large backup archive, I suspect the answer is that it doesn’t matter and their mode of operation works perfectly well for them. However, there is a reason for following this line of discussion.
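For context, a disk scrub is simply a background pass that re-reads data and compares it against checksums recorded at write time, repairing any mismatch from redundancy. Here’s a deliberately simplified, file-level Python illustration; real implementations (ZFS, for example) work at the block layer and repair from parity or replicas, and ext4 notably does not checksum data blocks at all.

```python
import hashlib

CHUNK = 1 << 20  # scrub in 1 MiB chunks

def scrub(path, expected_digests):
    """Re-read every chunk and compare against SHA-256 digests recorded
    at write time; return byte offsets of any silent corruption found.
    `expected_digests` is a stand-in for the per-block checksums that a
    checksumming filesystem maintains."""
    suspect = []
    with open(path, "rb") as f:
        for i, expected in enumerate(expected_digests):
            chunk = f.read(CHUNK)
            if hashlib.sha256(chunk).hexdigest() != expected:
                suspect.append(i * CHUNK)  # repair from replica/parity here
    return suspect
```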

I recently discussed the idea of hyper-converged solutions, merging storage and compute on the same platform, and last week at the UK VMUG I sat through a presentation on VMware’s VSAN feature in vSphere, which delivers distributed storage across vSphere clusters. The session presenter indicated that the GA release of VSAN will have no additional error checking built into the product: no predictive sparing of drives, no disk scrubbing and so on. I don’t know whether any integrity checking, such as read-after-write, will be used to validate the mirrored copies of data as they are distributed across the cluster; before I pass judgement, I will review all of the available material, just to be sure. On a side note, although I haven’t written about it in detail, I do know that Maxta performs additional consistency checking to mitigate logical corruption.
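To make the read-after-write idea concrete, here’s a minimal Python sketch of the technique. It is conceptual only: on a filesystem the read-back may be served from the page cache rather than the media, whereas an array verifies against the drive itself, so a production version would open the device with O_DIRECT.

```python
import os

def write_with_verify(fd, offset, data):
    """Write a block, flush it, then read it back and compare, raising
    on any mismatch so the block can be rewritten or remapped."""
    if os.pwrite(fd, data, offset) != len(data):
        raise IOError(f"short write at offset {offset}")
    os.fsync(fd)  # push the write toward the media
    if os.pread(fd, len(data), offset) != data:
        raise IOError(f"read-after-write verify failed at offset {offset}")
```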

The Architect’s View

We’ve had 20+ years of building up intellectual property around managing hard drives. Let’s hope that as solutions like VSAN, ScaleIO, Maxta and others come along, we don’t throw away that knowledge and leave ourselves in a worse position than before. As these solutions develop, it would be useful to have a good view of the health of internal hard drives and to use that knowledge to proactively mitigate hard failures, perform volume evacuations transparently and, above all, keep administrator intervention to a minimum. After all, these solutions are not purely storage arrays; any intervention or issue affects both data and applications.
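As a thought experiment, the predictive sparing behaviour described above could be as simple as the sketch below, reusing the smart_attributes() helper from earlier. The thresholds and the evacuate() hook are hypothetical, for illustration only; they do not represent any vendor’s actual policy.

```python
# Illustrative thresholds only, not any vendor's actual policy.
THRESHOLDS = {"Reallocated_Sector_Ct": 50, "Current_Pending_Sector": 1}

def check_and_spare(device, evacuate):
    """Predictive sparing: if a watched SMART counter crosses its
    threshold, copy data off while the drive still reads, rather than
    waiting for a hard failure and a parity rebuild."""
    stats = smart_attributes(device)  # helper from the earlier sketch
    breached = {name: value for name, value in stats.items()
                if value >= THRESHOLDS.get(name, float("inf"))}
    if breached:
        evacuate(device)  # hypothetical hook: drain data, then fail the drive
    return breached
```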

What processes will be available for VSAN users to stress-test drives as they go into production, in a similar way to Backblaze? One final thought: there are startups, such as Nutanix and SimpliVity, selling combined hardware and software solutions. The same question needs to be posed to them, to see how they are dealing with this issue.

 

Related Links

 

Comments are always welcome; please indicate if you work for a vendor as it’s only fair.  If you have any related links of interest, please feel free to add them as a comment for consideration.

Subscribe to the newsletter! – simply follow this link and enter your basic details (email addresses not shared with any other site).

Copyright (c) 2013 – Brookend Ltd, first published on http://architecting.it, do not reproduce without permission.

About Chris M Evans

Chris M Evans has worked in the technology industry since 1987, starting as a systems programmer on the IBM mainframe platform, while retaining an interest in storage. After working abroad, he co-founded an Internet-based music distribution company during the .com era, returning to consultancy in the new millennium. In 2009 Chris co-founded Langton Blue Ltd (www.langtonblue.com), a boutique consultancy firm focused on delivering business benefit through efficient technology deployments. Chris writes a popular blog at http://blog.architecting.it, attends many conferences and invitation-only events and can be found providing regular industry contributions through Twitter (@chrismevans) and other social media outlets.