For the last four years, Backblaze has been producing statistics from their pool of hard drives.  The company provides cloud storage that started with backup and now covers S3-like capabilities.  The figures for 2017 were recently released and make interesting reading.

SMART

Backblaze now has a data set spanning just over 93,000 live drives.  There is also data from decommissioned drives, with over 88 million data points in all.  Data is collected from the SMART (Self-Monitoring, Analysis and Reporting Technology) attributes reported by each drive.  The data set is as large as the famous Google study from the mid-2000s that initially showed a correlation between SMART information and drive failure rates.

SMART isn’t just for hard drives.  It also provides a range of statistics for flash drives.  Exactly what is available depends on the manufacturer and the tools used to extract the data.  For example, drive stats are available on ESXi, but the results aren’t consistent.  Backblaze is querying the drives directly and extracting all available data.  You can find more details here and download the data set yourself.

Analysis

What can we learn from the data published this time around?  The useful stats are those showing data for the entire collection period and for the failures per year.  Smaller samples tend to skew the figures, so the more useful values are those for the bigger samples of drives.  Picking out an obvious outlier like the ST4000DM000, there’s an annualised failure rate (AFR) of around 3%, which is higher than the roughly 0.5% we would expect from a standard enterprise drive.  However, this drive is a desktop model, so it is not rated for 24×7 operation.  By comparison, the ST12000NM0007 is enterprise-class and has a published AFR of 0.35%, but shows around 2% in the Backblaze study.  The same applies to the ST8000NM0055, which is rated at 0.44% but achieves 1.23%.
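To make the AFR figures concrete, here is a minimal Python sketch of how an annualised failure rate is commonly derived from drive-days of operation, which is the approach Backblaze describes for its own reports.  The function name and the input figures are hypothetical and chosen only for illustration; they are not taken from the published tables.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Return AFR as a percentage: failures per drive-year of operation."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return 100.0 * failures / drive_years


# Hypothetical large sample: 1,000 drive-years of operation, 30 failures.
large_sample = annualized_failure_rate(failures=30, drive_days=365_000)
print(f"large sample AFR: {large_sample:.2f}%")

# Hypothetical small sample: 100 drives observed for 90 days, 1 failure.
# A single failure moves the figure a long way when drive-days are few.
small_sample = annualized_failure_rate(failures=1, drive_days=100 * 90)
print(f"small sample AFR: {small_sample:.2f}%")
```

The second call illustrates why small samples skew the figures: one extra failure in a small population changes the AFR by several percentage points, whereas it would barely register in the larger one.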

Looking at the data per year, a slightly different picture emerges.  Some of the longer-lived drives show reduced failures over time, whereas some get worse.  Probably the most reliable drives are those from Hitachi/HGST, which have AFRs below 1%, and in some cases below 0.5%, over the period.

The Architect’s View

Could you use this data to decide the best product to buy?  Personally, I don’t think you can.  Some vendors are stronger in one drive capacity than another and, in general, enterprise-class drives don’t have a better AFR than cheaper desktop ones.  Backblaze is typically writing sequential data once to the drive, with some subsequent read activity, but the drives aren’t really stressed.  If the workload profile were more aggressive, then I’m sure a different picture would emerge.  Adding activity stats (aggregate read/write) to the mix here would be a good additional step.

We also don’t see any pricing associated with the data.  From a TCO perspective, there may be a significant advantage in buying cheap drives (which is typically what Backblaze is doing) rather than enterprise-quality ones.  The cheaper drives in the study don’t appear to be significantly worse than their enterprise counterparts, so data protection wouldn’t be an issue.

Unless there is a manufacturing defect, the lesson here would appear to be that drives truly should be thought of as a commodity.  That means purchasing in bulk where possible, for the lowest price.  The savings from bulk purchasing probably outweigh the differences in failure rates, assuming you have enough buying power.
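The trade-off between purchase price and failure rate can be sketched with some simple arithmetic.  The Python below is an illustrative model only: the function name, prices, AFRs and service life are all hypothetical assumptions, since the Backblaze data contains no pricing.  It also ignores labour, rebuild risk and warranty replacements.

```python
def cost_per_drive_year(unit_price: float, afr_percent: float,
                        service_years: float) -> float:
    """Purchase cost plus expected replacement purchases, per year of service."""
    if service_years <= 0:
        raise ValueError("service_years must be positive")
    # Expected number of replacement drives bought over the service life,
    # assuming a constant AFR (a simplification).
    expected_replacements = afr_percent / 100.0 * service_years
    total_cost = unit_price * (1 + expected_replacements)
    return total_cost / service_years


# Hypothetical inputs: a cheaper drive with a higher AFR versus a
# dearer drive with a lower AFR, both kept in service for five years.
cheap = cost_per_drive_year(unit_price=100.0, afr_percent=3.0, service_years=5)
dear = cost_per_drive_year(unit_price=150.0, afr_percent=1.0, service_years=5)
print(f"cheaper drive: {cheap:.2f} per drive-year")
print(f"dearer drive:  {dear:.2f} per drive-year")
```

Under these assumed numbers the price gap dominates the difference in failure rates, which is the point being made above; real prices and replacement costs would be needed to draw any firm conclusion.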

What about the rest of us?  As we move to SDS solutions, I’ve been thinking about how much drive data we lose access to compared with traditional shared arrays.  Storage vendors love to talk about their array analytics, and it would be nice to see what data they have on drive failures.  But if we move to large-scale SDS deployments, who will collect and analyse the data from all of these drives?  Perhaps we need a central anonymised database and a set of standards that allow anyone to plug into the Backblaze model.  The benefit of this wouldn’t be in choosing one manufacturer over another, but in highlighting manufacturing defects, firmware bugs and other issues that affect the whole storage community.

Copyright (c) 2007-2018 – Post #5097 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission. Photo credit iStock.  Tables in this post, copyright (c) Backblaze.

Written by Chris Evans