This post is part of a series covering critical features of object storage platforms and extends this post from the beginning of 2017.

Protecting data in an object store presents different challenges from those of traditional storage.  In block-based storage, for example, capacities are typically modest enough that individual applications can be backed up relatively easily.  Systems can also use tools like snapshots and replication to recover from data corruption and hardware failure.  Object storage platforms represent a different challenge.  When systems scale into petabytes of capacity, it makes no financial sense to have a secondary system onto which the object data is copied in case of failure.  This means object stores have to deal with unique data consistency and protection issues, including:

  • System Failure – how do systems protect against disk, server or site failure?
  • Corruption – how do systems protect against silent data corruption, either within software or hardware?
  • User Error – how do systems protect against users inadvertently deleting content?

When only one logical copy of data exists, some of the above problems can be quite challenging.  We will dig into these in a moment.

A Couple of Definitions

When looking at object stores, the terminology is slightly different from that of traditional storage.  The term Durability is used to describe the risk of data loss over time, in other words how likely it is that content, once written, can still be retrieved intact.  AWS S3 storage is designed for 99.999999999% (that’s eleven 9’s) durability, which works out to an expected loss of roughly one object in 100 billion per year.  The ability to provide such a high level of durability comes from all of the processes AWS uses to protect data, including keeping multiple replicas, distributing data geographically and of course media management.
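To put eleven nines into perspective, here is a back-of-the-envelope sketch in Python.  It treats durability as an annual per-object loss probability, which is consistent with how AWS describes the figure; the ten-million object count is just an example.

```python
# Rough illustration of what "eleven nines" of durability implies.
# Assumes durability is an annual, per-object figure.

durability = 0.99999999999            # 99.999999999%
annual_loss_probability = 1 - durability

objects_stored = 10_000_000           # ten million objects
expected_losses_per_year = objects_stored * annual_loss_probability

print(f"Per-object annual loss probability: {annual_loss_probability:.0e}")
print(f"Expected losses per year across {objects_stored:,} objects: "
      f"{expected_losses_per_year:.4f}")
# ~0.0001 objects per year, i.e. roughly one object lost every 10,000 years
```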

Availability refers to the uptime of the system, or what guarantees exist around being able to access content.  Content might be temporarily inaccessible, but that doesn’t mean it has been lost.  In AWS terms, the SLAs covering S3 availability are 99.9% for Standard and 99% for Infrequent Access (although the design targets are higher), with no availability SLA for Glacier.

Note that we’ve used AWS as a reference here because, as a public cloud provider, the company publishes SLAs for its platform.

Data Protection at Scale

As storage systems scale, traditional protection methods no longer work effectively.  RAID was designed in an era when individual drives were barely reaching gigabyte capacities.  With today’s multi-terabyte drives, rebuild times have become impractically long, to the point where two or three parity drives are a minimum.  This degrades performance and exposes systems to data loss while rebuilds are in progress.
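To illustrate why rebuild times have become a problem, a quick back-of-the-envelope calculation helps; the 14TB drive size and 200MB/s sustained rebuild rate below are assumed, illustrative figures.

```python
# Minimum (best-case) rebuild time for a single failed drive.
# The drive size and rebuild rate are illustrative assumptions.

drive_tb = 14
rebuild_rate_mb_s = 200

seconds = (drive_tb * 1e6) / rebuild_rate_mb_s   # TB -> MB, then MB / (MB/s)
print(f"{drive_tb} TB at {rebuild_rate_mb_s} MB/s: {seconds / 3600:.1f} hours minimum")
# ~19.4 hours of flat-out reading, before any contention with foreground I/O
```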

RAID does not provide any geographic protection of data.  Recovering from site loss means keeping multiple copies, with considerable additional expense in hardware and process to keep the copies consistent.  Early object store platforms like Ceph and SwiftStack simply used replication as the protection mechanism, keeping three or more copies of data.  Even with hard drives as cheap as they are today, replicating multi-petabyte systems makes no sense.

Erasure Coding

The approach taken by today’s modern object stores is to use erasure coding, a form of forward error correction that is used in the telecommunications industry.  Erasure coding uses a mathematical transformation to divide and translate data into a format that builds in redundancy.  This allows data to be rebuilt from only a subset of the parts created.  Data can be distributed across multiple drives, nodes, chassis and data centres, so that the loss of any one component doesn’t result in data loss.

As an example, imagine a 10MB file that is transformed with erasure coding into 16 chunks, any 12 of which can reconstitute the data.  Placing four chunks in each of four data centres means the failure of any single data centre can be tolerated, because the remaining three still hold the 12 pieces needed to recreate the original data.
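To make the placement logic concrete, here is a minimal Python sketch (not any vendor’s implementation) that spreads the 16 fragments across four data centres and checks how many site failures a 12-of-16 scheme tolerates.

```python
# Illustration of erasure-coded fragment placement across sites.
# Assumed scheme: 16 fragments in total, any 12 sufficient to rebuild.

TOTAL_FRAGMENTS = 16
FRAGMENTS_NEEDED = 12
SITES = ["dc1", "dc2", "dc3", "dc4"]

# Round-robin placement: four fragments end up in each data centre.
placement = {site: [] for site in SITES}
for fragment_id in range(TOTAL_FRAGMENTS):
    placement[SITES[fragment_id % len(SITES)]].append(fragment_id)

def survives(failed_sites):
    """Return True if the object can still be rebuilt after losing the given sites."""
    surviving = sum(len(fragments) for site, fragments in placement.items()
                    if site not in failed_sites)
    return surviving >= FRAGMENTS_NEEDED

print(placement)
print("Single site loss tolerated:", survives({"dc3"}))          # True: 12 fragments remain
print("Double site loss tolerated:", survives({"dc2", "dc3"}))   # False: only 8 remain
```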

Media Management

The techniques we’ve discussed so far cover the protection mechanisms that manage entire device or node loss.  Many errors are much more transient and result in corrupted data on the media itself.  Modern hard drives are typically rated at around one unrecoverable read error (URE) per 10^15 bits read.  When this happens, the affected data can be recovered from the redundancy in the data protection scheme (parity/RAID, erasure coding or replicas).
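To see why this error rate matters at today’s capacities, a quick calculation (assuming the commonly quoted one-URE-per-10^15-bits specification and an illustrative 10TB drive):

```python
# Probability of hitting at least one unrecoverable read error (URE)
# when reading a drive end to end.  Assumes a 1-per-1e15-bits URE spec.

URE_PER_BIT = 1e-15                   # probability of a URE per bit read
drive_tb = 10                         # illustrative modern drive size
bits_read = drive_tb * 1e12 * 8       # full-drive read, in bits

p_error = 1 - (1 - URE_PER_BIT) ** bits_read
print(f"Chance of at least one URE reading a {drive_tb} TB drive: {p_error:.1%}")
# ~7.7% for a single full read, which is why scrubbing and extra redundancy matter
```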

However, as a preventative measure, many platforms run CRC checking tasks in the background that scrub the data.  This process periodically reads and tests content, rebuilding incorrect data that doesn’t match checksums.  WOS from DDN, Scality RING and IBM Cleversafe are all examples of platforms performing background data scrubbing.
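Conceptually, a scrub simply re-reads data and compares it against a checksum recorded at write time.  The Python sketch below uses CRC-32 from the standard library purely as an illustration; a real platform would rebuild any mismatching fragment from parity or replicas rather than just report it.

```python
import zlib

# Simulated object store: fragment_id -> (data, checksum recorded at write time)
store = {}

def put_fragment(fragment_id: str, data: bytes) -> None:
    """Write a fragment and record its CRC-32 checksum at ingest."""
    store[fragment_id] = (data, zlib.crc32(data))

def scrub() -> list:
    """Re-read every fragment and flag any whose checksum no longer matches."""
    corrupt = []
    for fragment_id, (data, stored_crc) in store.items():
        if zlib.crc32(data) != stored_crc:
            corrupt.append(fragment_id)   # a real system would rebuild here
    return corrupt

put_fragment("frag-0", b"hello object storage")

# Simulate silent corruption on the media: the data changes, the stored CRC does not.
data, crc = store["frag-0"]
store["frag-0"] = (b"hellX object storage", crc)

print("Fragments needing rebuild:", scrub())   # ['frag-0']
```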

Performance

The benefit of using erasure coding over distance is the capacity saved by not creating multiple full replicas to cater for site failure.  The disadvantage is the calculation overhead (on both reads and writes) and the latency incurred if data is spread over multiple locations.  The overhead of erasure coding is even more pronounced with smaller files, so some vendors offer a mix of erasure coding and RAID/mirroring depending on the file size.  Using both protection schemes makes it possible to strike the right balance between I/O performance and data protection.
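The capacity side of that trade-off is easy to quantify: a scheme with k data and m parity fragments consumes (k+m)/k times the user data, whereas n-way replication consumes n times.  A small sketch:

```python
def ec_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw capacity consumed per unit of user data for a k+m erasure coding scheme."""
    return (data_fragments + parity_fragments) / data_fragments

def replication_overhead(copies: int) -> float:
    """Raw capacity consumed per unit of user data for n-way replication."""
    return float(copies)

print(f"12+4 erasure coding: {ec_overhead(12, 4):.2f}x raw capacity")       # 1.33x
print(f"3-way replication:   {replication_overhead(3):.2f}x raw capacity")  # 3.00x
```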

Policy

Of course, the actual implementation of data protection schemes should be tied to policies that are applied to data.  Policies dictate a level of protection and are in effect service levels for the data.  In public cloud, the specific implementation is hidden from the end user.  In private cloud, administrators determine the protection levels that are available, based on business requirements.
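As a purely hypothetical illustration of policy-driven protection (the tier names, size threshold and schemes below are invented for the example, not taken from any specific product):

```python
SMALL_OBJECT_THRESHOLD = 1024 * 1024   # 1 MiB, an illustrative cut-off

def protection_for(object_size: int, tier: str) -> str:
    """Pick a protection scheme based on object size and the policy tier."""
    if object_size < SMALL_OBJECT_THRESHOLD:
        return "3-way replication"                    # avoid EC overhead on small objects
    if tier == "archive":
        return "erasure coding 9+3, geo-dispersed"    # capacity-efficient, latency-tolerant
    return "erasure coding 12+4, single site"         # balance of efficiency and performance

print(protection_for(200 * 1024, "standard"))         # 3-way replication
print(protection_for(50 * 1024 * 1024, "archive"))    # erasure coding 9+3, geo-dispersed
```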

Vendor Implementations

This section provides a brief summary of vendor implementations and will be expanded over time.

Caringo Swarm features both erasure coding and replication as a suite of protection features called Elastic Content Protection.  Up to 16 replicas can be specified, with the ability to reduce the replica count automatically over time using a feature called Lifepoints.  Erasure coding supports a range of schemes, dependent on the number of nodes available for data distribution.

Cloudian HyperStore provides the capability to use both replication and erasure coding.  The number of replicas can be specified by policy and each is kept on a separate node.  HyperStore supports three erasure coding levels: 2D+1P, 4D+2P and 9D+3P, where D represents the data fragments and P the parity fragments.  Data can be recreated from any 2 of 3, 4 of 6 or 9 of 12 fragments respectively.

DDN WOS uses four data protection techniques.  Replication (either synchronous or asynchronous) stores multiple copies of data, with larger objects broken down into smaller chunks for distribution across multiple media devices.  ObjectAssure (erasure coding) can be implemented locally, reducing the impact of WAN latency.  Replicated ObjectAssure combines both Replication and Local ObjectAssure for enhanced protection.  Finally, Global ObjectAssure provides a combination of local and distributed erasure coding to minimise the effect of latency on data rebuilds.

NooBaa provides data protection across logical nodes using 3-way replicas.  Data can also be protected by replicating to the public cloud (currently AWS, Azure or GCP).  Replicated cloud data can sit either in the internal NooBaa format (which is chunked for dedupe) or natively on the cloud provider’s platform.

Scality RING offers both replication and erasure coding.  From zero to 5 replicas can be maintained across separate nodes.  Erasure coding can use any combination of data and parity segments, with the ability to support multiple configurations (or classes of service) across the RING at the same time.  For wider data protection, a RING can be spread across multiple geographic locations, or simply mirrored to another separate and independent RING.

Clearly the different schemes have benefits and disadvantages; however, the specifics are a subject for a more detailed set of posts.

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2017 – Post #590A – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.

Last Edit/Review: 24 October 2017

 


Written by Chris Evans

  • Well, the future of object storage data protection lies with erasure coding data. It is true that dispersing erasure coded data over multiple data centers in a region is problematic from a data protection and performance perspective. Erasure coding data in one data center and then replicating the erasure coded data to another data center is a better solution and one that Cloudian has implemented.

    Another approach is based on hierarchical erasure coding, which safely disperses erasure coded data over multiple data centers, but I’m not aware of anyone who actually does that today in production. Mostly I’ve read about hierarchical erasure coding in academic papers. Maybe you can let us know if any OBS vendor has actually implemented this in practice. I think IBM Cleversafe might have this capability as they only erasure code data and they do support multiple data centers. Cleversafe also amassed hundreds of U.S. patents prior to being acquired by IBM in November 2015, so there may be indications of having implemented this capability in their patent filings.

  • Glen Olsen

    Chris, great information here, although I notice that for Caringo Swarm you state that “Up to three replicas can be specified”, whereas in fact Swarm allows for up to 16 replicas.

    • Thanks Glen, I’ll make the amendment. I did quite a bit of research to find out what the maximum was, but 16 never came up. Can you point me to any documentation that shows that figure? Thanks!

      • Glen Olsen

        Hi Chris, it’s documented in the Swarm Guide.  You can access the guide on connect.caringo.com if you have requested or already have an account; otherwise let me know and I can provide you a PDF copy.  From the Swarm Guide:

        ====
        policy.replicas

        min:2 max: 16 default:2 anchored

        The min, max, and default replicas allowed for objects in this cluster. SNMP name: policyReplicas
        ====

        • Just requested an account thanks. Will let you know if I need a PDF (hopefully the account will be approved and I won’t).