The initial answer to the question “should I be backing up containers?” would appear to be no.  On the face of it, a container is meant to be a temporary instance of an application component, and so not long-lived enough to need backing up.  Dig deeper, though, and there is a need for backup – depending on how an application and its data are implemented.

Container Data

Containers (in the current sense, those built using Docker) are instances of applications or microservices that are designed to be relatively short-lived.  Containers can be scaled up or down to meet application load and so shouldn’t, in themselves, need any backing up.  With rapid development, the base image of a container may be modified quite frequently.  If a container fails, it can be restarted and reconfigured from code.  For this reason, backing up a copy of the running container image makes no sense.

Container Persistence

As container technology has matured, there has been a desire to make applications more persistent, including running databases within a container ecosystem.  This presents two deployment scenarios.  One is to create stateless containers where the database data is copied in when the container is created.  The second is to store the data on a file system or volume and attach it to the container at startup.
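Both scenarios can be sketched with standard Docker commands; the image, volume and path names here are illustrative, not prescriptive:

```shell
# Scenario 1: stateless container - seed data is copied into the image
# at build time (illustrative Dockerfile fragment):
#   FROM postgres:15
#   COPY seed.sql /docker-entrypoint-initdb.d/

# Scenario 2: persistent data on a named volume, attached at startup
docker volume create appdata
docker run -d --name appdb \
  -v appdata:/var/lib/postgresql/data \
  postgres:15
```

In the first case the data lives and dies with the container; in the second, the volume outlives any individual container instance – and that volume is what a backup process would need to find.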

It’s also worth pointing out that a container doesn’t have to be running a database – persistent data could be needed for other reasons.

One solution to the persistence requirement has been to attach LUNs or volumes to the container host.  The volume is formatted with a file system, mounted on the host and attached to the container.  This is the data we want to protect.  But unlike backing up, say, a virtual machine, there’s no intrinsic name or association we can use to track the data through its lifetime.
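The host-attached approach described above might look something like this – the device name, mount point and image are all assumptions for illustration:

```shell
# Assumes the LUN is visible to the host as /dev/sdb (illustrative)
mkfs.ext4 /dev/sdb              # format the volume with a file system
mkdir -p /mnt/appdata
mount /dev/sdb /mnt/appdata     # mount it on the container host

# Bind-mount the host path into the container; /mnt/appdata is now
# the data we want to protect
docker run -d --name appdb \
  -v /mnt/appdata:/var/lib/app/data \
  myapp:latest
```

Note that nothing in this chain ties `/mnt/appdata` to the application in a way a backup tool can discover on its own.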

Application Data

What does that last statement mean?  Well, think about a virtual machine.  The data and the VM are intrinsically linked to each other.  Even if the application data within a VM sits in a separate volume or partition, the VM and the data volume share the same name.  If that data is ever needed again, a user will look for the VM name to manage the restore.  The VM will have metadata tagged to it that helps track usage, and this will be stored by the backup system.

This scenario isn’t the same with containers.  Because they are short-lived, containers have GUID-style identifiers, although we can associate friendly names with them.  The same goes for the application data mount point.  To avoid conflicts on hosts that could run hundreds of containers, mount points may also have GUID-style names and be different each time the container image is started.
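The ID-versus-name split is easy to see with standard Docker tooling (the container name `appdb` is an assumption carried over for illustration):

```shell
# Container IDs are GUID-style; friendly names are optional labels on top
docker ps --format '{{.ID}}  {{.Names}}'

# Anonymous volumes get GUID-style names too - working out which volume
# belongs to which container means inspecting the container's mounts
docker inspect -f \
  '{{range .Mounts}}{{.Name}} -> {{.Destination}}{{"\n"}}{{end}}' appdb
```

A backup system has to chase these mappings at the moment of backup, because they may not be stable across container restarts.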

Backing Up The Data

If we’re backing up the container data, how do the application name, owner, security settings and other information get associated with that LUN/volume or file share?  Today we assume that an infrastructure object like a VM is the actual application.  In the future, as containers are scaled up and down, an application might have ten containers (and ten associated volumes) one day and only five containers and five volumes the next.  So, if we’re backing up and restoring that application after data corruption, what’s the object we restore – five volumes or ten?  Do we back up the data in the file system, or do an application-level backup?
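For the file-system option, one commonly used pattern is to snapshot a volume’s contents with a throwaway container – the volume name and paths below are illustrative:

```shell
# Archive the contents of the "appdata" volume to the current directory,
# using a short-lived busybox container that mounts the volume read-only
docker run --rm \
  -v appdata:/data:ro \
  -v "$PWD":/backup \
  busybox tar czf /backup/appdata-$(date +%F).tar.gz -C /data .
```

This captures a crash-consistent copy of the files; an application-level backup (for example, a database dump run inside the container) would be needed for transactional consistency – which is exactly the file-system-versus-application question posed above.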

The Architect’s View

When an application was a single VM (or group of VMs), understanding the data model was easy.  If the application gets corrupted or lost, just restore the last VM image.  With a container-based application, the data may be more transient and the data model more difficult to understand.  Backing up discrete “objects”, as we did with VMs, doesn’t provide enough detail on how to recover the application if the data is lost.

This is one of the reasons why I say that block-based storage will not be the future for containers.  We need the structure that a file system can offer, for many reasons.  The whole data management strategy for containers needs more thinking through.  Perhaps that’s why we’ve not seen containers take over the enterprise: the operational aspects of data are more important to the business than the technology.

Note: I’ve deliberately simplified some points for this discussion.  For example, application backups don’t have to be VM snapshots; they could be application-based, taken at the database level.  This subject is more complex than an 800-word blog post and worthy of further discussion.

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2017 – Post #60D8 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.


Written by Chris Evans

  • shjacks55

    Pardon my math: for the same bit density, data is read at the same rate for a 15K 2.5″ drive as a 10K 3.5″ drive. And if higher-quality SATA drives can also attach to a SAS controller over the same connector using the same cabling, and ATAPI supports the SCSI layer that SAS uses, then what’s the diff? The better quality drives have multiple LUNs behind each hard drive SCSI ID (SATA hard disk; ATAPI). The multiple LUNs allow each head to behave as a separate subsystem. The reliability of large multi-terabyte drives (based on the mathematics) suggests that RAID 6 is the minimum to protect from data loss.