This is one of a series of posts looking at data mobility for the hybrid cloud. You can find a link to all of the published articles, as they become available, here (link).
Caching is not a new technology. In fact, it’s been used as a technique to cost-effectively accelerate I/O performance since computers were invented. Caching works because generally, applications have an active working set that is smaller than the overall dataset size. The trick for the cache is keeping the active data in a faster medium, usually closer to the processor. If we can keep the cache filled with the active data at minimal overhead, we reduce the need to deploy lots of expensive storage. Today we think of the implementation of caching for storage as the use of DRAM in a storage array or host. This model is being extended with Persistent Memory like NVDIMM and the use of new technologies like Optane or flash.
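To make the working set idea concrete, here is a minimal sketch of a least-recently-used (LRU) cache in Python. LRU is only one possible eviction policy, and the class name, the block counts and the dictionary standing in for slow storage are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used block."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store  # slow tier (a dict stands in for disk)
        self.blocks = OrderedDict()         # fast tier (stands in for DRAM)
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.backing_store[block_id]    # slow path to the backing store
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the coldest block
        return data


# Assumed workload: a 1,000-block dataset where only 50 blocks are active.
dataset = {i: f"block-{i}" for i in range(1000)}
lru = LRUCache(capacity=64, backing_store=dataset)
for _ in range(20):
    for block_id in range(50):
        lru.read(block_id)

hit_rate = lru.hits / (lru.hits + lru.misses)
print(f"hits={lru.hits} misses={lru.misses} hit rate={hit_rate:.0%}")
```

Because the 50 active blocks fit inside the 64-block cache, only the first pass touches the slow tier. A cache holding 6.4% of the dataset serves the rest of the reads.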
The caching concept also applies to hybrid cloud. I’ve highlighted four main options for caching over distance that allow data to appear local (either in the cloud or on-site), while only a subset of the data is actually moved.
The idea of NAS caching is to extend a file system so that it appears in a remote location, without shipping all the data. To achieve this, we need to split data into actual physical content and metadata. The metadata describes the file system, including file hierarchy, size and so on. The data itself is simply physical blocks that will be stored in the local cache. All of the metadata has to be shipped to the location where the data is to be accessed, whereas the physical data is fetched on demand.
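The metadata/data split can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how any particular product works: `OriginFiler` and `EdgeCache` are hypothetical names, whole files are fetched rather than individual blocks, and the metadata is reduced to path and size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    path: str
    size: int


class OriginFiler:
    """Stands in for the source NAS (hypothetical)."""

    def __init__(self, files):
        self.files = files      # path -> file content
        self.bytes_sent = 0     # data shipped across the link so far

    def list_metadata(self):
        return [FileMeta(path, len(data)) for path, data in self.files.items()]

    def read(self, path):
        self.bytes_sent += len(self.files[path])
        return self.files[path]


class EdgeCache:
    """Remote view of the file system: all metadata, data only on demand."""

    def __init__(self, origin):
        self.origin = origin
        # The full namespace is copied up front so every file is visible.
        self.metadata = {m.path: m for m in origin.list_metadata()}
        self.data = {}

    def listdir(self):
        return sorted(self.metadata)

    def stat(self, path):
        return self.metadata[path]

    def read(self, path):
        if path not in self.data:               # first access crosses the link
            self.data[path] = self.origin.read(path)
        return self.data[path]


origin = OriginFiler({"/proj/a.txt": b"hello", "/proj/b.bin": b"x" * 4096})
edge = EdgeCache(origin)
print(edge.listdir(), origin.bytes_sent)   # namespace visible, no data moved yet
edge.read("/proj/a.txt")
edge.read("/proj/a.txt")                   # second read is served locally
print(origin.bytes_sent)
```

The remote site can list and stat every file immediately, but only the one file that was actually read has crossed the network.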
The trick for NAS caching is keeping the data consistent. This means making decisions on how data will be accessed in the source and cached locations. If the source is read-only while the cache is read-write, then the cache only has to ensure updates are replicated back successfully. However, if we need to maintain read-write access over distance, then the caching platform also has to manage file locks to ensure data doesn’t get corrupted. This can get pretty complex once we start thinking about having multiple caches accessing the same data source.
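As a rough sketch of the locking problem, assume the source site acts as the single authority that grants one exclusive write lock per file. The `LockManager` class and the site names below are hypothetical; real platforms also have to deal with lock leases, timeouts and link failures, which are ignored here.

```python
class LockManager:
    """Grants one exclusive write lock per path (runs at the source site)."""

    def __init__(self):
        self.owners = {}  # path -> site currently holding the write lock

    def acquire(self, path, site):
        # setdefault records the site only if nobody holds the lock yet.
        holder = self.owners.setdefault(path, site)
        return holder == site

    def release(self, path, site):
        # Only the current holder may release the lock.
        if self.owners.get(path) == site:
            del self.owners[path]


locks = LockManager()
got_london = locks.acquire("/proj/plan.doc", "london")
got_new_york = locks.acquire("/proj/plan.doc", "new-york")   # refused
locks.release("/proj/plan.doc", "london")
got_new_york_retry = locks.acquire("/proj/plan.doc", "new-york")
print(got_london, got_new_york, got_new_york_retry)
```

Every extra cache site adds another party that has to make a round trip to this authority before writing, which is where the complexity and the latency cost come from.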
With VM caching, the entire application in the form of a virtual machine is extended to another location. This could be a platform running the same hypervisor or a public cloud. If the VM/instance moves to the public cloud, the caching process needs to modify the VM image to ensure it can run on a different hypervisor platform. Caching a VM can be quite efficient, depending on where in the cloud the VM is stored. First, the caching algorithm can predict the core files needed to boot the application in the cloud and pre-cache those before the VM is booted up. Second, with many virtual machines, there will be a lot of common data, so with a shared cache, VM migrations after the first can complete much more efficiently and quickly.
VM caching could be a good solution for fixing temporary capacity problems, such as the inability to make a VM bigger on-premises, or for moving workload off a local cluster that is under resource pressure.
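The benefit of common data across VMs can be illustrated with a content-addressed chunk cache. This is a sketch of one possible approach (hash-based deduplication), not a description of how any vendor implements it. The chunk contents and the class name are invented for the example.

```python
import hashlib


class SharedChunkCache:
    """Cloud-side cache keyed by content hash, shared by all migrated VMs."""

    def __init__(self):
        self.chunks = {}  # sha256 hex digest -> chunk bytes

    def migrate(self, image_chunks):
        """Store a VM image and return the bytes that had to cross the link."""
        sent = 0
        for chunk in image_chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:   # only unseen content is transferred
                self.chunks[digest] = chunk
                sent += len(chunk)
        return sent


# Two hypothetical VMs built from the same OS image with different app data.
os_chunks = [b"kernel" * 100, b"libc" * 100]
vm1 = os_chunks + [b"app-one" * 100]
vm2 = os_chunks + [b"app-two" * 100]

shared = SharedChunkCache()
first_transfer = shared.migrate(vm1)
second_transfer = shared.migrate(vm2)
print(first_transfer, second_transfer)
```

The second VM only ships its unique application chunk, because the operating system chunks are already in the shared cache.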
Storage gateways could be classed as block device caching; however, gateways are a bit more generic than that. In the block-based implementations, a caching device or appliance sits on the customer site, emulating effectively infinite amounts of storage. Inactive data is offloaded to the public cloud, while active data is protected through snapshots. As far as the local client sees, the data is all locally available. Storage gateways have a few challenges. First, the data is generally written as blocks and archived to the cloud as objects, so some data translation has to occur. Second, the backing store in the cloud can’t be kept 100% accurate with every block update, as the latency would ruin local performance. This means that in general, write data is saved to the cloud via periodic snapshots.
Some storage gateways allow data to be accessed at the cloud side of the link, whereas others are simply an offload of data that needs to be brought back on-site to be accessed. Of course, with virtual appliances, a cache could be run in the cloud, so if the gateway allows multiple access to the same data (or to snapshots of the data), then this could be one way of accessing data remotely.
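The block-to-object translation and the snapshot-based write path can be sketched as follows. This is a simplified illustration: the `StorageGateway` class, the object key layout and the dictionary standing in for a cloud bucket are all assumptions, and real gateways add compression, encryption and recovery logic.

```python
class StorageGateway:
    """Local block device that flushes changed blocks to an object store."""

    def __init__(self, object_store):
        self.object_store = object_store  # dict standing in for a cloud bucket
        self.local_blocks = {}            # logical block address -> data
        self.dirty = set()                # blocks changed since last snapshot
        self.snapshot_id = 0

    def write(self, lba, data):
        # Writes complete at local speed; nothing crosses the link here.
        self.local_blocks[lba] = data
        self.dirty.add(lba)

    def snapshot(self):
        """Upload each dirty block as one object; return the number uploaded."""
        self.snapshot_id += 1
        for lba in sorted(self.dirty):
            # Hypothetical key layout: snapshot number, then block address.
            key = f"snap-{self.snapshot_id:06d}/block-{lba:012d}"
            self.object_store[key] = self.local_blocks[lba]
        uploaded = len(self.dirty)
        self.dirty.clear()
        return uploaded


bucket = {}
gateway = StorageGateway(bucket)
gateway.write(0, b"v1")
gateway.write(0, b"v2")       # overwrites locally, still one dirty block
gateway.write(1, b"data")
uploaded = gateway.snapshot()
print(uploaded, sorted(bucket))
```

Three local writes result in two uploaded objects, because only the latest version of each changed block is sent at snapshot time. The trade-off is that the cloud copy is only as current as the last snapshot.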
With database caching, the application sees accelerated performance by holding a cache of query results locally. If the same query is performed again, returning the stored result set avoids a round trip to the database. It’s also possible to avoid retrieving all of the data from a query up front and simply maintain a cursor or pointer to the results, so the data can be accessed on demand as it is read or iterated through. Platforms like Redis already offer forms of caching. Data could also be cached remotely by creating shards or replicas, a feature offered by traditional relational databases like MySQL/MariaDB and NoSQL platforms such as MongoDB. AWS provides local database caching with ElastiCache. The whole subject of database caching is probably material for an entire series of blog posts.
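Here is a minimal sketch of query result caching, using Python's built-in `sqlite3` module as a stand-in for a remote database. The `QueryCache` class, the table and the data are invented for illustration, and the sketch deliberately ignores invalidation, which is the hard part in practice.

```python
import sqlite3


class QueryCache:
    """Keeps result sets locally, keyed by SQL text and parameters."""

    def __init__(self, conn):
        self.conn = conn
        self.results = {}
        self.db_calls = 0   # how many queries actually reached the database

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.results:
            self.db_calls += 1
            self.results[key] = self.conn.execute(sql, params).fetchall()
        return self.results[key]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (region, total) VALUES (?, ?)",
    [("EU", 10.0), ("EU", 15.5), ("US", 7.25)],
)

query_cache = QueryCache(conn)
sql = "SELECT SUM(total) FROM orders WHERE region = ?"
first = query_cache.query(sql, ("EU",))
second = query_cache.query(sql, ("EU",))   # served from the local cache
print(first, second, query_cache.db_calls)
```

If the underlying table changes, the cached result goes stale, so a real deployment needs expiry times or explicit invalidation on writes.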
For the cache to be effective, we need a number of things:
- The cache must be clever in predicting the actual data needed. Algorithms could look at previous data patterns, understand the file content and pre-cache, or simply retrieve data from the main store on demand. Obviously, if you know your data format (like VM caching), the process can be made easier.
- The cache must have sufficient local storage to manage the amount of active data. Sizing the cache is therefore important.
- The rate of data change needs to be manageable. As distance extends and latency increases, it becomes more difficult to keep the cache current and ensure updates are written back safely.
At some point, the amount of data in the cache may make the cache concept pointless, in which case the data could simply be migrated to the new location.
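A quick way to reason about these requirements is the average access time as a function of cache hit rate. The calculation below is a back-of-envelope sketch. The 0.5 ms local and 40 ms remote latencies are assumed figures chosen for illustration, not measurements.

```python
def effective_latency_ms(hit_rate, local_ms, remote_ms):
    """Average access time when hit_rate of requests are served locally."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * local_ms + (1.0 - hit_rate) * remote_ms


LOCAL_MS = 0.5    # assumed latency of the local cache
REMOTE_MS = 40.0  # assumed round trip to the remote source

for hit_rate in (0.99, 0.95, 0.80, 0.50):
    latency = effective_latency_ms(hit_rate, LOCAL_MS, REMOTE_MS)
    print(f"hit rate {hit_rate:.0%}: {latency:.3f} ms average")
```

With these assumed numbers, even a 5% miss rate makes the average access roughly five times slower than a purely local read. This is why cache sizing and prediction accuracy matter more as distance grows.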
In NAS caching, we’ve seen solutions from Avere Systems (vFXT) and Primary Data. Velostrata and CloudEndure offer VM caching from on-premises to the public cloud. AWS and Azure have storage gateway solutions – the Storage Gateway and StorSimple respectively. CTERA Networks offers solutions that could be both NAS caches and storage gateways. There are probably other solutions I’ve failed to mention, so add a comment if you know a product that should be on the list.
The Architect’s View
Caching over distance can be used for a variety of purposes. File/NAS caching provides the ability to run applications in the cloud against data onsite, without shipping it all across the network. Analytics tools like those being developed by AWS can gain access to large amounts of unstructured data without the data transfer overhead. VM caching is a great way to either move a workload to the cloud or relieve a short-term resource shortage. It will be interesting to see how VMware on AWS works in this scenario and how good long-distance vMotion will be in getting VMs efficiently into an AWS deployment.
Storage gateways are useful if you want to store lots of data in the public cloud, without needing to access it all the time. There is a range of hybrid solutions here that also almost fit this category. Think of how the secondary storage vendors (Cohesity/Rubrik) are extending their data to public cloud rather than deploying many appliances on-site. The object storage vendors are also getting into this area; however, we’ll cover them elsewhere.
Copyright (c) 2009-2018 – Post #6F2A – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.