One of the most interesting challenges of using the public cloud is getting data into cloud storage platforms so it can be used with services like analytics or e-discovery. One scenario is to use “analogue” shipping services: sending physical devices like hard drives and appliances to the cloud provider.
Shipping a Drive
Initially we saw AWS and Azure offer the ability to ship individual drives containing data, and these services still exist today. AWS Import/Export accepts drives that meet a standard set of requirements: essentially an eSATA or USB connection with a file system that can be read by Red Hat Linux. Customers prepare the drive with AWS-provided tools, and the data can be destined for EBS or S3. Azure also has an Import/Export service, which differs slightly in that only internal drives are accepted; Microsoft uses external connectors and docking stations to access the drive contents, which are loaded into Blob storage.
Naturally there is quite a lot of work required to prepare a drive for shipment. The contents need to be encrypted and the device loaded with a vendor-supplied tool to ensure it can be read at the receiving end. There are also plenty of steps involved in ensuring the drive is identified as belonging to the right customer account. Charging is pretty simple: usually a fixed cost per drive unit, plus additional charges to ship the drive back to the customer.
Compare the Network
With current drive capacities, it’s possible to ship terabytes of unstructured data on physical media. Compare this to a 1Gb/s network connection to the cloud provider, which could shift around 100MB/s (without compression or other data reduction technologies). A standard 10TB drive would take around 30 hours to upload at that rate, and with 100TB of data you’re looking at eleven or twelve days to ship the data in. For IT shops that don’t have or can’t afford that level of networking, physical shipment looks like a good deal.
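These estimates are easy to get wrong by a factor of eight (bits versus bytes), so here is a minimal sketch of the arithmetic, using the ~100MB/s effective throughput of a 1Gb/s link assumed above:

```python
# Rough upload-time estimates for moving bulk data over a dedicated link.
# Assumes a 1Gb/s connection sustains ~100MB/s of effective throughput,
# with no compression or other data reduction (as in the text).

def upload_days(data_tb: float, throughput_mb_s: float) -> float:
    """Days needed to push `data_tb` terabytes at `throughput_mb_s` MB/s."""
    seconds = (data_tb * 1e12) / (throughput_mb_s * 1e6)
    return seconds / 86_400

print(f"10TB at 100MB/s:  {upload_days(10, 100) * 24:.0f} hours")  # ~28 hours
print(f"100TB at 100MB/s: {upload_days(100, 100):.1f} days")       # ~11.6 days
```

Real-world figures will be somewhat worse once protocol overhead and contention with other traffic are factored in.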
What happens if you have more than tens of terabytes (or are even at petabyte scale)? Both Google Cloud Platform and Amazon Web Services now have appliances they will ship to you: basically hardened servers capable of taking far greater storage capacities than a single hard drive. Google’s Transfer Appliance, launched this week, comes in 100TB and 480TB (raw) capacities. The appliance is a rack-mount server stuffed with drives that can hold up to a petabyte of data (assuming 2:1 compression/dedupe). Google allows a maximum of 25 days to fill the appliance (before additional charges kick in), with a subsequent upload window of up to 25 days at the receiving end. A quick calculation shows that moving a petabyte within that window needs a sustained rate of close to 4Gb/s, so the 2x 10Gb/s networking on the appliance has to be matched by the customer’s own infrastructure (more on that in a moment).
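As a sanity check on those figures, a short sketch (assuming the full 1PB effective capacity and the 25-day fill window) shows the minimum sustained rate a customer’s network would need to deliver:

```python
# Minimum sustained network rate needed to fill a transfer appliance
# within its window. Capacity and window figures are those quoted for
# the Google Transfer Appliance (1PB effective, 25 days).

def required_gbps(capacity_pb: float, window_days: float) -> float:
    """Sustained rate in Gb/s to move `capacity_pb` PB in `window_days` days."""
    bits = capacity_pb * 1e15 * 8
    seconds = window_days * 86_400
    return bits / seconds / 1e9

print(f"Sustained rate needed: {required_gbps(1, 25):.1f} Gb/s")  # ~3.7 Gb/s
```

That is comfortably beyond a single 1GbE link, which is why 10Gb/s-class networking is effectively a prerequisite.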
AWS now has three solutions: Snowball, Snowball Edge and Snowmobile. Snowball is a suitcase-sized appliance that comes in 50TB and 80TB (raw) capacities (42TB/72TB usable). Network support is 10GbE, so potentially a device can be filled within 24 hours. Externally, Snowball displays its shipping details on an E Ink label. It’s a self-contained device rather than a rack-mount server, so it can be placed anywhere within the data centre (subject to power/networking connections). Snowball Edge is a standard Snowball with additional compute capacity: customers can run Lambda code against their data, doing pre-processing before shipping to AWS.
If you have petabyte-scale storage requirements, then AWS offers the Snowmobile, which is literally a truck full of storage capacity. Snowmobile is recommended for customers with 10PB of capacity or more and can hold a maximum of 100PB. Rather than simply plugging into the network, Snowmobile comes with a rack of equipment for managing data transfer, with up to 1Tb/s of bandwidth provided through multiple 40Gb/s network connections. This is a serious piece of hardware that needs 350kW of local power, so it’s not for the faint-hearted.
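Even at that bandwidth, filling the truck is a lengthy job. A quick sketch using the quoted maximums (100PB capacity, 1Tb/s aggregate bandwidth):

```python
# How long does it take to fill a Snowmobile-class device, assuming the
# customer can actually drive the full quoted bandwidth end to end?

def fill_days(capacity_pb: float, rate_tbps: float) -> float:
    """Days to fill `capacity_pb` petabytes at `rate_tbps` terabits/second."""
    bits = capacity_pb * 1e15 * 8
    return bits / (rate_tbps * 1e12) / 86_400

print(f"100PB at 1Tb/s: {fill_days(100, 1):.1f} days")  # ~9.3 days
```

And that is the best case; a source environment that can only sustain a fraction of 1Tb/s stretches the load phase into weeks.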
Shipping data around introduces some immediate challenges. Excluding the most obvious one, security (which is solved by encryption), the two main problems are data concurrency and physical transfer. By concurrency we mean the ability to keep track of updates to the data while it is being shipped and uploaded to the cloud provider. We would expect effectively 100% of the content transferred through physical media to be unstructured files and objects, so much of the content may not change. However, load/unload and shipping times may run into weeks or months – do the calculation on how long it would take to fill a Snowmobile – so data concurrency is an issue.
Choices have to be made about how data is processed during the transfer phase. It may be a case of saying that analysis remains onsite until the transfer is complete, but that gets complicated if data changes rapidly or is being added to every day. Looking at AWS uploads, there is a big issue here that could cause customers problems. In the fine print on the limits of using Snowball, we see the following:
All objects transferred to the Snowball have their metadata changed. The only metadata that remains the same is filesize. All other metadata is set as in the following example:
-rw-rw-r-- 1 root root [filesize] Dec 31 1969 [path/filename]
Ouch! All my metadata goes, and I can’t track file/object status by date/time or ownership. This may represent a big problem for customers trying to keep their on-premises and off-premises copies in sync.
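One possible mitigation is to capture the metadata yourself before loading the device, so it can be audited or re-applied after upload. The sketch below is exactly that, a sketch: `write_manifest` is a hypothetical helper, not part of any AWS or vendor tool.

```python
# Hypothetical pre-shipment step: walk the source tree and record the
# attributes Snowball discards (mtime, uid/gid, mode) in a CSV manifest,
# keyed by relative path, so they can be reconciled after the upload.
import csv
import os

def write_manifest(root: str, manifest_path: str) -> int:
    """Record path, size, mtime, uid, gid and mode for every file under root."""
    count = 0
    with open(manifest_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "size", "mtime", "uid", "gid", "mode"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                st = os.stat(full)
                writer.writerow([os.path.relpath(full, root), st.st_size,
                                 int(st.st_mtime), st.st_uid, st.st_gid,
                                 oct(st.st_mode)])
                count += 1
    return count
```

Keeping the manifest outside the shipped file set (or uploading it over the network) gives you a point-in-time record to diff against once the data lands in the cloud.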
The second challenge is actually getting data onto the transfer device. Vendors offer tools for data transfer, but these need to run on relatively high-end machines to be effective. For example, the Snowball transfer software has a recommended minimum of 16GB of RAM and a 16-core processor, with 7GB of RAM required for each data transfer stream; the high RAM requirement is due to the in-flight encryption process. Servers are cheap today, but some planning is needed to get data onto the transfer devices as quickly as possible, including having enough local bandwidth to copy data without impacting local services.
One final thought: when calculating the difference between network and offline shipping, remember that data movement occurs twice (onto and off the transfer device), plus shipping time, plus preparation time. So the break-even point where offline shipping becomes more practical could actually be further away than customers think.
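To illustrate, here is a small model of that break-even point. The device copy speed, preparation and shipping times are purely illustrative assumptions, not vendor figures:

```python
# Break-even sketch: offline shipping only wins when its end-to-end time
# (prepare, copy on, ship, copy off at the provider) beats a straight
# network upload. All timings below are illustrative assumptions.

def network_days(data_tb: float, throughput_mb_s: float) -> float:
    return (data_tb * 1e12) / (throughput_mb_s * 1e6) / 86_400

def offline_days(data_tb: float, device_mb_s: float,
                 prep_days: float, ship_days: float) -> float:
    # Data crosses a wire twice: onto the device, then off it at the far end.
    copy = 2 * (data_tb * 1e12) / (device_mb_s * 1e6) / 86_400
    return prep_days + copy + ship_days

for tb in (10, 50, 100, 500):
    net = network_days(tb, 100)        # 1Gb/s link, ~100MB/s effective
    off = offline_days(tb, 400, 2, 5)  # assume 400MB/s copies, 2d prep, 5d shipping
    print(f"{tb:>4}TB  network {net:5.1f}d  offline {off:5.1f}d")
```

With these assumptions the crossover sits well past 100TB: at 100TB the network upload is still marginally quicker, and only at several hundred terabytes does the truck clearly win.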
The Architect’s View
Sometimes physical data transfer is the only solution for shipping large amounts of content. Unfortunately, today’s offline shipping process means data concurrency has to be managed manually, with significant additional work for the application owner or storage administrator. If metadata could be kept in sync, then some of the problems we’ve described could be avoided; for object storage, this is one of the issues Zenko (see previous post) is trying to address. Having a single view of data is really important, but so is managing the physical shipping. There’s certainly still some work to do here, but cloud vendor offerings are at least doing a good job of managing the process of acquiring content.
- What is AWS Import/Export? (AWS website, retrieved 21 July 2017)
- AWS Snowball (AWS website, retrieved 21 July 2017)
- AWS Snowball Edge (AWS website, retrieved 21 July 2017)
- AWS Snowmobile (AWS website, retrieved 21 July 2017)
- Use the Microsoft Azure Import/Export service to transfer data to blob storage (Microsoft website, retrieved 21 July 2017)
- Google Transfer Appliance (Google website, retrieved 21 July 2017)
- AWS Snowmobile Introduction (YouTube, retrieved 21 July 2017)
Copyright (c) 2009-2017 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.