Over the past 12-18 months, I’ve done a lot of work looking at options for getting existing data into the public cloud.  You may have noticed that I’ve talked a lot about the subject over this period.  There are links at the end of this post to that content.  However, I thought it might be interesting to share some of the analysis I’ve done, including where vendors fit into this.

Genres

I’ve divided the technology landscape into a number of genres.  The links on each genre title provide another blog post with more detail.

  • Caching Technologies – solutions that cache data from one location to another, without having to ship the entire data set.
  • Global/Scale-out data – solutions that distribute data across locations, such as global NAS or distributed block storage solutions.
  • Database Centric Technologies – this includes ETL, clustering and replication technologies.
  • Data Protection – solutions that use point-in-time data protection as a migration tool.
  • Data Migration – solutions specifically designed for data migration, including those referenced in existing posts.
  • Data Consolidation – this category describes solutions such as Copy Data Management that aggregate multiple secondary use cases.

We could argue about the specific definitions here, but the idea is to provide a classification that helps end users understand the options that exist.  As we look at each in more detail, the genres can be compared on a set of characteristics that help with selection.

Data Concurrency

By this, I mean how current the copies, replicas or versions are compared to each other.  Caching solutions, for example, should be real-time accurate, whereas Data Protection solutions create point-in-time copies that are potentially out of date immediately.  Concurrency becomes important if you want to ensure your data is 100% accurate between locations.  Other scenarios, like feeding data to test/dev environments or running batch from live data, can be done from copies.  Another consideration for concurrency is the refresh rate of data.  Taking copies hourly or daily may suffice, but inevitably with point-in-time solutions, the currency of the data decays over time.  There are many other concurrency issues to discuss that I’ll cover in specific posts.
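
To put a number on how currency decays, here is a minimal Python sketch.  It assumes a copy reflects the data as it was when the copy started, and that a copy only becomes usable once it has finished transferring.  The function names are my own for illustration and don't come from any product.

```python
from datetime import datetime, timedelta


def staleness(copy_taken_at: datetime, now: datetime) -> timedelta:
    """Age of a point-in-time copy relative to the live data."""
    return now - copy_taken_at


def worst_case_staleness(refresh_interval: timedelta,
                         copy_duration: timedelta) -> timedelta:
    """Oldest the newest usable copy can get.

    Assumption: a copy started at time t is usable at t + copy_duration,
    and the next copy starts at t + refresh_interval.  Just before that
    next copy becomes usable, the current one is this old.
    """
    return refresh_interval + copy_duration
```

With an hourly refresh and a ten-minute transfer, `worst_case_staleness(timedelta(hours=1), timedelta(minutes=10))` gives 70 minutes, which is the figure to compare against what the target use case can tolerate.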

Performance

How will mobility impact data performance?  Caching is specifically designed to address performance challenges, and data protection solutions don’t impact ongoing I/O once a copy is taken.  However, global/scale-out solutions that aim to keep all copies in sync will have to contend with latency over distance.  Bear in mind also that caching solutions can take time to “warm up” on new data sets, so the efficiency of caching algorithms will have a direct impact on how quickly these solutions can get up to speed.
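
The warm-up effect is easy to see in a toy model.  The sketch below is a plain LRU cache in Python that counts hits and misses per pass over a data set.  LRU is simply the easiest policy to show and I'm not suggesting any particular product uses it.  `LRUCache`, `hit_ratio_per_pass` and the `fetch_remote` callback are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache that records hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key, fetch_remote):
        if key in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            self.hits += 1
            return self._items[key]
        # Cache miss: go to the remote location, then keep a local copy.
        self.misses += 1
        value = fetch_remote(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return value


def hit_ratio_per_pass(cache, keys, passes, fetch_remote):
    """Read the same keys repeatedly and return the hit ratio of each pass."""
    ratios = []
    for _ in range(passes):
        hits_before, misses_before = cache.hits, cache.misses
        for key in keys:
            cache.read(key, fetch_remote)
        hits = cache.hits - hits_before
        total = hits + (cache.misses - misses_before)
        ratios.append(hits / total if total else 0.0)
    return ratios
```

Reading 50 blocks three times through `LRUCache(100)` returns `[0.0, 1.0, 1.0]`: every read in the first pass goes to the remote site, and only later passes see the benefit.  Shrink the capacity below the working set and the cache never warms up at all, which is why the algorithm and sizing matter.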

Traffic Optimisation

Data moving between multiple clouds and/or on-premises needs to be optimised.  Caching solutions are naturally good here.  Data protection, migration and consolidation solutions should be using incremental data, rather than replicating the entire dataset each time.  This is especially key for application and data migrations using VMs, where the base O/S will barely change between copies.
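
As a rough illustration of shipping incremental data rather than the whole dataset, the Python sketch below hashes fixed-size blocks and returns only those that differ from the previous copy.  The 4 KiB block size and the function names are assumptions for the example.  Real products typically track changed blocks as writes happen rather than rescanning and rehashing everything.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, purely for illustration


def block_hashes(data, block_size=BLOCK_SIZE):
    """SHA-256 digest of each fixed-size block in the data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous_hashes, data, block_size=BLOCK_SIZE):
    """Return {block_index: block_bytes} for blocks that are new or differ.

    Note: this sketch does not handle the data shrinking between copies.
    """
    changes = {}
    for index, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            changes[index] = block
    return changes
```

For a VM image where only a handful of blocks have changed since the last copy, the result contains just those blocks, so the base O/S never crosses the wire a second time.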

Cost Optimisation

Obviously cost can encompass both network and storage.  Keeping multiple copies of a dataset across multiple clouds could be expensive.  Technologies like de-duplication and caching are a good answer here.  Of course, this means having good metadata and even abstracting the logical view of data from the physical.
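
To show what abstracting the logical view from the physical can look like, here is a minimal content-addressed store in Python.  Each dataset is a manifest (the metadata) listing chunk digests, and identical chunks are stored only once.  `DedupStore` and its methods are hypothetical, and the fixed-size chunking is an assumption.  This is a sketch of the idea rather than how any given vendor implements de-duplication.

```python
import hashlib


class DedupStore:
    """Toy de-duplicating store: logical manifests over shared physical chunks."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}     # digest -> chunk bytes (physical data)
        self.manifests = {}  # dataset name -> list of digests (logical view)

    def put(self, name, data):
        digests = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this content has not been seen before.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.manifests[name] = digests

    def get(self, name):
        """Rebuild a dataset from its manifest."""
        return b"".join(self.chunks[d] for d in self.manifests[name])

    def logical_bytes(self):
        """Total size of all datasets as the user sees them."""
        return sum(len(self.chunks[d])
                   for manifest in self.manifests.values()
                   for d in manifest)

    def physical_bytes(self):
        """Bytes actually stored after de-duplication."""
        return sum(len(chunk) for chunk in self.chunks.values())
```

The gap between `logical_bytes()` and `physical_bytes()` is the storage saved.  The same manifests also tell you which chunks a remote location already holds, which is where the network saving comes from.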

Security

How will you secure your data across multiple locations?  Implementing a security policy across both public and private cloud can be a challenge.  Having a global namespace is one important consideration.  With all of the data under the same umbrella, implementing a security policy becomes easier.  What about data in VM images or replicas of storage LUNs/volumes?  What about file systems?  Depending on the data type, it makes sense to look at implementing Active Directory (AD) or another security model that integrates with existing on-prem implementations.

Application Awareness

Naturally, the database replication and ETL type solutions will be data aware.  Some replication technologies work at the VM level, which means a degree of application awareness, depending on the implementation.  Being able to work with the application in some way is important to data mobility.  First, data can be kept consistent (quiesced copies are preferable to crash-consistent ones).  Second, caching can be more intelligent if the data is better understood.

Automation

Manually moving data around is a pain in the proverbial.  It’s certainly not a scalable option.  Automation has to play a big part.  That means being able to drive storage and application migration through APIs and CLIs.  If you can’t script it, you can’t effectively use it.  I’d add one more comment here, and that’s an observation on how orchestration platforms will handle some of these requirements.  We’ve started to see some storage plugins being developed, but what happens when you want to move between geographic locations?  This is still a relatively unsolved issue.

Use Cases

Why might you want your data in the cloud?  Hybrid doesn’t have to mean data and applications being spread across clouds at the same time.  It could mean:

  • Application migration – permanent migration of applications to the public cloud or between them.
  • Cloud bursting – short-term expansion to the public cloud.
  • Seeding – moving data to the cloud for seeding test/dev.
  • Analytics – moving data to the cloud to take advantage of analytics or ML-type applications.
  • Data protection – keeping a backup of data or an application for recovery on/off-cloud.

The Architect’s View

In future posts, I’ll dig into the genres in more detail.  This is where things will get interesting, as we look at the strengths and weaknesses of different solutions and talk about some of the vendors in the market today.

Further Reading

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2018 – Post #6D50 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.

 

Written by Chris Evans

With 30+ years in IT, Chris has worked on everything from mainframe to open platforms, Windows and more. During that time, he has focused on storage, developed software and even co-founded a music company in the late 1990s. These days it's all about analysis, advice and consultancy.