Edit: Chris Mellor at The Register has some additional comments from vendors, including Scale Computing (link).  Further Register emails (link) point to significant workload slowdowns.  And here’s another post regarding the need (or not) to patch storage appliances.

Unless you’ve been 100% disconnected from technology (and the Internet) over the last few days, it’s been impossible to avoid the discussion about new vulnerabilities discovered in Intel and other processors.  Two new exposures, dubbed Spectre and Meltdown, identify issues with speculative execution of code that allow leaking of, or access to, sensitive user data.  The Spectre exploits expose data through a number of branch execution issues, whereas Meltdown provides unauthorised access to kernel memory from user space.

So far, the industry has responded quickly (although the threats were identified in the middle of last year).  Patches are available for popular operating systems and the major cloud providers have already started patching the machines supporting their cloud infrastructure.  However, with Meltdown in particular, there is a workload-dependent performance impact that could be anywhere between 5% and 50%, especially for storage-intensive workloads.

KAISER

Looking a bit deeper into the Meltdown vulnerability, the ability to access kernel-mode memory is being mitigated using a patch called KAISER.  This implements stronger isolation between kernel and user-space memory address spaces and has been shown to stop the Meltdown attack.  KAISER was already in development for other reasons, which I guess is why we have seen such a quick rollout of fixes for Linux, Windows and macOS.  Patching against Meltdown has resulted in performance degradation and increased resource usage, as reported for public cloud-based workloads.
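
For those wanting to verify whether the KAISER-style isolation (exposed in Linux as kernel page-table isolation, or PTI) is actually active on a given host, here is a minimal sketch.  It assumes a Linux kernel recent enough to expose the /sys/devices/system/cpu/vulnerabilities/ directory (4.15, or a patched distribution kernel); anything older will simply report an unknown status.

```python
#!/usr/bin/env python3
"""Quick check of Meltdown/Spectre mitigation status on Linux.

Sketch only: assumes a kernel that exposes the
/sys/devices/system/cpu/vulnerabilities/ sysfs directory.
"""
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(name: str) -> str:
    """Return the kernel-reported status string for one vulnerability."""
    entry = VULN_DIR / name
    try:
        # Typical values: "Mitigation: PTI", "Vulnerable", "Not affected"
        return entry.read_text().strip()
    except OSError:
        return "status unknown (sysfs entry not present)"


if __name__ == "__main__":
    for vuln in ("meltdown", "spectre_v1", "spectre_v2"):
        print(f"{vuln:12s}: {mitigation_status(vuln)}")
```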

Storage

Presumably, the overhead for I/O is due to the context switching that occurs when reading and writing data to an external device.  I/O gets processed by the O/S kernel, and the extra work involved in isolating kernel memory introduces an extra burden on each I/O.  I expect both traditional (SAS/SATA) and NVMe drives would be affected, because all of these protocols are managed by the kernel.  However, I wonder (pure speculation) whether there’s a difference between SAS/SATA and NVMe, simply because NVMe is more efficient.
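
A crude way to see this per-I/O kernel-transition cost for yourself is to time a tight loop of small read syscalls against a cached file, so the device itself is largely out of the picture, and run the same test before and after patching.  The following is an illustrative sketch only; the test file path and iteration count are arbitrary placeholders, and the numbers it produces are no substitute for a proper storage benchmark.

```python
#!/usr/bin/env python3
"""Very rough micro-benchmark of per-I/O syscall cost (Linux/Unix).

Reads a small, cached block repeatedly so the timing is dominated by the
user/kernel transition rather than the device.  Comparing results on the
same host before and after the KAISER/PTI patch gives a feel for the
extra per-I/O overhead.
"""
import os
import time

TEST_FILE = "/tmp/io_overhead_test.bin"   # placeholder test file
BLOCK_SIZE = 4096
ITERATIONS = 200_000


def setup() -> int:
    """Create a small test file and return a read-only file descriptor."""
    with open(TEST_FILE, "wb") as f:
        f.write(os.urandom(BLOCK_SIZE))
    return os.open(TEST_FILE, os.O_RDONLY)


def time_reads(fd: int) -> float:
    """Average latency of one pread() call, in microseconds."""
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        os.pread(fd, BLOCK_SIZE, 0)        # one syscall per iteration
    elapsed = time.perf_counter() - start
    return (elapsed / ITERATIONS) * 1e6


if __name__ == "__main__":
    fd = setup()
    try:
        print(f"average pread() latency: {time_reads(fd):.2f} µs")
    finally:
        os.close(fd)
        os.remove(TEST_FILE)
```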

The additional work being performed with the KAISER patch appears to be introducing extra CPU load in the feedback reported so far.  This means it must also affect latency.  Bearing in mind almost the entire storage industry uses x86 these days, what will be the impact on the (hundreds of) thousands of storage arrays deployed in the field, plus software-defined solutions?

Traditional Arrays

The impact on traditional storage is two-fold: first, extra system load; second, potentially higher latency for application I/O.  Customers implementing this patch need to know whether the increased array CPU levels will have an impact on their systems.  A very busy array could have serious problems.  The second issue of latency is more concerning.  That’s because, like most performance-related problems, quantifying the impact is really hard.  The mixed workload profiles that exist on today’s shared arrays mean that predicting the effect of a code change is difficult.  Hopefully, storage vendors are going to be up-front here and provide customers with some benchmark figures before they apply any patches.

SDS

Then there’s the issue of how Meltdown affects SDS-based implementations.  The same obvious questions of latency and performance exist.  However, there’s another concern around solutions that use containers to deliver storage resources.  Meltdown has been shown specifically to impact container security, enabling one container to read the contents of another.  If storage is being delivered with containers, how is the data being protected in this instance?  What protection is there to ensure a rogue container doesn’t get access to all of the data containers on a host?

Hyper-converged

Extending the discussion further, there seem to be some specific issues for hyper-converged solutions and storage.  Hyper-convergence distributes the storage workload across all hosts in a scale-out architecture.  Implementing patches for Meltdown could increase the storage component’s overhead by up to 50%.  If storage uses 25% of each host’s processor, then the impact (for example) could be an extra 12.5 percentage points of CPU utilisation.  This could put some deployments under stress and will certainly affect future capacity planning.
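
To make that back-of-envelope arithmetic explicit, here is a tiny sketch.  It simply assumes the storage component’s CPU cost rises by a flat percentage (the worst-case ~50% figure reported for I/O-heavy workloads); real numbers will be entirely workload-dependent.

```python
#!/usr/bin/env python3
"""Back-of-envelope sizing for the hyper-converged example above."""


def extra_host_utilisation(storage_share: float, patch_overhead: float) -> float:
    """Extra whole-host CPU utilisation, as a fraction of total CPU."""
    return storage_share * patch_overhead


if __name__ == "__main__":
    storage_share = 0.25    # storage stack uses 25% of each host's CPU
    patch_overhead = 0.50   # assumed 50% increase in storage CPU cost post-patch
    extra = extra_host_utilisation(storage_share, patch_overhead)
    # 0.25 * 0.50 = 0.125 -> an extra 12.5 percentage points per host
    print(f"extra host CPU utilisation: {extra * 100:.1f} percentage points")
```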

Vendor Responses

A quick check across vendor websites shows few statements on the impact of either Meltdown or Spectre on storage products.  The only communication I’ve received has been from Storpool, which indicates the company is still investigating the impact of the bugs and the recommended patching.  Of course, storage vendors may be writing to their customers directly, in which case I wouldn’t see it.  However, a public statement would be good to see.  For hyper-converged, I’ve found feedback from Nutanix (via this link), but there’s no mention of the impact on performance.  Here’s what I’ve located so far.

The Architect’s View

Meltdown and Spectre could be seen as “once in a generation” flaws that are actually very hard to exploit.  However, as we see more transparency in the hybrid cloud age, it’s unlikely we’ve seen the end of big issues like this.  It would be good for the storage vendors to put a stake in the ground and say what they are doing to mitigate the impact of Meltdown/Spectre.  The public cloud providers have been quick to do so, although their focus has been more about getting patched than about the impact on application performance.

Further Reading

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2018 – Post #64B0 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.


Written by Chris Evans

  • MTB moose

    So, I get why it would be necessary to patch HCI and SDS systems, but what’s driving the need to patch array-based storage running on closed x86 platforms? What’s the attack surface if the only interaction with the array is through storage protocols?

    • Good question. However, how many storage arrays are now purely accessed through storage protocols? Even today’s VMAX has a network connection (DMX/Symmetrix didn’t, of course). If your storage platform offers a network connection for management and can be managed via a web browser, then that could be compromised.

      If hackers are clever enough to exploit CPU timing issues, then I’m sure a web server represents no problem. So if you are a storage vendor claiming your platform is not affected, can you be 100% certain your customers’ data can’t be compromised because your CLI or web interface is 100% bulletproof?

      I would prefer to have the choice of deploying patches, rather than the storage vendor insulting my intelligence. I wouldn’t want to be that one-in-a-million scenario where the hacker got through and deleted 500TB of my core corporate assets…. Just sayin… 🙂

      • equals42

        Nonsense, your scenario isn’t even related to Spectre. Your scenario is that the on-board management web server is compromised and then used to run a Meltdown/Spectre attack? If you’re executing code on the array, you’ve gained access through a different vector than Meltdown/Spectre and there are more pressing concerns. These new attacks are mainly based on the ability of regular applications to “see” data outside their VM/container/memory allocation. Arrays that do not allow arbitrary code execution on board are not directly susceptible to these new attacks, as nearly all the vendors have stated in public forums and written statements.

        HCI and SDS are a different story since they are on the same CPU as other user applications.

        • I’m not sure I see what you are saying. The onboard management is a server running a generic O/S. If there is any way to run user code on that server, then there is an exposure from Meltdown/Spectre. The management server doesn’t have to be compromised to the point of opening root access.

          Ultimately, do you really believe that storage appliances/arrays are 100% safe because they are “self contained”?

          • equals42

            Because you cannot run user code binaries on a web server by simply making HTTP calls or using their interfaces. Gaining access to run your own binaries through a web server would be a completely separate attack vector and unrelated to Spectre and Meltdown.

          • So, we already know and accept that web servers have been compromised in the past, but those exploits need elevated privileges to do anything useful. Assuming that non-privileged exploits exist (which they may do), they would normally be useless in an attack. However, if we can now run non-privileged code through a web server exploit, then we have a vulnerability. TBH the effort of using Spectre/Meltdown as an exploit is high. So, it wouldn’t be hard to expect a non-privileged web server exploit to be possible.

          • equals42

            Again, that’s a different attack vector than Spectre/Meltdown (S/M). If you’ve exploited a web server, that’s not the same as using S/M to intuit the cache contents outside of your normal app’s sphere of access. That’s the big S/M deal here: using a user app to steal information without having to break out of containment, not using it as a tertiary hacking tool after you’ve already pwned the webserver.
            I don’t believe folks should be casting aspersions where they don’t need to be. Serious security professionals are working to address the real threats from S/M and need not worry about unfounded assertions. The major array vendors mentioned (EMC, HDS, HP, NetApp) have all posted advisory notices to detail this. They all have very good talent looking at this and a bevy of lawyers covering their tails.
            As far as storage goes, this is primarily an issue with shared-core HCI and SDS. Not a fatal flaw but a consideration.