I recently had a discussion with a vendor (who shall remain nameless) about whether we really need Quality of Service in shared storage arrays.  His thinking went as follows: if we have a storage array and network with sufficient bandwidth/IOPS, then why bother implementing QoS?  At first this seems like a reasonable assumption: if I have more resources than required, I can cater for all requirements, so what’s the problem?  To work out whether this makes sense, let’s step back and look at how persistent storage has been delivered over the last 15-20 years.

The Problem

Persistent storage has always been the bottleneck in computing because I/O to disk and tape is so much slower than operations in the processor and memory.  The differences are huge, with storage being three or four orders of magnitude slower than data moving around in the processor and memory (milliseconds compared to nanoseconds and microseconds).  There was a good reason Gene Amdahl said “the best I/O is the one you don’t have to do”.  External I/O slows things down.  Because of this, storage has always worked to deliver I/O requests as fast as possible.  It’s the “McDonald’s” method of food delivery, as opposed to booking into a restaurant where the time slots are allocated in advance.

To extend the analogy further, with McDonald’s, customers are served pretty much in order, even if their selection isn’t immediately available.  Choose the wrong queue and you could be behind someone who is indecisive, is ordering for a coach-load of people or simply has a slow server.  There’s no prioritisation or special treatment, so delivery time is unpredictable.  More bandwidth is provided by adding more servers (which only scales so far).  Restaurants, by comparison, book time slots to ensure that the chefs can deliver the food in a timely order.  The cooking is spread out (hopefully) evenly across the evening to provide a more consistent experience.  Slots are limited, curated and managed.  Turn up without a booking and you will be turned away.  Restaurants “scale up” by adding more covers (seats & tables) and matching this with more staff.

When storage arrays were built from hard disk drives (HDDs), I/O response was unpredictable and highly variable, depending on the workload profile of the I/O requests.  Vendors used techniques like caching, pre-fetch, queue re-ordering and destaging to mitigate the peaks and streamline the I/O.  Some vendors implemented prioritisation techniques that were not QoS but were aimed at getting as much back-end I/O completed as possible.  With flash, these problems have been less apparent, as SSDs provide higher throughput and much lower latency than HDDs, even with random workloads (subject to managing issues like garbage collection).  I/O to hosts is more predictable and consistent but still occurs over shared components like front-end ports, internal software queues, back-end controllers and shared SSDs.

Noisy Neighbour

Because of this shared nature, it’s possible to experience the ‘noisy neighbour’ problem, where one host monopolises the I/O traffic to the detriment of others.  Even with SSDs on the back end, front-end queues in hardware (like FC HBAs) and software queues (like those updating metadata) will still see contention and potentially some delay.  QoS allows that contention to be controlled and SSDs allow the I/O to those hosts to be delivered consistently.
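
To make the idea of controlling contention concrete, here is a minimal sketch of one common throttling technique, a token bucket applied per host.  This is an illustration only and not how any particular array implements QoS; the class name, the function name and the IOPS figures are all assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical per-host IOPS limiter (illustrative, not a vendor API)."""
    rate_iops: float          # tokens (I/Os) earned per second
    burst: float              # maximum tokens the bucket can hold
    tokens: float = 0.0       # bucket starts empty
    last_refill: float = 0.0  # time of the previous refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if one I/O may be issued at time `now` (seconds)."""
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_iops)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the I/O would be queued or delayed


def admitted_in_one_second(bucket: TokenBucket, offered_iops: int) -> int:
    """Offer evenly spaced requests for one second and count admissions."""
    admitted = 0
    for i in range(offered_iops):
        now = (i + 1) / offered_iops
        if bucket.allow(now):
            admitted += 1
    return admitted


# Assumed figures: both hosts are capped at 1,000 IOPS.
noisy = admitted_in_one_second(TokenBucket(rate_iops=1000, burst=10), 5000)
quiet = admitted_in_one_second(TokenBucket(rate_iops=1000, burst=10), 200)
print(f"noisy host admitted {noisy} of 5000, quiet host admitted {quiet} of 200")
```

In this sketch the noisy host offering 5,000 IOPS is held to roughly its 1,000 IOPS limit, while the quiet host offering 200 IOPS passes through untouched, which is the behaviour we want from a noisy neighbour control.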

So QoS does have a place, even with a system that appears to have plenty of I/O capacity, if for no other reason than to ensure the I/O capability is shared evenly between all the systems.  QoS becomes even more useful when there is contention for other resources such as SSDs, processors or system memory.  Prioritisation can be used to determine which workloads are throttled first, protecting the mission-critical systems.  Finally, we should remember that QoS also allows cloud-based deployments to ensure that customers (internal or external) only get the resources they pay for (politely known as a “consistent experience”).
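
As a rough illustration of prioritisation deciding which workloads are throttled first, the sketch below hands out a fixed IOPS budget in strict priority order.  The function name, the workload names and the numbers are hypothetical, and strict priority is only one possible policy; real systems may combine it with minimum guarantees or maximum limits.

```python
def allocate_iops(capacity, workloads):
    """Grant IOPS in priority order (illustrative sketch only).

    workloads: list of (name, priority, demand_iops) tuples, where a
    lower priority number means a more critical workload.
    Returns a dict mapping each workload name to its granted IOPS.
    """
    remaining = capacity
    grants = {}
    for name, _priority, demand in sorted(workloads, key=lambda w: w[1]):
        grant = min(demand, remaining)  # never grant more than is left
        grants[name] = grant
        remaining -= grant
    return grants


# Assumed example: 14,000 IOPS of demand against 10,000 IOPS of capacity.
demo = allocate_iops(
    10_000,
    [("batch", 3, 6_000), ("oltp", 1, 5_000), ("web", 2, 3_000)],
)
print(demo)
```

With these assumed figures, the two higher-priority workloads receive their full demand and the lowest-priority batch workload absorbs the whole shortfall, so it is the one throttled first.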

This final point is quite important.  We are moving to a model that delivers IT as a service for all components, not just storage.  Today SSD is the fastest medium widely adopted in the industry; tomorrow it could be NVDIMM or 3D XPoint.  Without some service-based controls, IT organisations will find it difficult to introduce new technology without affecting the user experience.  Separating the service definition from the underlying technology allows that technology to be delivered in the most efficient way possible.


Copyright (c) 2009-2015 – Chris M Evans, first published on http://blog.architecting.it, do not reproduce without permission.


Written by Chris Evans

  • Chris, I have never used QoS on Fibre Channel storage.  I have always believed that QoS is required when using iSCSI and FCoE connected storage, but what you say makes sense.  Since most of the SANs that I managed at Intel were standalone, and the other companies I worked at only had smaller HP MSA 1000 & 1500 SANs, this was not much of an issue.  The last version of FOS that I implemented at Intel was version 5, before QoS was implemented in FOS v6.

    • Jose, I know that in systems I built, we had a tendency to over-configure; very few hosts pulled the 4/8Gb/s their cards enabled. Your iSCSI and FCoE comment is interesting though; I can see how QoS would be more useful there.

      Regards
      Chris

  • Steve C

    With storage, so many people pick some arbitrary resource (a network link comes to mind), or an arbitrary queue, prioritize that, and call it QoS.

    A multi application (or even multi tenant) environment is a very, very complex mesh of queues, and all it takes is one of those queues being flooded by some app (or OS) in an uncontrolled way…

    And yes, I had a hand in a storage QoS product which came to market 7 or 8 years ago. Got “invited” to help with FCoE, so wasn’t there as the QoS one faded.

    Will be interesting to see how this all plays out in hyperconverged environments over time. More resources to conflict there, harder to get control of all the moving parts.

    Fun topic!

    @FStevenChalmers
    (works at HPE, speaking for self)