Enterprise Computing: Barclays Bank Services Down Due to Storage Array Problems

It’s been reported in a few places that yesterday Barclays (UK bank) suffered an issue with a “disc array” (presumably they mean disk array) that took out their ATM and online banking systems.  See the comments here and here.

Allegedly, Barclays now use USP-V arrays as their back-end storage devices, so presumably HDS USP-Vs were involved in yesterday’s problems.  Systems appear to have been down for a number of hours before normal service was resumed.

The first thing to say is that “stuff” happens.  Hardware fails; arrays fail, and it’s the same for all vendors.  No vendor can ever claim that their hardware doesn’t fail once in a while.  We all know that RAID is not infallible; in fact, it isn’t even necessary to have a hardware failure to experience a service outage, as many problems are caused by human error.

What surprises me with this story is the time Barclays appeared to take to recover from the original incident.  If a storage array is supporting a number of critical applications including online banking and ATMs, then surely a high degree of resilience has been built in that caters for more than just simple hardware failures?  Surely the data and servers supporting ATMs and the web are replicated (in real time) with automated clustered failover or similar technology?
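
To make the automated failover point concrete, below is a minimal sketch (in Python) of the kind of decision loop that clustered failover or array-mirroring products wrap in far more safeguards.  Everything in it is hypothetical: the site names, thresholds and stubbed probes are my own illustration, not anything Barclays actually runs.  The principle is what matters: the environment detects the sustained failure and re-routes by itself, rather than waiting hours for people to work out what to do.

```python
import time

# Purely illustrative names and numbers: the hosts, thresholds and stubbed
# probes below are assumptions for this sketch, not Barclays' actual setup.
PRIMARY_SITE = "array-a.site-1.example"
SECONDARY_SITE = "array-b.site-2.example"

FAILURE_THRESHOLD = 3      # consecutive failed checks before the primary is declared dead
CHECK_INTERVAL_SECS = 5    # how often the monitor polls the primary


def site_is_healthy(site: str) -> bool:
    """Stub health probe. A real monitor would check the array, the SAN
    fabric, replication state and the application stack, not one component."""
    return True


def fail_over_to(site: str) -> None:
    """Stub failover action: promote the replicated copy at the target site
    and repoint the cluster / DNS so traffic is re-routed automatically."""
    print(f"Promoting {site} and re-routing service")


def monitor() -> None:
    """Simple decision loop: tolerate transient blips, fail over on sustained failure."""
    failures = 0
    while True:
        if site_is_healthy(PRIMARY_SITE):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                fail_over_to(SECONDARY_SITE)
                break
        time.sleep(CHECK_INTERVAL_SECS)


if __name__ == "__main__":
    monitor()
```

The hard part in real life is not the loop; it is the safeguards around it and the proof, through regular DR testing, that the failover actually works when invoked in anger.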

We shouldn’t be focusing here on the technology that failed.  We should be focusing on the process, design and support of the environment that wasn’t able to manage the hardware failure and “re-route” around the problem.  

One other thought.  I wonder if this problem would have been avoided with a bit of Hitachi HAM (High Availability Manager)?

About Chris M Evans

Chris M Evans has worked in the technology industry since 1987, starting as a systems programmer on the IBM mainframe platform, while retaining an interest in storage. After working abroad, he co-founded an Internet-based music distribution company during the .com era, returning to consultancy in the new millennium. In 2009 Chris co-founded Langton Blue Ltd (www.langtonblue.com), a boutique consultancy firm focused on delivering business benefit through efficient technology deployments. Chris writes a popular blog at http://blog.architecting.it, attends many conferences and invitation-only events and can be found providing regular industry contributions through Twitter (@chrismevans) and other social media outlets.
  • http://blogs.cinetica.it Enrico Signoretti

    Chris, I agree with you. The most important thing is not the hardware, which can fail or not, but designing the right process and testing it whenever necessary!
    ciao,
    Enrico

  • Rob

    “We shouldn’t be focusing here on the technology that failed. We should be focusing on the process, design and support of the environment that wasn’t able to manage the hardware failure and “re-route” around the problem.”

    Right. Along those lines, did the architects miss a single point of failure in a very important design?

    A design doc surely shows it to be a single point of failure.
    Or: “We didn’t know it to be a single point of failure!”
    “Who knew what when? Who signed off on it?”

    Either way, someone is called on the carpet somewhere. Not good.

  • Locutus

    As is often the case, DR (not to be confused with HA) is looked upon as an IT expense with low ROI. Therefore, the DR infrastructure is built not so much as a usable environment but rather as a tick-box exercise to satisfy an audit requirement.
    You can preach the merits of the golden copy.
    You can preach the importance of like for like storage footprints at the source and target to preserve the performance of the applications post DR.
    You can preach the necessity of maintaining consistency to guarantee a recoverable copy in the event of a rolling disaster.
    You can preach ad nauseam.

    In the end, it comes down to money vs. risk, and oftentimes customers err on the side of money because no one expects to fall into the .0001% availability hole in enterprise storage. They hope they’ll retire before the perfect storm hits.

    I can’t say that is the case at Barclays. But it has always been a sore point with me when I discuss HA/DR with my customers.

  • Gar

    Yep, from reading the article on Barclays, I agree with you Chris. I’ve witnessed outages caused by H/W but exacerbated by failings in the support system, a lack of true understanding of operating in a DR mode, and poor design. It would be interesting to know if they had executed a full Disaster Recovery test recently.
    Regards,
    Gar.

  • http://storage-sense.blogspot.com Storage Sense

    It is also a reminder that no matter how resilient a single array is designed to be, the frame and the datacenter it is housed in are in themselves single points of failure. The only way to mitigate that is to have a well-thought-out business continuity strategy that includes sound processes and technology such as mirroring of arrays.

    http://storage-sense.blogspot.com

    • Vidar Letto

      Sounds weird, since the different components inside the cabinet have several separate power lines, batteries, etc., and all parts are redundant. In theory (and in practice), no piece of hardware in the cabinets or array is a SPoF, including power lines and network.
      I cannot imagine such a critical system NOT utilizing this built-in redundancy.

  • http://storage-sense.blogspot.com Storage Sense

    It is also a reminder that no matter how resilient you design an array to be, the actual physical frame, the firmware in the array, and the datacenter the array is housed in are themselves single points of failure.

    What is needed is a well-thought-out Business Continuity strategy that includes sound processes, personnel management, and useful technology such as array mirroring.

    http://storage-sense.blogspot.com

  • Pingback: HDS disk array failure suspected in Barclays outage; where’s the HAM? - Storage Soup

  • Chris Evans

    Sounds like everyone commenting is of a similar mind – don’t blame the technology per se, look at the process. Thanks for all the comments.

    Chris
