
Virtualisation: Learning The Hard Way


They say that you learn the most when you make mistakes and things go wrong.  Well, last night I certainly must have learned a lot.  What started as a simple physical re-organisation of my hardware turned into a rebuild of my production VMware ESXi server – finishing at 1am.  Here’s what happened.

Failing Disk

I started by shutting down my production ESXi server and moving it out of, and then back into, the standard rack it occupies. On power up, the server failed to boot, claiming the boot disk was no longer present. A quick check inside showed that the SAS connector on the boot disk had come loose, so I plugged it back in and tried again (oh, SAS specification guys – bad design, no retainers on the plugs). Unfortunately, the boot disk had somehow become corrupted and the server wouldn't come up. No problem, I thought, just repair using the installation media. This is where things started to get complicated.

My ESXi server runs off a Seagate Savvio 2.5″ 15K 73GB drive, one of four that Seagate generously loaned me last year for long-term testing. More on that another day. The server has two disks installed, one of which holds VMs. During the repair process I wasn't sure which disk was the O/S and which was data. ESXi doesn't help much here, only indicating that both disks contained partitions with data, data that would be lost if I reinstalled.

Lesson 1 – Make sure you know exactly how your hardware is configured, down to the SAS ports each drive is plugged into.

Having multiple drives of the same type is actually a pain. So rather than risk data loss, I removed both drives and re-installed the ESXi O/S onto a third Savvio drive. All good. Now I needed to locate and import all my VMs; however, some were on the removed Savvio disks. This meant installing each disk independently and checking its contents to determine which contained VMs and which contained the broken O/S.
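
With hindsight, telling the disks apart is easy to script. Below is a minimal sketch (Python, assuming the suspect disk's datastore is mounted somewhere readable, e.g. under /vmfs/volumes on a working host – the path is an assumption, not my setup) that simply walks the mount point looking for .vmx files; a disk holding VMs will turn up hits, the broken O/S disk won't.

    # find_vms.py - minimal sketch: walk a mounted datastore and list any .vmx
    # (VM configuration) files, so a VM data disk can be told apart from a boot disk.
    # The default path below is an assumption; pass the real mount point as an argument.
    import os
    import sys

    def find_vmx(root):
        """Yield full paths of any .vmx files found under root."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.lower().endswith(".vmx"):
                    yield os.path.join(dirpath, name)

    if __name__ == "__main__":
        mount_point = sys.argv[1] if len(sys.argv) > 1 else "/vmfs/volumes/datastore1"
        hits = list(find_vmx(mount_point))
        if hits:
            print("Found %d VM(s):" % len(hits))
            for path in hits:
                print("  " + path)
        else:
            print("No .vmx files under %s - probably not a VM datastore." % mount_point)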

Lesson 2 – Wherever possible, place your VMs on disks separate from the server itself.

Yes, I do have most of my VMs on my Iomega ix4-200d, but, rather crucially, not my Windows 2008 AD Server, which needed to be moved from internal disk to the ix4 before I continued (schoolboy error there).  The AD server was rather important for accessing my, ahem, ix4, which is configured to validate logins using AD.  This creates a bit of a circular reference which could have been a disaster.

Lesson 3 – Place your Windows domain controller on a physical server, or have another independent backup elsewhere.

Having a physical server just for AD isn't part of my total virtualisation plan, so I'm looking at whether I can host a backup domain controller on Amazon AWS and use a VPN to connect it securely to my private network. That way, if I ever have an issue, I can still authenticate. The catch, of course, is cost, which may make a dedicated server the cheaper option.
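
For illustration only, here is a sketch of what standing up that instance might look like with the boto3 AWS SDK (modern tooling, not what was around at the time); the AMI, subnet and security group IDs are placeholders, the instance type is a guess, and promoting the box to a domain controller would still be a manual Windows task afterwards.

    # launch_backup_dc.py - sketch only: start a Windows instance in a private
    # subnet that a site-to-site VPN can reach. All IDs below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    response = ec2.run_instances(
        ImageId="ami-00000000000000000",      # placeholder Windows Server AMI
        InstanceType="t3.medium",             # a small instance would do for a backup DC
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-00000000",           # placeholder private subnet behind the VPN
        SecurityGroupIds=["sg-00000000"],     # placeholder SG allowing AD traffic from home
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "backup-dc"}],
        }],
    )
    print("Launched instance", response["Instances"][0]["InstanceId"])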

So, by 1am everything was back up and running.  Did I learn anything else?  Well yes…

Lesson 4 – After 22 years in IT, I should remember that adequate documentation and a DR plan are crucial. In fact, in a virtualised environment they are essential, because placing all systems on a single server concentrates the risk.

So what next for my virtual infrastructure? I have a few changes planned: I'll create a backup ESXi server that can import and run the VMs in the event of a future server failure. I will also be investigating AWS with Windows 2008 and a VPN to create a backup domain controller, and see whether I can continue to work if both servers' hardware failed.
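
As a rough sketch of the import step on such a backup host, the snippet below uses the pyVmomi library to re-register a VM's .vmx file from shared storage onto a standby ESXi server; the host name, credentials and datastore path are placeholders, not my real configuration, and the usual "I moved it / I copied it" question still has to be answered on first power-on.

    # register_vm.py - sketch: register an existing .vmx from shared storage on a
    # standby ESXi host using pyVmomi. Host, credentials and path are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect

    si = SmartConnect(host="esxi-backup.local", user="root", pwd="password",
                      sslContext=ssl._create_unverified_context())
    try:
        content = si.RetrieveContent()
        datacenter = content.rootFolder.childEntity[0]        # standalone host = one datacenter
        compute = datacenter.hostFolder.childEntity[0]        # the host's compute resource
        task = datacenter.vmFolder.RegisterVM_Task(
            path="[ix4-datastore] ad-server/ad-server.vmx",   # placeholder datastore path
            name="ad-server",
            asTemplate=False,
            pool=compute.resourcePool)
        print("Registration task started:", task.info.key)
    finally:
        Disconnect(si)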

That leaves one Single Point of Failure… my ix4-200d.  Anyone want to donate me a spare one?

About Chris M Evans

Chris M Evans has worked in the technology industry since 1987, starting as a systems programmer on the IBM mainframe platform, while retaining an interest in storage. After working abroad, he co-founded an Internet-based music distribution company during the .com era, returning to consultancy in the new millennium. In 2009 Chris co-founded Langton Blue Ltd (www.langtonblue.com), a boutique consultancy firm focused on delivering business benefit through efficient technology deployments. Chris writes a popular blog at http://blog.architecting.it, attends many conferences and invitation-only events and can be found providing regular industry contributions through Twitter (@chrismevans) and other social media outlets.

  • http://stevetodd.typepad.com Steve Todd

    Very entertaining story Chris. It’s especially entertaining when it happens to someone else!

  • InsaneGeek

    Not to divert from the very good message/example of documentation for DR plans… but there are some ways to have gotten things back relatively easily.

    All you should have had to do was look at the partition table in whatever program you wanted. You should have seen different partition layouts depending on whether the entire drive was used or not. If you had a drive used just for VMs it would only contain a single partition (partition type fb), whereas the boot disk would have more, as you can't boot an OS directly from VMFS.
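
    Something like the rough sketch below shows that at a glance; it reads the MBR from a device and prints each primary partition's type byte, flagging 0xfb (VMFS). The device path is just an example for when the disk is attached to a Linux box, and it needs root to read.

        # check_ptable.py - sketch: print the primary partition types from a disk's MBR.
        # 0xfb is VMFS, so a disk showing only that type is a pure VM datastore.
        # Run as root; /dev/sdb is only an example device path.
        import sys

        VMFS_TYPE = 0xFB

        def partition_types(device):
            """Return the non-empty partition type bytes from the device's MBR."""
            with open(device, "rb") as f:
                mbr = f.read(512)
            types = []
            for i in range(4):                           # four primary partition entries
                entry = mbr[446 + i * 16: 446 + (i + 1) * 16]
                ptype = entry[4]                         # type byte sits at offset 4
                if ptype:
                    types.append(ptype)
            return types

        device = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"
        for ptype in partition_types(device):
            label = "VMFS" if ptype == VMFS_TYPE else "other"
            print("partition type 0x%02x (%s)" % (ptype, label))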

    Additionally, you should have been able to either run ESXi from a live CD or install ESXi to a USB thumb drive and bring your system up, with the caveat that I haven't done this myself to make sure.

    Also, related to redundancy: if you aren't as concerned about offsite protection and are looking for an inexpensive option, you could run VMware Server on an existing local system (so you don't have to give up an entire machine to ESX) and run a second AD server instance there. You'd be protected against the same type of physical failure, since you'd have separate pieces of physical hardware (assuming you were using a local drive on each system, not a shared drive).


  • codewrtr

    Seems like a pretty good place to mention the benefits of a RAID configuration.

    -r

  • http://www.rivnet.ro Ionut Nica

    wow, I guess I’m not alone when it comes to vmware homelab failures :)

    My homelab is not as advanced as yours [single server, 2 disks no RAID, 2 NICs]. I had my hard drive start acting weird [overall sluggish system, console kept spitting out sector read errors].
    It took me 12 hours to copy 300GB of data from the bad disk to the good disk, with a barely running ESXi host, and get back on my feet. I lost 1 VM due to being careless (level 8 issue, nothing wrong with VMware).
    In the end it turned out the failing disk was fine, just the controller port was screwed up. Thank god I have 6 more ports :)

  • GP

    Using a single disk or JBOD config? Ouch, that's lesson number one. There is no such thing as a "good disk," just bad ones that haven't bitten you yet. ALWAYS use RAID, even for local storage.
