[clug] Story: Fijian Resort complex loses a single disk: business process stops for 1-2 days

Thu Jul 24 18:53:33 MDT 2014

Another story I contributed last night about running “servers”, but only half-remembered :(
The story, maybe provided by the Data Recovery business, is of a Hero Admin dealing with a dead disk.

I’d presume it was a Windows Server with the main DB on a hardware RAID (3xRAID-5) and the “configuration” stored on the system disk, which used Software/LVM to mirror two drives. I don’t believe two drives failed together: I’ve seen failed RAID drives left operating, unnoticed, for many months, possibly over a year. It’s more likely the Admins didn’t notice the first RAID-1 drive fail. [But correlated “bad batch” drive failures are known, not impossible, with all failing within 2 weeks.]

<http://www.zdnet.com/how-one-business-recovered-from-a-raid-failure-7000025177/>

Backups had silently failed.

This may have been “good practice” in 1995-2002, but not now.
Feel free to contribute suggestions for what they should’ve done… Like dual servers, RAID-10 (4 drives not 3), and more.

Enough people on this list run servers at home, work or at a hosting site for it to be relevant.

The question posed but never answered at the meeting was:
  “how do you create a backup system, at least for ‘2nd copy of precious data’, where you _know_ if it fails”.

Just sending “it worked!” emails doesn’t work: you’ll end up automatically deleting them, and after years of them arriving, you won’t notice their absence if they stop coming…
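
One partial approach is to invert the check: the backup job records each success, and a separate job complains when no recent success is on record, so silence reads as failure. Below is a minimal Python sketch of that idea. The stamp file, the 26-hour window and all function names are assumptions for illustration, not anything from the article, and it does nothing about the harder problem of knowing the checker itself is still running.

```python
import os
import time

# Assumed window: a nightly backup plus two hours of slack.
MAX_AGE_SECONDS = 26 * 60 * 60


def last_success_time(stamp_path):
    """Return the mtime of the stamp file, or None if it cannot be read.

    Assumes the backup job touches this file only after a verified success.
    """
    try:
        return os.path.getmtime(stamp_path)
    except OSError:
        return None


def is_stale(last_success, now, max_age=MAX_AGE_SECONDS):
    """True when no success is recorded, or the last one is older than max_age."""
    if last_success is None:
        return True
    return (now - last_success) > max_age


def check(stamp_path, now=None):
    """Return a one-line status; silence from the backup job reads as failure."""
    if now is None:
        now = time.time()
    if is_stale(last_success_time(stamp_path), now):
        return "ALERT: no successful backup recorded in the allowed window"
    return "ok"
```

Something like check() would probably want to run from cron on a different machine to the one doing the backups, so that a dead server doesn’t take its own watchdog down with it.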

Setting up a Great Big Alarm is attractive, but how do you _know_ the monitor is working and working properly? You’ve now also got to regularly test the Great Big Alarm.
False/stuck instrument readings and burnt-out indicator lamps have brought down many planes; this is a widespread problem.

What if the backup regime gets changed and the monitor treats errors or “zero copied” messages as “Proof of Life” or Completion? [Seen that too :(]
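
A guard against that particular failure might be to demand positive evidence from each run, not just the arrival of a message. The Python sketch below is hypothetical: the three inputs (exit code, files copied, bytes copied) and the default thresholds are assumptions about what a backup tool reports. An incremental run can legitimately copy nothing, so the minimums would have to suit the actual regime.

```python
def backup_problems(exit_code, files_copied, bytes_copied,
                    min_files=1, min_bytes=1):
    """Return a list of reasons this run should NOT count as a good backup.

    An empty list means the run produced positive evidence of work done.
    """
    found = []
    if exit_code != 0:
        found.append(f"backup exited with code {exit_code}")
    if files_copied < min_files:
        found.append(f"only {files_copied} files copied, expected >= {min_files}")
    if bytes_copied < min_bytes:
        found.append(f"only {bytes_copied} bytes copied, expected >= {min_bytes}")
    return found


def is_proof_of_life(exit_code, files_copied, bytes_copied, **limits):
    """True only when the run finished cleanly AND copied enough data."""
    return not backup_problems(exit_code, files_copied, bytes_copied, **limits)
```

With these defaults a “zero copied” run is reported as a problem even when the tool exits cleanly, so the monitor would only record a success when is_proof_of_life() returns True.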

If anyone has a good solution to this question, I’d love to hear.
I know it has to be simple, because adding complexity will not address the root causes of failed monitoring.

--
Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin