[clug] Story: Fijian Resort complex loses a single disk: business process stops for 1-2 days

Thu Jul 24 20:28:07 MDT 2014

One way (I had it working like this) is to send an exception mail whenever a backup
saves significantly more or less than the recent average. This works less well for
incremental backups, whose sizes vary more from run to run, but it is still usable.
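
A minimal sketch of that size check, in Python. The function name, the 50%
tolerance and the idea of passing in a short list of recent sizes are assumptions
for illustration, not a description of the setup I actually ran.

```python
from statistics import mean

def size_is_suspicious(recent_sizes, new_size, tolerance=0.5):
    """Return True if new_size differs from the recent average by more
    than `tolerance` (a fraction; 0.5 means 50%, an assumed default)."""
    if not recent_sizes:
        return False  # no history yet, nothing to compare against
    avg = mean(recent_sizes)
    if avg == 0:
        # History of empty backups: any non-empty backup is a change.
        return new_size != 0
    return abs(new_size - avg) / avg > tolerance
```

The caller would send the exception mail when this returns True. For incremental
backups a wider tolerance is probably needed to avoid constant false alarms.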

Naturally, if everything is allowed to fail then so can this heuristic. This means
a human should review the process with some regularity (probably not each time a
backup is taken).

We also parsed the backup log to make sense of everything. This can be a separate
script from the actual backup, reducing the chance of a single point of failure.
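
A rough sketch of such a separate log checker, in Python. The log format here is
entirely hypothetical: the error keywords and the "backup completed: N files" line
are assumptions and would need to match whatever the real backup tool writes.

```python
import re

# Assumed patterns, not taken from any real backup tool.
ERROR_RE = re.compile(r"\b(error|failed|fatal)\b", re.IGNORECASE)
DONE_RE = re.compile(r"backup completed: (\d+) files")

def check_backup_log(lines):
    """Return a list of problems found in the log; empty means it looks OK."""
    problems = []
    files_copied = None
    for line in lines:
        if ERROR_RE.search(line):
            problems.append("error line: " + line.strip())
        match = DONE_RE.search(line)
        if match:
            files_copied = int(match.group(1))
    # A missing completion line is itself a failure, not a pass.
    if files_copied is None:
        problems.append("no completion line found")
    elif files_copied == 0:
        problems.append("backup completed but copied zero files")
    return problems
```

The point of the design is that silence counts as failure: a log with no
completion line, or one reporting zero files, is flagged rather than accepted.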

cheers,
	Eyal

On 07/25/14 10:53, steve jenkin wrote:
> Another story I contributed last night about running "servers", but only half-remembered :(
> The story, maybe provided by the Data Recovery business, of a Hero Admin dealing with a dead disk.
>
> I'd presume it was a Windows Server and their main DB was on a hardware RAID (3xRAID-5) and the "configuration" was stored on the system disk, provided using Software/LVM to mirror two drives. I don't believe two drives failed together, I've seen failed RAID drives left operating, unnoticed, for many months, possibly over a year. It'd be more likely the Admins didn't notice the original RAID-1 drive fail. [But correlated "bad batch" drive failures are known, not impossible. All fail within 2 weeks.]
>
> <http://www.zdnet.com/how-one-business-recovered-from-a-raid-failure-7000025177/>
>
> Backups had silently failed.
>
> This may have been "good practice" in 1995-2002, but not now.
> Feel free to contribute suggestions for what they should've done... Like dual servers, RAID-10 (4 drives not 3), and more.
>
> Enough people on this list run servers at home, work or at a hosting site for it to be relevant.
>
> The question posed but never answered at the meeting was:
>    "how do you create a backup system, at least for '2nd copy of precious data', where you _know_ if it fails".
>
> Just sending "it worked!" emails not only doesn't work, you'll end up automatically deleting them, or if they stop coming, you won't notice their absence after years of working...
>
> Setting up a Great Big Alarm is attractive, but how do you _know_ the monitor is working and working properly? You've now also got to regularly test the Great Big Alarm.
> False/stuck instrument readings and burnt-out indicator lamps have brought down many planes, this is a widespread problem.
>
> What if the backup regime gets changed and the monitor treats errors or "zero copied" messages as "Proof of Life" or Completion? [Seen that too :(]
>
> If anyone has a good solution to this question, I'd love to hear.
> I know it has to be simple, because adding complexity will not address the root causes of failed monitoring.
>
> --
> Steve Jenkin, IT Systems and Design
> 0412 786 915 (+61 412 786 915)
> PO Box 48, Kippax ACT 2615, AUSTRALIA
>
> mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin

-- 
Eyal Lebedinsky (eyal at eyal.emu.id.au)