[clug] Story: Fijian Resort complex loses a single disk: business process stops for 1-2 days
scott.ferguson.clug at gmail.com
Fri Jul 25 07:31:52 MDT 2014
On 25/07/14 10:53, steve jenkin wrote:
> Another story I contributed last night about running “servers”, but only half-remembered :(
> The story, maybe provided by the Data Recovery business, of a Hero Admin dealing with a dead disk.
> I’d presume it was a Windows Server and their main DB was on a hardware RAID (3xRAID-5) and the “configuration” was stored on the system disk, provided using Software/LVM to mirror two drives. I don’t believe two drives failed together, I’ve seen failed RAID drives left operating, unnoticed, for many months, possibly over a year. It’d be more likely the Admins didn’t notice the original RAID-1 drive fail. [But correlated “bad batch” drive failures are known, not impossible. All fail within 2 weeks.]
> Backups had silently failed.
> This may have been “good practice” in 1995-2002, but not now.
> Feel free to contribute suggestions for what they should’ve done… Like dual servers, RAID-10 (4 drives not 3), and more.
> Enough people on this list run servers at home, work or at a hosting site for it to be relevant.
> The question posed but never answered at the meeting was:
> “how do you create a backup system, at least for ‘2nd copy of precious data’, where you _know_ if it fails”.
Prove it. (this is engineering, right?)
That's especially simple if the file system/s being backed up are
encrypted - if you can't decrypt the backup, the backup is worthless
(and if you don't back up encrypted filesystems, any disk error will make
you regret it).
If it's unencrypted data and it's important it probably should have been
encrypted in the first place (belt and suspenders) - but you can always
generate signatures of unencrypted data and check them against backups -
not difficult to automate. With dynamic data you may have to implement
transaction accounting, or at least some change control processes - but
still "do-able" (sometimes called "BMP" ;).
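A minimal sketch of that signature check, assuming the source and the backup are both plain directory trees you can read (e.g. a mounted backup disk). The function names (sha256_of, build_manifest, verify_backup) are made up for illustration, and it only covers static files, not the dynamic-data case:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(source: Path, backup: Path) -> list[str]:
    """Return a list of problems; an empty list means the backup matches."""
    src = build_manifest(source)
    bak = build_manifest(backup)
    problems = []
    for name, digest in src.items():
        if name not in bak:
            problems.append(f"missing from backup: {name}")
        elif bak[name] != digest:
            problems.append(f"digest mismatch: {name}")
    return problems
```

Run something like verify_backup(Path("/data"), Path("/mnt/backup/data")) after each backup job (both paths are placeholders) and raise an alert whenever the returned list is not empty. That is the "you _know_ if it fails" part.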
It's also good practice to test the backup media themselves - to see how
reliable they are at "keeping" verified backups.
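One way to test the media, sketched under the assumption that the backup is a readable directory tree: record digests when the backup is known good, keep that record somewhere other than the backup medium, and re-read the medium against it every so often. The names (digest_tree, record_manifest, recheck_medium) are hypothetical:

```python
import hashlib
import json
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file under root to its SHA-256 digest."""
    # Reads each file whole, which is fine for a sketch;
    # read in chunks if the files are large.
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def record_manifest(backup_root: Path, manifest_file: Path) -> None:
    """Save digests of a freshly verified backup (store this off the medium)."""
    manifest_file.write_text(json.dumps(digest_tree(backup_root), indent=2))


def recheck_medium(backup_root: Path, manifest_file: Path) -> list[str]:
    """Return the files that have vanished or changed since the manifest was recorded."""
    recorded = json.loads(manifest_file.read_text())
    current = digest_tree(backup_root)
    return sorted(
        name for name, digest in recorded.items() if current.get(name) != digest
    )
```

Any name that recheck_medium returns is a file the medium failed to "keep", which over months gives you a rough measure of how far to trust that medium.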
The big question is why don't people already do this? *Particularly*
As far as I can tell the cause is human (over-investment in emotional
"trust" validation) - whenever I ask why people failed to verify, they
seem prepared to invest more time and effort in ridiculous
justifications for not doing so than implementing a verification
system would take.
Surely you haven't forgotten your ITIL so soon, Steve?