[clug] Story: Fijian Resort complex loses a single disk: business process stops for 1-2 days

Scott Ferguson scott.ferguson.clug at gmail.com
Sun Jul 27 00:26:02 MDT 2014


hasty reply, will surely contain schoolboy howlers :/

On 27/07/14 07:56, Alex Satrapa wrote:
> On 26 Jul 2014, at 23:48, Scott Ferguson 
> <scott.ferguson.clug at gmail.com> wrote:
> 
>> Again - that verifies the backup, not the reliability. Counting the 
>> bottles you store on the nature strip isn't laying down a cellar 
>> for your future ;p
> 
> So how do you verify the media when the media is, say, a USB hard 
> drive?

AFAIK you can't (or I would have included verification method examples
with the others in the previous post).... you can only verify failure,
which is of limited use. Spinning magnetic media is only slightly less
problematic (SMART is not reliable for predicting failure with small
numbers of drives).
Given that, and if that backup is important*1 - then the only recourse
is to further distribute the risk, i.e. don't rely on just one USB flash
drive (use 2, different brands), and, preferably, add off-site Blu-ray
backups. Likely there is some data you're backing up (passwords,
password manager, critical documents, keys) that is very important*1,
so perhaps two backup schemes? (stuff I can't lose, stuff I can live
with losing)  The same applies to the OP (server backups), i.e.
partitioning, keys, xen/vhost config, and package lists are backed up
separately from server data. Not just for the purposes of risk
management, but also because it's often not practical to even snapshot
an entire running server, let alone take a full image.
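
For illustration only (device names, paths and the guest name below are
made up, not the OP's actual setup), the sort of thing I mean by backing
up the server metadata separately:

  # partition layout (restorable later with "sfdisk /dev/sda < sda.parts")
  sfdisk -d /dev/sda > /backup/meta/sda.parts
  # installed package list (restorable with dpkg --set-selections)
  dpkg --get-selections > /backup/meta/packages.list
  # xen/libvirt guest definitions
  rsync -a /etc/xen/ /backup/meta/xen/
  virsh dumpxml someguest > /backup/meta/someguest.xml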

*1 i.e. how great an impact failure will make, rather than what the
probability of failure is

> 
>> You snip out all of my original post bar the sentence about 
>> verifying the backup medium then state that you compare the backup
>>  to the original. Why?  (I also note that you could/should have 
>> checked the backup by using rsync with the checksum option - cp is
>>  a dangerous way to do backups)
> 
> Here is how you check the backup to verify that it’s working: 1) Keep
> MD5 sums 2) Check the backup against the MD5 sums


Good, then you're in agreement with the previous posts in this thread by
others and me.
And it will work, with caveats*2, if those MD5sums are stored separately
from the system being backed up (possible with TM? unlikely). The OP
is about publicly exposed server systems, so security is important -
md5sums have lower overhead than gpg at the cost of security. If your
server is worth backing up it should be running an IDS, so my preference
is gpg. PCI compliance and other security requirements may also be a
deciding factor, depending on the user/administrator/client's idea of
doing a proper job.
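
A rough sketch of what I mean (file names invented, adapt to taste) -
generate the manifest on the source, verify it against the backup copy,
and optionally sign it so tampering is detectable as well as bit-rot:

  # on the source, before the backup run
  cd /srv/data && find . -type f -exec md5sum {} + > /var/tmp/data.md5
  # later, from the root of the backup copy (manifest kept elsewhere)
  cd /mnt/backup/data && md5sum -c /var/tmp/data.md5
  # or, with gpg, sign the manifest and verify the signature
  gpg --detach-sign /var/tmp/data.md5
  gpg --verify /var/tmp/data.md5.sig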

*2 except in your stated use-case with that commercial closed product,
as it's unlikely you can checksum TM snapshots in any useful way - only
verify that the sparse image structure is intact, and trust that fsevents
and the hardlinks have ensured all changed files are copied to the delta.

> 
> Nothing in there about cp.

I'm sorry you took that personally.

> I use Time Machine which is basically doing the same thing as “rsync 
> snapshot” backups.

In that it's a form of incremental copies. Yes, understood.

Snapshots are handy, sometimes legal requirements, and generally a good
addition to a backup procedure (one of the joys of btrfs).

NOTE: snapshots are not backups. A carton of eggs is 12 eggs, not
1 egg with 11 backups.

Ever tried to use snapshots to "back up" live databases?
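
Rhetorical, but to make the point concrete (btrfs paths and the database
name below are invented): a read-only snapshot only freezes the
filesystem, it still has to leave the box to become a backup, and a live
database still needs a consistent dump first:

  # dump first - the snapshot alone may catch the DB mid-write
  pg_dump somedb > /srv/data/dumps/somedb.sql
  # read-only snapshot of the subvolume
  btrfs subvolume snapshot -r /srv/data /srv/snapshots/data-$(date +%F)
  # copy the snapshot off the machine - *now* it's a backup
  rsync -aH /srv/snapshots/data-$(date +%F)/ backuphost:/backups/data/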

> It copies-by-reference the entire file system then hard copies the 
> files that have changed.

By 'reference' I'm guessing you mean the proprietary fsevents and 'hard
linking'?

I'm biased against Apple/closed source, and most of my experience with
Time Machine is when I've had to recover backups (using rsync) after it's
eaten clients' backups :(    (see attached for the most common TM fail)

This thread has wandered from backups for servers, to backups in
general, to proprietary closed-source systems. While interesting, it may
just confuse those searching for a single-step backup panacea.
In principle the problems are the same - IIRC TM doesn't verify the
backup against the original (it relies on fsevents to determine what's
changed), and when it does "verify" it's checking the backup sparse
image structure - I've no idea how you would tack a checksumming system
onto it (if that's even possible). And it's reliant on a single backup
medium.
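
For anyone following along at home, the open equivalent of that
copy-by-reference trick is rsync's --link-dest (paths illustrative
only): unchanged files become hard links into the previous snapshot,
changed files are copied in full - and unlike TM you can bolt a checksum
manifest onto it:

  rsync -aH --checksum \
      --link-dest=/mnt/backup/daily.1 \
      /srv/data/ /mnt/backup/daily.0/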

> 
>> The discussion (see OP) is about a failure of a backup system. One
>>  of the most common is media failure due to a failure to practise 
>> BMP and actually prove the reliability of the medium (damaged array
>> *and* pp backup plan didn't backup critical data).
> 
> How does one verify the media more cheaply than checksumming the 
> content?

Content checksums do not verify the media - so "cost" is not a
factor.

You 'could' use SMART to monitor the disks on your NAS (but confusingly
you describe a fast local backup system that takes all day to back up a
few TiB, and you have a USB drive bottleneck - with a NAS - so who
knows?) But if it's one of those "consumer grade" NAS boxes with a
single point of failure, stored in the same place as the system being
backed up, it may be more trouble to implement than it's worth. (e.g.
Canberra has a high burglary rate)
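
If you do go the SMART route it costs little to try (device names are
examples only, and smartd needs a working MTA to mail you):

  # one-off health and attribute summary
  smartctl -H -A /dev/sda
  # kick off a long self-test
  smartctl -t long /dev/sda
  # or let smartd watch continuously - a line like this in
  # /etc/smartd.conf runs a weekly long self-test and mails on trouble:
  #   /dev/sda -a -o on -S on -s (L/../../7/03) -m admin@example.org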

<snipped>
> 
> How does one verify the backup plan is working, other than doing a 
> full or partial restore of data?

Initially you can't (bootstrap problem).

Though you don't ask, for reference purposes:
Even if the logic schema is sound you still need to 'prove' the
practice, or the 'trust' is just a misuse of the word 'faith'. The
non-destructive method is to simulate a catastrophe by 'restoring' from
*a* backup onto different hardware.
Having proven the procedure is useful, bench-check the plan's built-in
verification and determine the change-control requirements and archiving
rules. (I've always relied on others to do that for me)
Document everything in detail. After which it gets recursive :)
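
In concrete terms that 'prove it' step is no more exotic than something
like the following, onto spare hardware or a throwaway VM (the paths and
the manifest from the earlier sketch are assumptions):

  # restore the latest backup onto scratch hardware
  rsync -aH backuphost:/backups/data/ /srv/data/
  # compare what landed against the manifest made at backup time
  cd /srv/data && md5sum -c /var/tmp/data.md5 | grep -v ': OK$'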


Kind regards

