Estimating backup usage with dir-merge filter

Paul Dugas paul at dugasenterprises.com
Thu Oct 6 14:54:27 MDT 2011


On Thu, Oct 6, 2011 at 4:01 PM, Benjamin R. Haskell <rsync at benizi.com> wrote:
> It sounds like you missed the point of Kevin's message (in the other fork of this thread).  The point wasn't to use
> `du`, it was that you can run your stats against the backed-up files, not the source.  Then you're only running stats
> against the results of running the backup using the filters, so you don't need to filter them again.

I got that but neglected to respond to the whole group; my mistake.
The backups are performed with BackupPC to a central server where
compression and de-duplication are done.  While it's true that the
actual storage each user consumes on the backup server is smaller
because of these, I have no problem hiding that from them and
reporting their uncompressed, un-deduplicated usage instead.  It has
more of an effect that way, if you know what I mean.

> If that doesn't make sense or isn't possible (backups are on some remote server), then just use your rsync command
> with '--list-only', and post-process that list.
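
That could work too.  Post-processing the list would be something
like this (untested sketch; it assumes the size lands in the second
column, strips any digit grouping, and skips directories so the sum
tracks file contents only):

$ rsync --list-only -a . /tmp/does_not_exist \
    | awk '$1 !~ /^d/ { gsub(/,/, "", $2); sum += $2 } END { print sum }'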

I've been tinkering with --verbose and --dry-run instead, then
parsing the total size out of the last line of the output, and I
think I'm close.  Curiously, when I leave out the --filter option as
a baseline, I'm not getting the same result as "du".

$ du -sb . | awk '{print $1}'
508625653

$ rsync --dry-run --verbose -a . /tmp/does_not_exist \
    | tail -1 | awk '{print $4}'
506037893
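
A slightly sturdier variant (sketch; --stats is standard rsync, and
the awk just keeps the digits from its "Total file size" line) avoids
depending on the field position of the last verbose line:

$ rsync --dry-run --stats -a . /tmp/does_not_exist \
    | awk -F': *' '/^Total file size/ { gsub(/[^0-9]/, "", $2); print $2 }'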

The difference is minimal and probably negligible for this purpose,
but I'm still curious where it's coming from.  Maybe there are some
sparse files in there somewhere, though with -b both tools should be
reporting apparent sizes; my better guess is that "du" counts the
size of each directory entry itself, while rsync's total only counts
regular files and symlinks.
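
If I get curious enough, one way to test that guess (assuming GNU
find) would be to sum the sizes of the directories themselves, which
"du -sb" counts but rsync's total appears to leave out:

$ find . -type d -printf '%s\n' | awk '{ sum += $1 } END { print sum }'

If that lands near the 2,587,760-byte gap, mystery solved.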

Paul

