Estimating backup usage with dir-merge filter

Fri Oct 7 05:40:32 MDT 2011

On Thu, Oct 6, 2011 at 6:49 PM, Henri Shustak <henri.shustak at gmail.com> wrote:
>>> It sounds like you missed the point of Kevin's message (in the other fork of this thread).  The point wasn't to use
>>> `du`, it was that you can run your stats against the backed-up files, not the source.  Then you're only running stats
>>> against the results of running the backup using the filters, so you don't need to filter them again.
>>
>> I got that but neglected to respond to the whole group.  My mistake.
>> The backups are being performed using BackupPC to a central server
>> where compression and de-duplication are done.  While it's true that
>> the actual storage on the backup server being consumed by each user is
>> less because of these, I don't have any problem hiding this from them
>> and telling them what their uncompressed and duplicated usage
>> is instead.  It has more of an effect that way if you know what I
>> mean.
>>
>>> If that doesn't make sense or isn't possible (backups are on some remote server), then just use your rsync command
>>> with '--list-only', and post-process that list.
>>
>> I've been tinkering with using --verbose and --dry-run then parsing
>> the total size out of the last line of the output and I think I'm
>> close.  Curiously, when I don't include the --filter option as a
>> baseline, I'm not getting the same results as "du".
>>
>> $ du -sb . | awk '{print $1}'
>> 508625653
>>
>> $ rsync --dry-run --verbose -a . /tmp/does_not_exist | tail -1 | awk '{print $4}'
>> 506037893
>>
>> The difference is minimal and probably negligible for this purpose but
>> I'm still curious where it's coming from.  Maybe there are some sparse
>> files in there somewhere.
>
> Do you have the same discrepancy if you use the --stats option?

Yes.  With --stats, the total on the last line of the output matches
the earlier "Total file size:" line in the additional output.
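
One way to narrow the gap down, as a rough sketch rather than a
diagnosis: total the regular files and the directories separately.
du includes the sizes of the directories themselves, so if rsync's
total covers file data only, the regular-file sum should land closer
to rsync's figure.  This assumes GNU find (for -printf), and the
function names are made up for illustration.

```shell
# Sum the apparent sizes, in bytes, of regular files under a directory.
# Assumption: GNU find is available (-printf is not POSIX).
sum_regular_files() {
    find "${1:-.}" -type f -printf '%s\n' |
        awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Sum the sizes reported for the directory entries themselves,
# which du counts in addition to the files they contain.
sum_directories() {
    find "${1:-.}" -type d -printf '%s\n' |
        awk '{ total += $1 } END { printf "%.0f\n", total }'
}
```

Running "sum_regular_files ." and "sum_directories ." in the same
directory as the du and rsync commands gives two numbers to compare
against each total.  Symlinks and hard links are not handled here and
could account for part of any remaining difference.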

Paul