file statistics collection using stat(2) data obtained by rsync

Sun Sep 16 15:43:28 GMT 2007

On 9/16/07, Hugo Connery <hmc at er.dtu.dk> wrote:
> Yes, uid/access time based statistics gathering is quite orthogonal to
> rsync's motivation.  But as rsync backs up my data, it has access
> to all the statistics I need, so why not piggyback the stats gathering
> on rsync as a matter of efficiency?

The efficiency loss from a separate pass doesn't seem to be much.  On
my computer, git can traverse my kernel source tree and stat all ~22000
files in about half a second, provided the stat info is already in the
kernel's cache.  If you gather statistics right before running rsync,
the statistics pass will probably take longer because it has to bring
the stat info into cache, but having done so will speed up rsync's own
stat calls by about the same amount, so overall the loss is small.
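
A separate statistics pass of the kind described above can be quite
small.  The following is only a minimal sketch in Python, assuming the
statistics wanted are per-uid file counts, total bytes, and the most
recent access time.  The function name gather_stats and the choice of
fields are hypothetical and are not part of rsync or of the original
discussion.

```python
import os
import stat
from collections import defaultdict


def gather_stats(root):
    """Walk root and aggregate lstat(2) data per owning uid.

    Hypothetical helper for illustration; the aggregated fields
    (files, bytes, last_atime) are an assumption about what is wanted.
    """
    totals = defaultdict(lambda: {"files": 0, "bytes": 0, "last_atime": 0.0})
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # count regular files only
            entry = totals[st.st_uid]
            entry["files"] += 1
            entry["bytes"] += st.st_size
            entry["last_atime"] = max(entry["last_atime"], st.st_atime)
    return dict(totals)
```

Calling gather_stats on the directory that rsync is about to back up
returns a dictionary keyed by uid.  Calling lstat on a file does not
update that file's access time, so the pass does not disturb the atime
values it is collecting.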

> But perhaps orthogonal extensions break one of the fundamental rules:
> do one thing well.

This rule is extremely important when deciding what functionality to
include in the standard rsync, but if making your own copy of rsync
with an orthogonal extension is the best way for you to accomplish a
specific task, I would say go for it.  I'm just not convinced it's the
best way.  A separate program has several advantages: rsync can be
upgraded without having to be re-patched, statistics gathering can be
run and scheduled independently of rsync, and the program can be
written in Perl rather than C.

I'm just giving my advice/opinion; you should implement the statistics
gathering however you like.

Matt