feedback on rsync-HEAD-20050125-1221GMT

Mon Jan 31 16:04:32 GMT 2005

Hi Chris,

Chris Shoemaker wrote:
> On Fri, Jan 28, 2005 at 03:42:25PM -0500, Alberto Accomazzi wrote:
> 
>>Chris Shoemaker wrote:
>>
>>
>>>If I understand Wayne's design, it would be possible to invent a
>>>(per-directory) "hook" rule, whose value is executed, and whose stdout
>>>is parsed as a [in|ex]clude file list.  E.g.:
>>>
>>>-R "cat .rsync-my-includes"
>>>
>>>or
>>>
>>>-R "find . -ctime 1 -a ! -fstype nfs -a ! -empty -o -iname 'foo*'"
>>
>>This is certainly a very powerful mechanism, but it definitely should 
>>not be the only way we implement file filtering.  Two problems:
>>
>>1. Sprinkling rule files like these across directories would mean 
>>executing external programs all the time for each file to be considered. 
> 
> 
> No, only one execution per specified rule.  Most users of this feature
> would specify one rule at the root directory.  But, if a user
> wanted to change the rules for every directory, they would have to
> specify a rule in each directory.  Then, yes, one execution per
> directory.  Presumably they would do this because they actually need
> to.  Never one execution per file.

Ok, I guess I had misunderstood your original suggestion.  One execution 
per directory is presumably not so bad, although it's hard to make 
assumptions about how one's data hierarchy is structured.
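
To make the cost model concrete, here is a minimal sketch of "one execution per directory".  It is only an illustration of the idea under discussion, not an rsync feature: the hook file name `.rsync-hook`, the function `run_hooks`, and the tab-separated output format are all assumptions made up for this example.

```sh
#!/bin/sh
# Hypothetical sketch: the hook file name and its semantics are assumptions,
# not something rsync implements.
HOOK_NAME=".rsync-hook"

# For every directory under $1 that contains a hook file, run the hook once
# (with that directory as the working directory) and print one line per
# entry it emits, in the form "<directory><TAB><entry>".
run_hooks() {
    root=$1
    find "$root" -type f -name "$HOOK_NAME" | sort | while IFS= read -r hook; do
        dir=$(dirname "$hook")
        # One process spawn covers the whole directory, never one per file.
        # stdin is redirected so the hook cannot consume the list of hooks.
        ( cd "$dir" && sh "./$HOOK_NAME" </dev/null ) | while IFS= read -r entry; do
            printf '%s\t%s\n' "$dir" "$entry"
        done
    done
}
```

Calling `run_hooks /some/tree` spawns as many hook processes as there are hook files in the tree, so the overhead scales with the number of directories that carry a rule, not with the number of files.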

>> This would presumably slow down rsync's execution by an order of 
>>magnitude or so and suck the life out of a system doing a big backup job.
> 
> 
> If you're referring to process spawning overhead, it's no big deal.
> If you're referring to the actual work required to return the file
> list, what makes you think that rsync can do it more efficiently than
> 'cat' or 'find', or whatever tool the user chose?

I was referring to the overhead of spawning a process per file being 
considered.  But I think we all agree that this is neither desirable nor 
necessary.

>>2. Who actually needs such a powerful yet hard-to-handle 
>>mechanism?  Most of rsync's users are not programmers, and even the few 
>>of us who are apparently still get confused by rsync's include/exclude 
>>logic, never mind even more complicated approaches.
> 
> 
> Do you mean include/exclude mechanism or filtering mechanism?  Well,
> IMO, parsing a file list is *less* complicated than rsync's custom
> pattern specification and include/exclude chaining.  Actually, I think
> rsync patterns are /crazy/ complicated and fully deserve the pages
> upon pages of documentation, explanation and examples that they get in
> the man page.
> 
> But, complexity is somewhat subjective, so I won't argue (much) about
> it.  In practice, /familiarity/ is far more important than complexity
> in a case like this.  Someone who looks at rsync for the first time
> has a _zero_ chance of having seen something like rsync's patterns
> before, because there is nothing else like them.  

I agree that exclude/include patterns can be tricky, and you have a good 
point about familiarity versus complexity.  I think what makes them hard 
to handle is the fact that we are dealing with filename (and directory 
name) matching and recursion.  So matching only a subset of a file tree, 
while simple as a concept, is non-trivial once you sit down and realize 
that you need a well-defined syntax for it.  Can you write a find 
expression that is simpler or more familiar to the average user than 
rsync's include/exclude rules?
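
For comparison, here is one simple selection ("every *.c file in the tree") written both ways.  This is a sketch, not a verdict on which is simpler: the function names `list_c_files` and `copy_c_files` are invented for the example, and the two forms are not strictly equivalent, since the rsync rules also keep every directory, including ones that end up empty.

```sh
#!/bin/sh
# The same selection, "all *.c files under a tree", expressed two ways.

# 1) find: list matching regular files, relative to the tree root $1.
list_c_files() {
    ( cd "$1" && find . -type f -name '*.c' | sort )
}

# 2) rsync include/exclude: keep every directory so recursion continues,
#    keep *.c files, drop everything else.  The first matching rule wins,
#    so the order of the three options matters.
copy_c_files() {
    rsync -a --include='*/' --include='*.c' --exclude='*' "$1"/ "$2"/
}
```

The `--include='*/'` rule is the part that tends to surprise people: without it, the final `--exclude='*'` would also exclude the subdirectories, and rsync would never descend far enough to see the *.c files inside them.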

> (The allusion to GNU
> tar's --exclude option which takes only a filename, not a pattern,
> isn't really helpful in understanding rsync's --exclude option.)

Uh?  Tar does take patterns for exclusion, and has its own quirky way of 
dealing with wildcards, directory matching and filename anchoring:
http://www.gnu.org/software/tar/manual/html_node/tar_100.html

> It's not that pattern matching for file selection isn't complex --
> it's just that it's such a well-defined, conceptually simple, common
> task that other tools (like 'find' and 'bash') handle better than
> rsync ever will.  And that's the way it should be: it's the unix way.

I agree that this is something we should be striving for as much as 
possible: pipeline and offload tasks rather than bloating applications.

>>If you really need 
>>complete freedom maybe the way to go is to do your file selection first 
>>and use --files-from.  
> 
> 
> Yes, --files-from is nice, and honestly, almost completely sufficient.
> But in some dynamic cases, you can't keep the list updated.

Well, maybe we should go back and see if the solution to all problems 
isn't making --files-from sufficient.  What exactly is missing from it 
right now?  The capability to delete files which are not in the 
files-from list?  Or the remote execution of a command that can generate 
the files-from list for an rsync server?  Maybe we ought to really 
figure out what things cannot be achieved with the current functionality 
before coming up with something new.
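
As a baseline for that question, here is a minimal sketch of what is already possible: do the file selection with an external tool and hand the result to --files-from.  The function names `recent_files` and `sync_recent` and the "modified within the last day" criterion are assumptions chosen for illustration only.

```sh
#!/bin/sh
# Select files with find, then let rsync transfer exactly that list.

# Print regular files under $1 modified within the last day,
# as paths relative to $1.
recent_files() {
    ( cd "$1" && find . -type f -mtime -1 | sort )
}

# Feed the list to rsync on stdin.  With --files-from, the listed paths
# are interpreted relative to the source directory $1.
sync_recent() {
    recent_files "$1" | rsync -a --files-from=- "$1" "$2"
}
```

A call such as `sync_recent /data/ backup:/data/` regenerates the list on every run, which covers the dynamic case on the sending side.  It does not delete anything on the receiver that has dropped out of the list.  File names containing newlines would need `find -print0` together with rsync's `--from0`.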

>>challenge is making this powerful without making it too complicated, 
>>because in that case nobody will use it.
> 
> 
> You see --filter as less complicated than --include/exclude, then?
> It's certainly more powerful.

Since --filter can support a superset of the file selection rules that 
--include/exclude supports, it's certainly more complicated than 
include/exclude, but not by much: I still think the trickiest part of 
the file selection rules for the average user will be pattern matching. 
The other big issue looming is the logic used for 
nesting/inheriting/overriding file selection rules.  I'm really worried 
that those can easily get out of hand.

-- Alberto

********************************************************************
Alberto Accomazzi                      aaccomazzi(at)cfa harvard edu
NASA Astrophysics Data System                        ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics      www.cfa.harvard.edu
60 Garden St, MS 31, Cambridge, MA 02138, USA
********************************************************************