Query re: rolling checksum algorithm of rsync

Chris Shoemaker c.shoemaker at cox.net
Fri Feb 11 18:03:28 GMT 2005


On Fri, Feb 11, 2005 at 11:08:45AM +0000, Alun wrote:
> Chris Shoemaker (c.shoemaker at cox.net) said, in message
>     <20050210190749.GA9297 at cox.net>:
> > 
> > > If the log file is e.g. 2Gbytes long and has only had 100Kbytes appended
> > > since the last rsync, then using --whole-file means 2GBytes of network
> > > traffic and 2GBytes of disk I/O at either end. Using the checksum means
> > > 2Gbytes of disk I/O at either end and 100Kbytes of network traffic (plus the
> > > checksum data). Neither is ideal.
> > 
> > use logrotate.
> 
> I'm aware of things like logrotate, but if I have to rotate the logs every
> hour on each of my webcache servers so that rsync will perform well, then I
> can't really afford to do it. I'd end up keeping 144 logfiles per day on
> the logging server just to make rsync efficient.

144 per day?  Oh dear!  :) I think your filesystem could handle it.
BTW, using multiple files has the additional benefit of time-stamping
periods of the log -- like bookmarks in time.

> 
> Similarly, remote syslog wouldn't tackle it since not all the services for
> which we need to collate logs even use syslog.

Write the logs to a networked filesystem then?

> 
> At the moment, I have a script which runs every 10 minutes and just copies
> over the tail of the logfile, using the current size on the logging server
> as its start point. This works OK, but it's yet another custom service to
> maintain. 

Doesn't sound too bad, but I can see why you'd want to simplify it.
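
For reference, the approach you describe can be sketched roughly as
below. This is only an illustration, not your actual script: the name
append_tail and its parameters are hypothetical, and both paths are
assumed to be reachable as local files (in your setup the read would
go over the network instead).

```python
import os


def append_tail(src_path, dst_path, chunk_size=64 * 1024):
    """Append to dst_path the bytes of src_path beyond dst's current size.

    Assumes the source is append-only, so the destination is always a
    prefix of the source. Returns the number of bytes copied.
    """
    # The destination's current size is the start point in the source.
    offset = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied
```

Note that this does no checking at all: if the source is ever
truncated or rewritten, the copy silently ends up wrong.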

> 
> We already use rsync widely for other purposes on these servers and a patch
> like I mentioned would allow us to use it for this extra job too. 

Even though it pretends to be, rsync is not really a Swiss Army knife.
It's fundamentally suited for synchronization of two files.  But your
case only has one potential source of data.  No independent changes
are allowed at the destination so no synchronization is needed.
You're just copying data in one direction.  For that problem, the
simplest solution will never include rsync.

> 
> I know it's forcing rsync to do something that doesn't make sense in the
> general case, but in the specific case of files which are almost always
> appended, it could be a gain. 
> 
> > Probably not.  I suspect even what you describe wouldn't give you what
> > you want.  How would you reliably choose n?  
> 
> For my application, I could use:
> 
> n = max("current size of file on logging server minus 1Mbyte", 0)

IOW, files on logging server aren't changing.  See above.
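
Spelled out, your choice of n is just the following. The function
name resume_offset is made up for illustration, and reading "1Mbyte"
as 2**20 bytes is my assumption.

```python
MBYTE = 1024 * 1024  # assumption: "1Mbyte" taken to mean 2**20 bytes


def resume_offset(dest_size, overlap=MBYTE):
    """Return n: the destination size minus the overlap, clamped at 0."""
    return max(dest_size - overlap, 0)
```

Everything before n is trusted to be identical on both sides, which
only holds if the copy on the logging server never changes.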

-chris

