Large files and symlinks

Thu Jul 31 06:53:04 EST 2003

On Wed, Jul 30, 2003 at 01:48:09PM +0100, Tim Shaw wrote:
> Hi,
> 
> I'm mirroring a single server to multiple clients. Currently I'm using 
> scp, but I (think I) want to use rsync.
> 
> The files I'm mirroring are large - c.4GB (video data)
> 
> Each client has a different set of these files.
> 
> The transfer is done over the internet, and may fail (regularly!).
> 
> I set up a separate directory for each client, and put in symlinks to 
> the actual files (maximum 35 per client). There is only one copy of each 
> file on the server (for obvious reasons). This changes weekly(ish).
> 
> I think I want to use --partial and --copy-links (or 
> --copy-unsafe-links) among other things ... but I'm a little concerned 
> about what I've read about the --partial option. From my reading, it 
> appears that, if the connection fails, the *source* file gets truncated. 
> Am I correct in this interpretation?
> 
> If I am correct, a failed connection to the client will cause the file 
> on the server to be corrupted - and the next client will get the corrupt 
> file. This strikes me as odd behaviour - which is why I'm questioning my 
> interpretation of the --partial option.

Your interpretation is incorrect.  Before I start, let's be
clear that client-server is irrelevant once the
connection is established.  The only thing relevant is
sender-receiver.

No matter what, rsync will not modify the sender's file, so
corruption on the sender will not happen.

What rsync does is create a temporary file and build that
during transfer.  When the transfer is complete, it renames
the temporary file to the permanent file name.  If the
transfer of the file is interrupted before completion, it
removes the temporary file.  In this way users don't see the
incomplete file at any time.  There is just a quick switch
when complete.

With --partial, if the transfer of a file has started but is
incomplete when rsync terminates, the temporary file is not
deleted but renamed as though the transfer were complete.
Users then do see a partial file.  This partial file will
then be used by rsync the next time it is run to avoid
retransmitting what was already received.
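
The two paragraphs above can be sketched roughly in Python.
This is only an illustration of the behaviour described, not
rsync's actual code; receive_file and keep_partial are
made-up names, and an exception raised by the chunk iterator
stands in for a dropped connection.

```python
import os
import tempfile


def receive_file(chunks, dest_path, keep_partial=False):
    """Write chunks to a temp file, then rename it over dest_path.

    chunks is any iterable of bytes. keep_partial is a hypothetical
    flag that imitates the effect of --partial as described above.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Keep the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    complete = False
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        complete = True
    finally:
        if complete or keep_partial:
            # Quick switch: readers see the old file or the new one.
            os.replace(tmp_path, dest_path)
        else:
            # Default behaviour: discard the incomplete temp file.
            os.remove(tmp_path)
```

With keep_partial left False, an interrupted call leaves any
existing dest_path untouched; with it set True, the shorter
partial file replaces it.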

In many cases invoking --partial is worse than not.  If you
are rsyncing a 4GB file and the transfer is interrupted after
500MB has been synced, the receiver's existing 4GB file is
replaced by a 500MB file which now has less in common with
the source than the 4GB file did.

> 
> Also, I think I should be issuing per-file rsync's to minimise the setup 
> time. This is not a problem as the commands are generated anyway. Is 
> this correct?

Not really.  In fact, doing it this way will mean you have
to handle file deletion separately, which is more overhead.
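
Since the commands are generated anyway, here is a minimal
sketch of generating one rsync run per client directory rather
than one per file.  The function name and the paths in the
usage note are hypothetical, and adding --delete so that rsync
handles deletions itself is my assumption about what you want;
check it against your setup before using it.

```python
def rsync_command(client_dir, remote_target, partial=False):
    """Build one rsync argument list covering a whole client directory."""
    # --copy-links sends the files the symlinks point to;
    # --delete (an assumption here) removes files that are gone
    # from client_dir on the receiving side.
    args = ["rsync", "--archive", "--copy-links", "--delete"]
    if partial:
        args.append("--partial")
    # Trailing slash: sync the directory's contents, not the
    # directory itself.
    args.append(client_dir.rstrip("/") + "/")
    args.append(remote_target)
    return args
```

The returned list can be passed to subprocess.run(), e.g.
rsync_command("/srv/mirror/client1", "client1:/data/video/").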

If my suspicion is correct, rsync isn't the best tool for
your purposes anyway.  I suspect that you aren't updating
any files but instead are transferring new ones and deleting
old ones.  Unless you are syncing edits of video files where
large portions of the file remain unchanged, I seriously
doubt that there is any common data between two video files,
so rsync will be less efficient than scp or any other
transfer technique.

If my suspicion is incorrect and you really are syncing
files where only portions have changed, there will be
difficulties with any released version of rsync up to and
including 2.5.6.  At this date, unless you are running the
version of rsync in CVS (or >= 2.5.7 when released), you are
most likely to encounter block-level checksum collisions and
wind up syncing each file twice.

If you are having problems with scp failing with incomplete
transfers, what you may want to do is write or locate
something that can work over ssh but gives you functionality
similar to the --continue option of wget.  That could even
be done by a shell script using dd with the seek= and skip=
options, although I'd be more inclined to write something in
C or Perl.
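
As a rough illustration of the resume idea, here is a minimal
Python sketch that works on two local paths only.  Over ssh
the reading half would have to run on the remote host, which
is not shown.  resume_copy is a made-up name, and the sketch
assumes that whatever already sits in the destination is a
correct prefix of the source.

```python
import os


def resume_copy(src_path, dest_path, block_size=1024 * 1024):
    """Append to dest_path whatever part of src_path it is still missing.

    Returns the number of bytes copied in this call.
    """
    # Bytes already received; 0 if the destination does not exist yet.
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    copied = 0
    with open(src_path, "rb") as src, open(dest_path, "ab") as dest:
        src.seek(offset)  # plays the role of dd's skip= on the input
        while True:
            block = src.read(block_size)
            if not block:
                break
            # Append mode plays the role of dd's seek= on the output.
            dest.write(block)
            copied += len(block)
    return copied
```

Calling it again after an interruption picks up where the
previous call stopped; it does no checksum of the bytes
already present, so a final comparison is still advisable.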

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt