Renaming a directory results in an expensive retransmission

N.J. van der Horn (Nico) nico at vanderhorn.nl
Fri Oct 5 21:35:57 GMT 2007


We have been using rsync for several years, and for the past couple of months
we have used it to back up remote servers, some with more than 200GB capacity.

Windows users in particular sometimes have the (bad) habit of renaming
a directory that has huge amounts of data below it.

We see the same nasty results that you describe:

* rsync "thinks" that the old directory name has disappeared and deletes
  the directory on the target machine, throwing away the expensive
  transmission
* the new directory name initiates a fresh / full (re)transmission,
  sometimes taking days... while the "real work" would be done in
  minutes...
* the servers we back up have between 20GB and 200GB capacity.
* all rsyncs are run in parallel; average sync time is 1.5 hours for 900GB.
* when a "user" behaves as described, it takes days to a week to resync.

It is a tricky problem to deal with, I think. It is tempting to keep a
checksummed file/directory list on both sides with information like:

* a fingerprint/signature/checksum to identify each file or directory
* inode number
* timestamp
* filesize
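
A minimal sketch of how such a list could be produced, in Python. The names
build_manifest and file_fingerprint are hypothetical, and the choice of
SHA-256 as the fingerprint is an assumption; none of this is part of rsync.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to its metadata."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are listed.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            info = os.stat(full)
            manifest[os.path.relpath(full, root)] = {
                "fingerprint": file_fingerprint(full),
                "inode": info.st_ino,
                "mtime": int(info.st_mtime),
                "size": info.st_size,
            }
    return manifest
```

Running build_manifest on both sides gives two lists that can be compared
without transferring any file data. Note that inode numbers are only
meaningful within one filesystem, so they cannot be compared across machines.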

If a file appears to be deleted because its name/path has changed,
it could possibly be identified by its fingerprint and used to sync
cleverly ;-)
This is with the thought of expanding --fuzzy, giving it more functionality
(hint).
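
The matching idea could look roughly like this in Python. This is only a
sketch under assumptions: find_renames is a hypothetical helper, and each
side is represented as a dict mapping a relative path to a
(fingerprint, size) tuple. It is not how --fuzzy works today.

```python
from collections import defaultdict


def find_renames(source, target):
    """Pair target files that look deleted with source files that look new.

    source and target map relative path -> (fingerprint, size).
    Returns a list of (old_path_on_target, new_path_on_source) tuples.
    """
    # Index target files that no longer exist on the source by their content.
    orphans = defaultdict(list)
    for path, content_key in target.items():
        if path not in source:
            orphans[content_key].append(path)

    renames = []
    for path, content_key in sorted(source.items()):
        if path in target:
            continue  # same path on both sides; a normal sync handles it
        candidates = orphans.get(content_key)
        if candidates:
            # Same content under a different name: treat it as a rename.
            renames.append((candidates.pop(0), path))
    return renames


source = {"DirectoryNew/File1": ("abc", 5), "other.txt": ("def", 3)}
target = {"Directory1/File1": ("abc", 5), "other.txt": ("def", 3)}
print(find_renames(source, target))
```

For the example above, the only pair found is Directory1/File1 on the
target and DirectoryNew/File1 on the source.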

For some time I have been experimenting with a solution to this problem: some
sort of "preprocessor" that tries to identify renamed files in the way
described and creates hardlinks (ln) to let rsync think the files are already
in the new location. I traverse the directory trees on both sides (remote and
local), producing a file with the information described above, but it is still
a work in progress...
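
The hardlink step could be sketched like this in Python. It is an
illustration only, not my actual preprocessor: link_renamed_files is a
hypothetical name, and it assumes the list of (old path, new path) pairs has
already been worked out and that both paths are on the same filesystem.

```python
import os


def link_renamed_files(target_root, renames):
    """Hard-link each existing target file to its new relative path.

    renames is a list of (old_relative_path, new_relative_path) tuples.
    Returns the new relative paths that were actually linked.
    """
    linked = []
    for old_rel, new_rel in renames:
        old_full = os.path.join(target_root, old_rel)
        new_full = os.path.join(target_root, new_rel)
        # Never overwrite anything, and skip entries whose source is missing.
        if not os.path.isfile(old_full) or os.path.lexists(new_full):
            continue
        os.makedirs(os.path.dirname(new_full), exist_ok=True)
        os.link(old_full, new_full)  # equivalent of `ln old new`
        linked.append(new_rel)
    return linked
```

Because a hard link shares size and modification time with the original,
the expectation is that rsync's default quick check will find the file
already in place under the new name and skip the transfer.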

The cost of keeping a database in this scenario would be truly justified
for me.

That rsync then deletes the files in the old location is no longer a
problem for me.

But... I am just a user with needs, also looking for a solution to a problem,
and hoping this can be solved by the clever developers ;-)

Maybe there is already a solution available, and we are chasing shadows?


Thanks, Nico


Frank Thomas wrote:
>
> Good day,
>
>  
>
> I’ve got a question regarding the usage of rsync that I just cannot
> figure out. I’ve done a fair hunt for the answer, but I’m stumped.
>
>  
>
> Here is the situation.
>
>  
>
> I have two PCs running Linux and using rsync to perform a backup from
> server1 to server2. For example:
>
>   rsync -avzr -e 'ssh -i/root/.ssh/id_rsa' --delete \
>     /home/samba/admin/software \
>     www.some-server.com:/home/RemoteSystems/company/home/samba/admin
>
> Let’s say I have a directory within rsync’s scope to sync called
> directory1.
>
> Rsync is run and directory1 is sync’ed from server1 to server2. Also,
> a file named File1 is sync’ed because it is in the directory being
> sync’ed.
>
>  
>
> server1                 server2
>
>   Directory1              Directory1
>
>      File1                   File1
>
>  
>
> Now, let’s say a user comes and changes the name of Directory1 on
> server1 to DirectoryNew. rsync performs the following actions:
>
> 1. rsync recognizes that Directory1 is not on server1, but it is on
>    server2, so it flags it and its contents for deletion on server2.
>
> 2. rsync recognizes that DirectoryNew is on server1, but not on
>    server2, so it flags it and its contents for copying to server2.
>
> 3. rsync performs these actions to make the two directories the same.
>
>  
>
> This action is the simplest method of performing an rsync, but it
> would be nice for rsync to be intelligent enough to recognize a
> name change without an inode change on the source. So the actions
> performed would be:
>
> 1. rsync recognizes that Directory1 is not on server1, but its inode
>    still is. rsync reads the new directory name and flags the name
>    change from Directory1 to DirectoryNew on server1.
>
> 2. rsync reads server2, sees that Directory1 exists, and flags a
>    pending name change on server2 from Directory1 to DirectoryNew.
>
> 3. The name is changed on server2. No files or directories are deleted
>    and re-transferred from source to destination, as the structure
>    under the directory has not changed.
>
>  
>
> Why go through all this work? I’ve had personnel change the name of a
> directory that has several gigabytes of data in it without notifying me,
> and at night rsync tries to perform the directory and file dance and
> fails simply because the volume is so great. It would be nice to
> either (one) recognize a large discrepancy between the source and
> destination before anything occurs, by giving a message with the number
> of bytes that would potentially be transferred (this doesn’t work with
> the dry-run option), or (two) do the fancy dance by recognizing a name
> change rather than a deletion of a directory.
>
>  
>
> Thanks.
>
>  
>
> *Frank Thomas*
>
>  
>

-- 
Behandeld door / Handled by: N.J. van der Horn (Nico)
---
ICT Support Vanderhorn IT-works, www.vanderhorn.nl,
Voorstraat 55, 3135 HW Vlaardingen, The Netherlands,
Tel +31 10 2486060, Fax +31 10 2486061



