keep rsync from removing unfinished source files?

Matt McCutchen matt at mattmccutchen.net
Sun Sep 7 20:03:10 GMT 2008


On Sun, 2008-09-07 at 10:59 -0400, Aaron Swartz wrote:
> I have two machines, speed and mass. speed has a fast Internet
> connection and is running a crawler which downloads a lot of files to
> disk. mass has a lot of disk space. I want to move the files from
> speed to mass after they're done downloading. Ideally, I'd just run:
> 
>     $ rsync --remove-source-files speed:/var/crawldir .
> 
> but I worry that rsync will unlink a source file that hasn't finished
> downloading yet. (I looked at the source code and I didn't see
> anything protecting against this.)

Yes, that could happen.

> Ideas I had were:
>  - a pause between downloading the file list and downloading the files

This approach would fail for any file whose download outlasts the
pause, so very large files would need a correspondingly long pause.

>  - an exclude rule for recently modified files
>  - a check to not delete a file if its file size has changed since it
> was copied

Either of these would probably work, and they would not be hard to
implement by modifying rsync, but they seem hackish.
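
For what it's worth, the "recently modified" idea can probably be
approximated without patching rsync by building the file list with find
and passing it in through --files-from. This is only a rough sketch: the
helper names and the 10-minute threshold are made up for illustration,
-mmin is a GNU/BSD find extension rather than POSIX, and -a is my
addition. It remains a heuristic, since a stalled download looks just as
"settled" as a finished one.

```shell
# Hypothetical helper: print the regular files under directory $1 whose
# mtime is more than $2 minutes old, as NUL-terminated paths relative to $1.
# Note: -mmin is a GNU/BSD find extension, not POSIX.
list_settled_files() {
    (cd "$1" && find . -type f -mmin "+$2" -print0)
}

# Hypothetical wrapper using the host and path from the question; the
# 10-minute threshold is an arbitrary assumption. The list is built on
# speed, and rsync then transfers (and afterwards removes) only the
# files named in that list.
pull_settled_files() {
    ssh speed 'cd /var/crawldir && find . -type f -mmin +10 -print0' |
        rsync -a --from0 --files-from=- --remove-source-files \
            speed:/var/crawldir/ ./crawldir/
}
```

Running pull_settled_files on mass would fetch only files that have
been untouched for 10 minutes; list_settled_files is the same selection
step in a form that can be tried on a local directory first.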

IMO, a proper solution is to have the crawler indicate somehow which
files are unfinished so rsync can avoid copying those.  E.g., the
crawler could name unfinished files according to a special pattern so
that you could exclude them with --exclude, or it could keep them in a
temporary directory that rsync doesn't visit.
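
A minimal sketch of the naming-pattern variant, assuming the crawler
can be changed to write each download under a ".part" suffix and rename
it once complete. The suffix, the helper names, and the -a flag are my
assumptions, not anything the crawler or rsync requires. The rename is
atomic as long as both names are on the same filesystem, so a file
under its final name is always complete.

```shell
# Hypothetical crawler-side helper: read the download from stdin into
# "<dest>.part", and rename it to <dest> only if the write succeeded.
# A failed or interrupted download leaves just the .part file behind.
save_download() {
    cat > "$1.part" && mv "$1.part" "$1"
}

# Hypothetical pull, run on mass, using the host and path from the
# question: anything still named *.part is neither copied nor removed.
pull_finished_files() {
    rsync -a --exclude='*.part' --remove-source-files \
        speed:/var/crawldir/ ./crawldir/
}
```

The crawler would then do something like
"fetch_url ... | save_download /var/crawldir/page.html" (fetch_url is a
placeholder for whatever it uses to download), and mass would run
pull_finished_files periodically.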

Matt


