keep rsync from removing unfinished source files?
Matt McCutchen
matt at mattmccutchen.net
Sun Sep 7 20:03:10 GMT 2008
On Sun, 2008-09-07 at 10:59 -0400, Aaron Swartz wrote:
> I have two machines, speed and mass. speed has a fast Internet
> connection and is running a crawler which downloads a lot of files to
> disk. mass has a lot of disk space. I want to move the files from
> speed to mass after they're done downloading. Ideally, I'd just run:
>
> $ rsync --remove-source-files speed:/var/crawldir .
>
> but I worry that rsync will unlink a source file that hasn't finished
> downloading yet. (I looked at the source code and I didn't see
> anything protecting against this.)
Yes, that could happen.
> Ideas I had were:
> - a pause between downloading the file list and downloading the files
This approach would fail for very large files unless the pause is
correspondingly long.
> - an exclude rule for recently modified files
> - a check to not delete a file if its file size has changed since it
> was copied
Either of these would probably work, and they would not be hard to
implement by modifying rsync, but they seem hackish.
IMO, a proper solution is to have the crawler indicate somehow which
files are unfinished so that rsync can avoid copying them. For example,
the crawler could name unfinished files according to a special pattern
that you then exclude with --exclude, or it could keep them in a
temporary directory that rsync doesn't visit.
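
A minimal sketch of the naming approach, assuming the crawler can be made
to write each download through a small helper. The function name
save_finished and the ".part" suffix are hypothetical; any pattern that
--exclude can match would do.

```shell
# Hypothetical crawler-side helper: write the download under a temporary
# name, then rename it to the final name once it is complete.
# The ".part" suffix is an assumption, not something rsync requires.
save_finished() {
    dest=$1
    tmp="$dest.part"
    # Read the downloaded data from stdin into the temporary name.
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    # A rename within one filesystem is atomic, so the final name only
    # ever refers to a fully written file.
    mv -- "$tmp" "$dest"
}

# Transfer side, run on mass (shown for illustration, not executed here):
#   rsync -r --remove-source-files --exclude='*.part' speed:/var/crawldir .
```

The crawler would then pipe each response body into the helper, e.g.
"fetch_url ... | save_finished /var/crawldir/page.html", where fetch_url
stands in for whatever the crawler actually uses to download. The
temporary-directory variant is the same idea, with the unfinished file
kept outside /var/crawldir (but on the same filesystem, so the rename
stays atomic) instead of under a ".part" name.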
Matt