specifying a list of files to transfer

Andrew J. Schorr aschorr at telemetry-investments.com
Tue Jan 14 16:03:38 EST 2003


Hi,

I don't want to start another --files-from war, but I am attaching
an updated version of my patch to allow you to specify a list
of files to transfer.  The normal rsync syntax allows you to specify
a list of SRC files to transfer on the command line.  This patch
adds some new options to allow you to instead supply a file that
contains a list of files to transfer.

The previous version of the patch was against rsync-2.4.6; this version
works for rsync-2.5.5.  The only real changes relate to the use of
the popt option parsing library in 2.5 (not used in 2.4).  This had
the minor effect of removing the possibility of using "-" to indicate
stdin since the new library seems to interpret this as an option and
barfs.  So instead I allow the use of "/dev/stdin".

By the way, this patch should also work against rsync-2.5.6pre1 except
for a couple of changes relating to white space and comments.  So a couple
of patch hunks are rejected but are easy to fix by hand.  If there is a
need, I can post an updated patch.

Last time we discussed this, Dave Dykstra objected to this patch
for two reasons:

   1. This patch only works in a single direction: when sending from a local
      system to a remote system.  It does not handle the case where you
      are receiving from a remote system to a local system.
   
   2. This capability is possible to achieve by specifying a list
      of files with --include-from and then adding --exclude '*' to
      ignore other files.  While this is true, it turns out to be
      much slower.  I have finally run a performance test to demonstrate
      this.  Results are below.

The basic idea of the patch is to handle the case where you already know
a list of files that might need to be updated and don't want to use
rsync's recursive directory tree scanning logic to enumerate all files.
The patch adds the following options:

     --source-list           SRC arg will be a (local) file name containing a list of files, or /dev/stdin
     --null                  used with --source-list to indicate that the file names will be separated by null (zero) bytes instead of linefeed characters; useful with gfind -print0
     --send-dirs             send directory entries even though not in recursive mode
     --no-implicit-dirs      do not send implicit directories (parents of the file being sent)

The --source-list option allows you to supply an explicit list of filenames
to transport without using the --recursive feature and without playing
around with include and exclude files.  As discussed below, the same
thing can be done by combining --recursive with --include-from and --exclude,
but it's significantly slower and more arcane to do it that way.

The --null flag allows you to handle file names with embedded linefeeds.
This is in the style of GNU find's -print0 action.
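
To make the record-separator idea concrete, here is a minimal standalone
C sketch of reading one file name at a time from a stream, split on either
a linefeed or a null byte.  It is modeled loosely on the get_stdio helper
in the attached patch but is not the patch code itself; the name
read_record and its signature are hypothetical, chosen only for
illustration.  Like the patch, it silently truncates names that do not
fit in the buffer.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of rsync): read one record from fp into
 * buf, stopping at the separator byte `sep` or at end of file.
 * Bytes that do not fit in buf are read and discarded (silent truncation).
 * Returns 1 if a record was read, 0 if the stream was already exhausted.
 */
int read_record(FILE *fp, int sep, char *buf, size_t buflen)
{
    size_t n = 0;
    int c;

    if (buflen == 0)
        return 0;

    while ((c = getc(fp)) != EOF && c != sep) {
        if (n + 1 < buflen)      /* always leave room for the terminator */
            buf[n++] = (char)c;
    }
    buf[n] = '\0';

    /* EOF with nothing collected means there are no more records. */
    return !(c == EOF && n == 0);
}
```

Passing '\n' as sep corresponds to the default behavior, and passing '\0'
corresponds to --null, in which case a name containing a linefeed comes
back as a single record.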

The --send-dirs option overcomes a problem where rsync refuses to send
directories unless it's in recursive mode.  One needs this to make sure
that even empty directories get mirrored.

And the --no-implicit-dirs option turns off the default behavior in which
all the parent directories of a file are transmitted before sending the
file.  That default behavior is very inefficient in my scenario where I
am taking the responsibility for sending those directories myself.

And now for a performance test:

I have a directory tree containing 128219 files of which 16064 are
directories.

To start the test, I made a list of files that had changed in the
past day:

   find . -mtime -1 -print > /tmp/changed

(Normally, my list of candidate files is generated by some other means;
this is just a test example.)  There were 5059 entries in /tmp/changed.

I used my new options to sync up these files to another host
as follows:

      time rsync -RlHptgoD --numeric-ids --source-list \
        --send-dirs --no-implicit-dirs -xz --stats /dev/stdin \
        remotehost:/extra_disk/tmp/tree1 < /tmp/changed

Here were the reported statistics:

         Number of files: 5059
         Number of files transferred: 5056
         Total file size: 355514100 bytes
         Total transferred file size: 355514100 bytes
         Literal data: 355514100 bytes
         Matched data: 0 bytes
         File list size: 139687
         Total bytes written: 154858363
         Total bytes read: 80916

         wrote 154858363 bytes  read 80916 bytes  364992.41 bytes/sec
         total size is 355514100  speedup is 2.29

And the time statistics:

      112.53u 8.82s 7:03.92 28.6%

I then ran the same command again (in which case there was nothing to
transfer).  Here's how long it took:

      0.54u 0.62s 0:08.61 13.4%

Now to compare with the recursive method using --include-from.  First, we
must create the list of include patterns.  With --include-from, every
parent directory of each file must also appear as an include pattern,
since otherwise --exclude '*' would stop rsync from descending into it.
The following gawk one-liner seems to do the job:

      gawk '$0 != "./" {sub(/^\.\//,"")} {while ((length > 0) && !($0 in already)) {print "/"$0; already[$0] = 1; sub(/\/[^\/]*$/,"")}}' /tmp/changed > /tmp/includes
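
For readers who do not speak gawk, the same expansion can be sketched in
C: strip a leading "./", then record the path and each of its parents
until reaching one that has already been recorded.  This is only an
illustrative sketch under assumed fixed limits; pattern_set,
add_with_parents, MAX_PATTERNS and MAX_LEN are hypothetical names and not
part of rsync or the patch.  Patterns are stored without the leading "/"
that the gawk script prints.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PATTERNS 64   /* assumed limit, for illustration only */
#define MAX_LEN      256  /* longer paths are truncated in this sketch */

struct pattern_set {
    char items[MAX_PATTERNS][MAX_LEN];
    size_t count;
};

/* Linear search; a real tool would use a hash table. */
static int seen(const struct pattern_set *ps, const char *s)
{
    for (size_t i = 0; i < ps->count; i++)
        if (strcmp(ps->items[i], s) == 0)
            return 1;
    return 0;
}

/*
 * Record `path` and every parent directory not yet in the set.
 * Returns the number of entries newly added.
 */
size_t add_with_parents(struct pattern_set *ps, const char *path)
{
    char work[MAX_LEN];
    size_t added = 0;

    /* Drop a leading "./" unless the whole line is exactly "./". */
    if (strcmp(path, "./") != 0 && strncmp(path, "./", 2) == 0)
        path += 2;
    snprintf(work, sizeof work, "%s", path);

    while (work[0] != '\0' && !seen(ps, work) && ps->count < MAX_PATTERNS) {
        snprintf(ps->items[ps->count++], MAX_LEN, "%s", work);
        added++;

        /* Chop the last path component to move up to the parent. */
        char *slash = strrchr(work, '/');
        if (slash == NULL)
            break;
        *slash = '\0';
    }
    return added;
}
```

Printing each stored entry with a leading "/" should give the same lines
the gawk script writes to /tmp/includes for the same input order.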

This creates a file containing 11464 lines (the original 5059 files
plus all the parent directories).  Now we can run the normal rsync
command:

      time rsync -RlHptgoD --numeric-ids -r -xz \
        --include-from=/dev/stdin --exclude '*' --stats . \
        remotehost:/extra_disk/tmp/tree2 < /tmp/includes

Here were the reported statistics:

         Number of files: 11464
         Number of files transferred: 5056
         Total file size: 355520625 bytes
         Total transferred file size: 355520625 bytes
         Literal data: 355520625 bytes
         Matched data: 0 bytes
         File list size: 250242
         Total bytes written: 154970663
         Total bytes read: 80916

         wrote 154970663 bytes  read 80916 bytes  222935.41 bytes/sec
         total size is 355520625  speedup is 2.29

And the time statistics:

      218.45u 12.92s 11:34.48 33.3%

I then repeated the command to see how long it takes when there's nothing
to transfer:

      106.77u 6.72s 3:12.42 58.9%

I then did a recursive diff on the trees to make sure they were identical,
and they were.

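As you can see, the --source-list feature yields significant savings in
both CPU time on the sending host and real time to complete the transfer:
the initial transfer took 7:04 of wall-clock time instead of 11:34 (about
1.6 times faster), and the rerun with nothing to transfer took under 9
seconds instead of more than 3 minutes (about 22 times faster).  It should
also result in some CPU time savings on the receiving host since there is
no need to transfer all the directories that we were forced to add to get
the --include-from syntax to work properly.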

I think that should cover everything.  Let me know if there are any questions.

Cheers,
Andy
-------------- next part --------------
--- flist.c.orig	Thu Mar 14 16:20:20 2002
+++ flist.c	Fri Jan 10 11:10:58 2003
@@ -41,6 +41,7 @@
 extern int cvs_exclude;
 
 extern int recurse;
+extern int send_dirs;
 
 extern int one_file_system;
 extern int make_backups;
@@ -662,8 +663,8 @@
 	if (noexcludes)
 		goto skip_excludes;
 
-	if (S_ISDIR(st.st_mode) && !recurse) {
-		rprintf(FINFO, "skipping directory %s\n", fname);
+	if (S_ISDIR(st.st_mode) && !recurse && !send_dirs) {
+		rprintf(FINFO, "make_file: skipping directory %s\n", fname);
 		return NULL;
 	}
 
@@ -856,14 +857,16 @@
  * I *think* f==-1 means that the list should just be built in memory
  * and not transmitted.  But who can tell? -- mbp
  */
-struct file_list *send_file_list(int f, int argc, char *argv[])
+static struct file_list *send_file_list_proc(int f, char *(*ffunc)(), void *opq)
 {
-	int i, l;
+	int l;
 	STRUCT_STAT st;
 	char *p, *dir, *olddir;
 	char lastpath[MAXPATHLEN] = "";
 	struct file_list *flist;
 	int64 start_write;
+	char *in_fn;
+	extern int implicit_dirs;
 
 	if (show_filelist_p() && f != -1)
 		start_filelist_progress("building file list");
@@ -876,10 +879,10 @@
 		io_start_buffering(f);
 	}
 
-	for (i = 0; i < argc; i++) {
+	while ((in_fn = (*ffunc)(opq)) != NULL) {
 		char *fname = topsrcname;
 
-		strlcpy(fname, argv[i], MAXPATHLEN);
+		strlcpy(fname, in_fn, MAXPATHLEN);
 
 		l = strlen(fname);
 		if (l != 1 && fname[l - 1] == '/') {
@@ -904,8 +907,8 @@
 			continue;
 		}
 
-		if (S_ISDIR(st.st_mode) && !recurse) {
-			rprintf(FINFO, "skipping directory %s\n", fname);
+		if (S_ISDIR(st.st_mode) && !recurse && !send_dirs) {
+			rprintf(FINFO, "send_file_list: skipping directory %s\n", fname);
 			continue;
 		}
 
@@ -922,7 +925,7 @@
 					dir = fname;
 				fname = p + 1;
 			}
-		} else if (f != -1 && (p = strrchr(fname, '/'))) {
+		} else if (f != -1 && (p=strrchr(fname,'/')) && implicit_dirs) {
 			/* this ensures we send the intermediate directories,
 			   thus getting their permissions right */
 			*p = 0;
@@ -1020,6 +1023,49 @@
 	return flist;
 }
 
+struct argv_data {
+   int argc;
+   char **argv;
+};
+
+static char *
+get_arg(struct argv_data *ad)
+{
+   return (ad->argc-- > 0) ? *(ad->argv++) : NULL;
+}
+
+struct file_list *send_file_list(int f, int argc, char *argv[])
+{
+   struct argv_data arg_info;
+
+   arg_info.argc = argc;
+   arg_info.argv = argv;
+   return send_file_list_proc(f,get_arg,&arg_info);
+}
+
+/* note that send_file_list_proc silently truncates the filename to fit
+   in a buffer of MAXPATHLEN characters, so we can safely truncate there */
+static char *
+get_stdio(FILE *fp)
+{
+   static char fnbuf[MAXPATHLEN];
+   char *s = fnbuf;
+   char *eob = &fnbuf[sizeof(fnbuf)-1];
+   int cc;
+   extern int list_rs;
+
+   while (((cc = getc(fp)) != list_rs) && (cc != EOF)) {
+      if (s < eob)
+	 *(s++) = cc;
+   }
+   *s = '\0';
+   return ((cc == EOF) && (s == fnbuf)) ? NULL : fnbuf;
+}
+
+struct file_list *send_file_list_fp(int f,FILE *fp)
+{
+   return send_file_list_proc(f,get_stdio,fp);
+}
 
 struct file_list *recv_file_list(int f)
 {
--- main.c.orig	Wed Mar 27 00:10:44 2002
+++ main.c	Fri Jan 10 10:59:51 2003
@@ -25,6 +25,8 @@
 
 struct stats stats;
 
+static FILE *src_list_fp;
+
 extern int verbose;
 
 static void show_malloc_stats(void);
@@ -559,7 +561,9 @@
 		if (delete_mode && !delete_excluded) 
 			send_exclude_list(f_out);
 		if (!read_batch) /*  dw -- don't write to pipe */
-		    flist = send_file_list(f_out,argc,argv);
+		    flist = (src_list_fp ?
+			     send_file_list_fp(f_out,src_list_fp) :
+			     send_file_list(f_out,argc,argv));
 		if (verbose > 3) 
 			rprintf(FINFO,"file list sent\n");
 
@@ -836,6 +840,7 @@
 	extern int dry_run;
 	extern int am_daemon;
 	extern int am_server;
+	extern int source_list;
 	int ret;
 	extern int write_batch;
 	int orig_argc;
@@ -872,6 +877,14 @@
                 /* FIXME: We ought to call the same error-handling
                  * code here, rather than relying on getopt. */
 		option_error();
+		exit_cleanup(RERR_SYNTAX);
+	}
+
+	if (source_list &&
+	    ((argc != 2) ||
+	     !(src_list_fp = (strcmp(argv[0],"/dev/stdin") ?
+			      fopen(argv[0],"r") : stdin)))) {
+		usage(FERROR);
 		exit_cleanup(RERR_SYNTAX);
 	}
 
--- options.c.orig	Tue Mar 19 15:16:42 2002
+++ options.c	Fri Jan 10 11:01:10 2003
@@ -66,6 +66,10 @@
 int module_id = -1;
 int am_server = 0;
 int am_sender = 0;
+int source_list=0;
+int list_rs='\n';
+int send_dirs=0;
+int implicit_dirs=1;
 int recurse = 0;
 int am_daemon=0;
 int do_stats=0;
@@ -266,6 +270,10 @@
   rprintf(F,"     --bwlimit=KBPS          limit I/O bandwidth, KBytes per second\n");
   rprintf(F,"     --write-batch=PREFIX    write batch fileset starting with PREFIX\n");
   rprintf(F,"     --read-batch=PREFIX     read batch fileset starting with PREFIX\n");
+  rprintf(F,"     --source-list           SRC arg will be a (local) file name containing a list of files, or /dev/stdin\n");
+  rprintf(F,"     --null                  used with --source-list to indicate that the file names will be separated by null (zero) bytes instead of linefeed characters; useful with gfind -print0\n");
+  rprintf(F,"     --send-dirs             send directory entries even though not in recursive mode\n");
+  rprintf(F,"     --no-implicit-dirs      do not send implicit directories (parents of the file being sent)\n");
   rprintf(F," -h, --help                  show this help screen\n");
 #ifdef INET6
   rprintf(F," -4                          prefer IPv4\n");
@@ -287,7 +295,8 @@
       OPT_DELETE_AFTER, OPT_EXISTING, OPT_MAX_DELETE, OPT_BACKUP_DIR, 
       OPT_IGNORE_ERRORS, OPT_BWLIMIT, OPT_BLOCKING_IO,
       OPT_NO_BLOCKING_IO, OPT_WHOLE_FILE, OPT_NO_WHOLE_FILE,
-      OPT_MODIFY_WINDOW, OPT_READ_BATCH, OPT_WRITE_BATCH, OPT_IGNORE_EXISTING};
+      OPT_MODIFY_WINDOW, OPT_READ_BATCH, OPT_WRITE_BATCH, OPT_IGNORE_EXISTING,
+      OPT_SOURCE_LIST, OPT_NULL, OPT_SEND_DIRS, OPT_NO_IMPLICIT_DIRS};
 
 static struct poptOption long_options[] = {
   /* longName, shortName, argInfo, argPtr, value, descrip, argDesc */
@@ -361,6 +370,10 @@
   {"hard-links",      'H', POPT_ARG_NONE,   &preserve_hard_links , 0, 0, 0 },
   {"read-batch",       0,  POPT_ARG_STRING, &batch_prefix, OPT_READ_BATCH, 0, 0 },
   {"write-batch",      0,  POPT_ARG_STRING, &batch_prefix, OPT_WRITE_BATCH, 0, 0 },
+  {"source-list",      0,  POPT_ARG_NONE,   &source_list, 0, 0, 0 },
+  {"null",             0,  POPT_ARG_NONE,   0,             OPT_NULL, 0, 0},
+  {"send-dirs",        0,  POPT_ARG_NONE,   &send_dirs, 0, 0, 0 },
+  {"no-implicit-dirs", 0,  POPT_ARG_NONE,   0,             OPT_NO_IMPLICIT_DIRS, 0, 0},
 #ifdef INET6
   {0,		      '4', POPT_ARG_VAL,    &default_af_hint,   AF_INET , 0, 0 },
   {0,		      '6', POPT_ARG_VAL,    &default_af_hint,   AF_INET6 , 0, 0 },
@@ -561,6 +574,14 @@
 		case OPT_READ_BATCH:
 			/* popt stores the filename in batch_prefix for us */
 			read_batch = 1;
+			break;
+
+		case OPT_NULL:
+			list_rs = '\0';
+			break;
+
+		case OPT_NO_IMPLICIT_DIRS:
+			implicit_dirs = 0;
 			break;
 
 		default:
--- proto.h.orig	Sun Mar 24 22:51:17 2002
+++ proto.h	Thu Jan  9 14:02:18 2003
@@ -82,6 +82,7 @@
 void send_file_name(int f, struct file_list *flist, char *fname,
 		    int recursive, unsigned base_flags);
 struct file_list *send_file_list(int f, int argc, char *argv[]);
+struct file_list *send_file_list_fp(int f,FILE *fp);
 struct file_list *recv_file_list(int f);
 int file_compare(struct file_struct **f1, struct file_struct **f2);
 int flist_find(struct file_list *flist, struct file_struct *f);

