PROPOSAL: --link-hash-dest, additional linking of files to their HASH values

Drake Diedrich diedrich at usc.edu
Wed Jan 19 22:45:22 GMT 2005


   I'm using a few utilities to accomplish the same thing in a second pass
after rsync runs.  The utilities all use a two-level hash directory layout
(256 directories of 256 subdirectories, i.e. 65,536 leaf directories), which
with our current backups puts a little over 100 files per directory.
Anywhere from hundreds of thousands to tens of millions of files shouldn't
waste too many inodes or put an unreasonable number of files into each
directory.  The code that generates the hash directory name is parameterized,
though, and could easily generate 1-4 levels.

#include <stdlib.h>
#include <string.h>
#include <openssl/md4.h>              /* MD4_DIGEST_LENGTH (16 bytes) */

static const char hex_char[] = "0123456789abcdef";

static int prefixdirs = 2;            /* number of directory levels (1-4) */

/* Build "<hashdir>/xx/yy/<remaining hex digits>" from a raw digest md.
 * (Wrapper function and declarations filled in so the fragment compiles
 * on its own.)  The caller frees the returned string. */
static char *make_hash_path(const char *hashdir, const unsigned char *md)
{
  int digestlength = MD4_DIGEST_LENGTH;
  size_t hashdirlen = strlen(hashdir);
  char *hashpath;
  int i;

  /* enough room as long as at least one digest byte is left for the name */
  hashpath = malloc(hashdirlen+digestlength*3+1);
  if (!hashpath)
    return NULL;
  strcpy(hashpath,hashdir);
  hashpath[hashdirlen]='/';

  /* the first prefixdirs bytes become directory components: "xx/" */
  for (i=0;i<prefixdirs;i++) {
    hashpath[hashdirlen+i*3+1] = hex_char[md[i] >> 4];
    hashpath[hashdirlen+i*3+2] = hex_char[md[i] & 0xf];
    hashpath[hashdirlen+i*3+3] = '/';
  }
  /* the remaining bytes form the file name within the leaf directory */
  for (i=prefixdirs;i<digestlength;i++) {
    hashpath[hashdirlen+i*2+prefixdirs+1] = hex_char[md[i] >> 4];
    hashpath[hashdirlen+i*2+prefixdirs+2] = hex_char[md[i] & 0xf];
  }
  hashpath[hashdirlen+prefixdirs+digestlength*2+1] = '\0';

  return hashpath;
}
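   As a quick illustration (the wrapper name make_hash_path and the example
digest below are mine, for illustration only, not part of the patch or the
utilities): with prefixdirs = 2 the first two digest bytes select the
directories and the remaining fourteen bytes become the 28-hex-digit file
name.

  unsigned char md[MD4_DIGEST_LENGTH] = { 0xde, 0xad, 0xbe, 0xef };  /* rest zero */
  char *p = make_hash_path("/backups/.hash", md);
  /* p is now "/backups/.hash/de/ad/beef000000000000000000000000" */
  free(p);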



   The three utilities (hashimplode, hashdelete, hashpurge) are at
http://www.cmb.usc.edu/people/dld/backuputils/.  hashimplode calculates the
hashes and hard-links files.  hashdelete removes a snapshot, removing
orphaned hash files as it goes, and hashpurge is roughly "find /hashdir
-links 1 | xargs rm", in case you haven't always used hashdelete.  All three
use rename() to avoid races on files (see the sketch below), so they can be
interleaved and run in parallel while sharing the same hash directory (which
we're doing to exploit the parallel seek capacity of RAIDs and to back up
multiple smaller fileservers at once).  hashimplode has an option to skip
files that are already hard-linked (basically the ones rsync already
hard-linked to the previous backup - this assumes users aren't hard-linking
their own files).
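
   For what it's worth, the rename()-based linking step is easy to sketch.
The fragment below is a minimal, hypothetical version of such a step
(implode_one is my naming, not the actual hashimplode code; it reuses the
make_hash_path wrapper above and omits directory creation and most error
handling): a file either becomes the canonical copy in the hash directory,
or is atomically replaced by a hard link to the existing copy, so parallel
runs never observe a missing or half-written file.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int implode_one(const char *hashdir, const char *file,
                       const unsigned char *md)
{
  char tmpname[4096];
  char *hashpath = make_hash_path(hashdir, md);
  int rc = -1;

  if (!hashpath)
    return -1;

  if (link(file, hashpath) == 0) {
    /* First file with this content: it becomes the canonical copy. */
    rc = 0;
  } else if (errno == EEXIST) {
    /* The hash directory already holds this content: link the canonical
     * copy to a temporary name next to the file, then rename() it over
     * the file.  rename() is atomic, so concurrent runs always see
     * either the old file or the new link, never a gap. */
    snprintf(tmpname, sizeof tmpname, "%s.hashtmp.%d", file, (int)getpid());
    if (link(hashpath, tmpname) == 0 && rename(tmpname, file) == 0)
      rc = 0;
    else
      unlink(tmpname);
  }
  free(hashpath);
  return rc;
}

(The hash directory has to live on the same filesystem as the backup trees,
of course, or link() fails with EXDEV.)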

   Using this patch would save a second pass through the inodes and be a big
win (30-50% faster?  Almost the same gain as upgrading to 10K RPM disks,
while keeping the 250 GB+ capacity of 7200 RPM drives).  The utilities may
still prove useful for migrating existing backups to this hash structure,
recovering lost hash directories, and pruning the hash directory.  I'd like
to make these utilities use the same hash structure and race avoidance that
an rsync hashdir patch uses.

-Drake <diedrich at usc.edu>

