Does any rsync-based diff, rmdup, cvs software exist?

Eric Ziegast ziegast at vix.com
Thu May 16 17:19:02 EST 2002


> I'd like to be able to run GNU-diff type comparisons,
> but use R-sync technology to make it efficient to see what's 
> different without transmitting all that data.

Rsync is great at synchronizing data between a source and destination.
For diff-like comparisons, perhaps something like CVS might be more
appropriate.

> Another thing I like to do using rsync protocol, 
> is what I call rmdup -- remove duplicates.
> This would allow me to recursively (like diff -r) compare files in
> two (!!MUST BE!!) different directories and remove one (or the other)
> of the duplicates.

A shell script that does something similar to what you want without
using rsync....

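  #!/bin/sh
  
  # Usage: rmdup SOURCE_DIR DESTINATION_DIR   (use absolute paths)
  SOURCE_DIR=${1:?usage: $0 SOURCE_DIR DESTINATION_DIR}
  DESTINATION_DIR=${2:?usage: $0 SOURCE_DIR DESTINATION_DIR}
  
  # Our md5 checksum program (rsync uses md4, but the concept is the same)
  MD5=md5sum	# On RedHat 7.1
  #MD5="md5 -r"	# In *bsd (-r prints the checksum first, like md5sum)
  
  # comm needs both lists sorted in the same collation order
  LC_ALL=C; export LC_ALL
  
  # Inventory the source directory
  cd "$SOURCE_DIR" || exit 1
  src=/var/tmp/find.$$.src
  find . -xdev -type f -print | xargs $MD5 | awk '{print $2, $1}' | sort > $src
  
  # Inventory the destination directory
  cd "$DESTINATION_DIR" || exit 1
  dst=/var/tmp/find.$$.dst
  find . -xdev -type f -print | xargs $MD5 | awk '{print $2, $1}' | sort > $dst
  
  # Remove duplicates in the destination directory (we are still in it).
  # File names containing whitespace are not handled.  rm -i reads its
  # answers from the terminal, because stdin is the pipe.
  comm -12 $src $dst | sed -e 's/ .*//' |
  while read -r f; do rm -i "$f" < /dev/tty; done

  # rm $src $dst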

Note: "comm -12" does a line-by-line comparison of the two sorted
      checksum lists.  The output is the lines common to both files.
      If a filename/checksum pair matches in both the source and
      destination directory, the file in the destination directory
      is the "duplicate" (per the definition in the e-mail) and is
      piped on to rm for removal.
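
To see the "comm -12" step in isolation, here is a minimal sketch.
The helper name common_lines is made up for illustration; it sorts
both inputs under the C locale first, since comm expects sorted input.

```shell
#!/bin/sh
# common_lines FILE1 FILE2 (hypothetical helper, not part of the script above):
# print the lines that appear in both files.
common_lines() {
    a_sorted=$(mktemp) || return 1
    b_sorted=$(mktemp) || return 1
    # comm only works on sorted input, and both files must use the
    # same collation order, so force the C locale for sort and comm.
    LC_ALL=C sort "$1" > "$a_sorted"
    LC_ALL=C sort "$2" > "$b_sorted"
    # -1 drops lines unique to FILE1, -2 drops lines unique to FILE2,
    # leaving only the lines common to both.
    LC_ALL=C comm -12 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}
```

Fed two inventory lists, it prints only the entries whose name and
checksum agree, which are the removal candidates.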

Note: Configuring for use with source or destination directory on
      a remote host would include the strategic use of rsh or ssh.
      The good news is that because only a list of checksums is
      needed for comparison, the bandwidth needed between servers
      is minimized (like rsync).
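
One way the remote case might look, as a rough sketch only.  The
function name inventory and its optional host argument are my own
invention, and it assumes md5sum exists on whichever machine runs the
pipeline and that the directory name contains no single quotes.

```shell
#!/bin/sh
# inventory DIR [HOST] (hypothetical helper): print a sorted list of
# "checksum  ./path" lines for every regular file under DIR.
# If HOST is given, the whole pipeline runs on that host over ssh,
# so only the checksum list crosses the network, not the file data.
inventory() {
    dir=$1
    host=${2:-}
    if [ -n "$host" ]; then
        # Assumption: $dir contains no single quotes.
        ssh "$host" "cd '$dir' && find . -type f -exec md5sum {} + | LC_ALL=C sort"
    else
        (cd "$dir" && find . -type f -exec md5sum {} + | LC_ALL=C sort)
    fi
}
```

Something like "inventory /data > local.list" and
"inventory /data otherhost > remote.list" (names are placeholders)
would then give two lists that comm can compare locally.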

> Again, the rsync protocol could be useful in configuration management,
> for computing the "deltas" that must be stored.

CVS (or even RCS) is more useful for configuration management and
updates of text files.  It also archives changes over time.

As far as I'm aware (without looking at source code), rsync does
block-level comparisons, not line-by-line.
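
To illustrate what a block-level (rather than line-level) comparison
means, here is a simplified sketch.  It is not rsync's algorithm: it
checksums fixed-offset blocks with cksum, and the function names
block_sums and changed_blocks are made up for this example.

```shell
#!/bin/sh
# block_sums FILE [BLOCKSIZE] (hypothetical helper): print one
# "index checksum" line per fixed-size block of FILE.
block_sums() {
    file=$1
    bs=${2:-4096}
    # Arithmetic expansion strips any padding wc puts around the count.
    size=$(($(wc -c < "$file")))
    i=0
    while [ $((i * bs)) -lt "$size" ]; do
        # Read exactly one block at offset i*bs and checksum it.
        sum=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
        echo "$i ${sum%% *}"
        i=$((i + 1))
    done
}

# changed_blocks FILE1 FILE2 [BLOCKSIZE] (hypothetical helper): print
# the indices of blocks whose checksums differ, or that exist in only
# one of the two files.
changed_blocks() {
    a_sums=$(mktemp) || return 1
    b_sums=$(mktemp) || return 1
    block_sums "$1" "${3:-4096}" > "$a_sums"
    block_sums "$2" "${3:-4096}" > "$b_sums"
    # An identical block yields the same line in both lists, so it
    # appears twice after sorting and uniq -u drops it.
    LC_ALL=C sort "$a_sums" "$b_sums" | uniq -u | cut -d' ' -f1 | sort -n -u
    rm -f "$a_sums" "$b_sums"
}
```

With fixed offsets, inserting a single byte near the start of a file
makes every later block look changed.  As I understand it, rsync
avoids that by using a rolling checksum that can match a block at any
byte offset, which this sketch does not attempt.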

--
Eric Ziegast



