[clug] some bash/awk scripting. use MD5's (md5sum) to find and collapse duplicate files

steve jenkin sjenkin at canb.auug.org.au
Mon Jan 23 21:54:57 MST 2012


I needed to create this for myself; others may find it useful.
There may well be better and more well known solutions out there...
I didn't look.

The Perl mavens on-list may like to rewrite it.
Just the sort of job Larry invented it for :-)

<http://members.tip.net.au/~sjenkin/code/dedup.tar>

The README:

Dedup.tar
Steve Jenkin Tue 24 Jan 2012 15:30:54 EST
mailto:sjenkin at tip.net.au

http://members.tip.net.au/~sjenkin/code/dedup.tar

## A set of scripts that use MD5's to find duplicate files
   and then collapse duplicates with 'hard links'
   [see POSIX filesystem semantics].

   The 'gen_rm_ln' script is awk [#!/usr/bin/awk]
   The others use /bin/bash.
   You'll need 'rm_ln' in your path.
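
For readers who want the idea without unpacking the tar, here is a minimal
sketch of the duplicate-finding step. It is NOT the md5_dupfl script: the
function name list_dups is made up for illustration, and it assumes GNU
md5sum (on OS X the command is md5, with a different output format).

```shell
#!/bin/bash
# Hypothetical helper, not part of dedup.tar.
# Print the md5sum line of every file under the given directories whose
# checksum is shared with at least one other file.
list_dups() {
    find "$@" -type f -exec md5sum {} + |
        sort |
        awk '
            # md5sum prints 32 hex digits first; after sort, equal
            # checksums sit on adjacent lines.
            { sum = substr($0, 1, 32) }
            sum == prev { if (!shown) print prevline; print; shown = 1 }
            sum != prev { shown = 0 }
            { prev = sum; prevline = $0 }'
}
```

Run it as `list_dups <dir>`. File names containing a backslash or newline
are escaped by md5sum, so this sketch should not be trusted with them.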

   These scripts are written to test for many errors and 'be safe',
   BUT - caveat emptor. I'm sure there are edge cases I've missed.
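
To give a feel for what 'be safe' can mean when collapsing a duplicate,
here is a sketch of the checks one might make before replacing a file with
a hard link. It is NOT the rm_ln script from the tar file: the name
safe_link and the particular checks are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical sketch, not the rm_ln shipped in dedup.tar.
# Replace "$2" with a hard link to "$1", refusing anything that looks unsafe.
safe_link() {
    local keep=$1 dup=$2 tmp
    # Both names must be regular files, not symlinks.
    [ -f "$keep" ] && [ ! -L "$keep" ] || { echo "not a regular file: $keep" >&2; return 1; }
    [ -f "$dup" ]  && [ ! -L "$dup" ]  || { echo "not a regular file: $dup" >&2; return 1; }
    # Already the same inode: nothing to do.
    [ "$keep" -ef "$dup" ] && return 0
    # An MD5 match is only a hint; compare the bytes before touching anything.
    cmp -s -- "$keep" "$dup" || { echo "contents differ: $keep $dup" >&2; return 1; }
    # Make the link under a temporary name first. ln fails across
    # filesystems, and in that case the duplicate is left untouched.
    tmp="$dup.link.$$"
    ln -- "$keep" "$tmp" || return 1
    # Rename over the duplicate, so there is no moment with the name missing.
    mv -f -- "$tmp" "$dup"
}
```

Linking to a temporary name and renaming, rather than rm followed by ln,
means a failure part-way through cannot leave you with the file gone.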

WARNING:   The scripts aren't "security hardened" - they could be abused
           by hackers.

If you are *brave* then this pipeline will work:
  md5_dupfl <dir> | gen_rm_ln | xargs -L 1 bash

(where <dir> is the directory/directories you want to dedup)
  For options, see
        md5_dupfl -h

When I run this with xargs on OS X, I get an error related to /bin/echo,
but it still does the work.

What I use:
  md5_dupfl Nokia_Rigntones | gen_rm_ln >dedup.sh
  <inspect flist, rerun etc>
  sh dedup.sh
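
The same 'generate, inspect, then run' idea can be sketched with a small
awk generator. This is NOT the gen_rm_ln script: the name gen_links, the
input format (sorted md5sum output) and the use of 'ln -f' are assumptions
made for the example.

```shell
#!/bin/bash
# Hypothetical generator, not the gen_rm_ln shipped in dedup.tar.
# Reads sorted "checksum  name" lines on stdin and writes one ln command
# per duplicate, linking it to the first file seen with that checksum.
gen_links() {
    awk -v q="'" '
        # Wrap s in single quotes, escaping embedded single quotes for sh.
        function shquote(s,    n, parts, i, out) {
            n = split(s, parts, q)
            out = parts[1]
            for (i = 2; i <= n; i++) out = out q "\\" q q parts[i]
            return q out q
        }
        length($0) < 35 { next }          # skip anything too short to be md5sum output
        {
            sum = substr($0, 1, 32); name = substr($0, 35)
            if (sum == prev) print "ln -f -- " shquote(first) " " shquote(name)
            else { prev = sum; first = name }
        }'
}
```

For example, `md5sum * | sort | gen_links >dedup.sh` writes a script you
can read through before running it with `sh dedup.sh`. Nothing is changed
on disk until that last step.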

## Contents of tar file

-rw-r--r-- steve/steve    1279 2012-01-24 15:48 README
-rwxr-xr-x steve/steve    2118 2012-01-24 15:44 bin/md5_dupfl
-rwxr-xr-x steve/steve    1628 2012-01-24 13:30 bin/gen_rm_ln
-rwxr-xr-x steve/steve    3638 2012-01-21 19:13 bin/rm_ln


-- 
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin

