[clug] finding duplicate sections in a text file

Steve Jenkin sjenkin at canb.auug.org.au
Wed Jan 22 05:56:29 UTC 2020


Brett,

Wouldn’t have thought of that.

Lots of temp files aren’t usually an issue these days - most POSIX file systems cope well with large directories. (In Ye Olden Times, some were notorious for performance problems above ‘a certain size’.)

The suffix preserves original file order - neat.
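
A quick sketch of why the reassembly works (the names demo.txt and part. below are just placeholders): with -d and a fixed suffix length, split’s numeric suffixes sort lexically back into input order, so a plain shell glob puts the pieces back together:

  printf 'a\nb\nc\n' > demo.txt
  split --suffix-length=8 --lines=1 -d demo.txt part.
  ls part.*    # part.00000000 part.00000001 part.00000002
  cat part.*   # prints a, b, c - the original order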

‘fdupes’ - the Swiss Army buzzsaw of DeDupe :)
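
For anyone finding this in the archives, those flags are -q (quiet), -d (delete duplicates) and -N (no prompt: keep the first file in each set, delete the rest). Leaving off -d makes a safe dry run - fdupes just lists the duplicate sets:

  fdupes -q "${WORKDIR}"    # report duplicate sets, delete nothing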

thanks very much,

cheers
steve
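
PS: for completeness, the classic no-temp-files way to drop duplicate lines while keeping first-seen order is the awk idiom below - a sketch only, untested against my One Big File:

  awk '!seen[$0]++' "$INFILE" > "$INFILE.deduped"

It should match the split/fdupes result, since fdupes -N also keeps the first copy in each duplicate set.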

> On 22 Jan 2020, at 16:42, Brett Worth <brett.worth at gmail.com> wrote:
> 
> On 22/1/20 2:51 pm, steve jenkin via linux wrote:
>> 	- Anyone have a better algorithm?
>> 		E.g.  using ‘git’ or another Version Control System
>> 		This is One Big File (done in sections), not many files in a directory, though I could try that next time.
> 
> Better?  Probably not.
> 
> Here's my 5 minute solution:
> 
> #!/bin/bash
> 
> # Split the input into one line per temp file, delete duplicate files,
> # then reassemble whatever survives.
> INFILE=$1
> WORKDIR=$(mktemp -d)
> 
> # 8-digit numeric suffixes, so the glob below sorts back into input order.
> split --suffix-length=8 --lines=1 -d "${INFILE}" "${WORKDIR}/"
> # Quietly delete duplicates, keeping the first file in each set.
> fdupes -q -d -N "${WORKDIR}" >/dev/null
> cat "${WORKDIR}"/* > "${INFILE}.deduped"
> rm -rf "${WORKDIR}"
> 
> 
> Does use a lot of files. :-)
> 
> Brett
> 
> -- 
> --  /) _ _ _/_/ / / /__ _ _//
> -- /_)/</= / / (_(_//_//< ///
> 

--
Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin


