[clug] finding duplicate sections in a text file
Steve Jenkin
sjenkin at canb.auug.org.au
Wed Jan 22 05:56:29 UTC 2020
Brett,
Wouldn’t have thought of that.
Lots of temp files aren't usually an issue these days - most POSIX file systems cope well with large directories. (In Ye Olden Times, some were notorious for performance problems above 'a certain size'.)
The suffix preserves original file order - neat.
‘fdupes’ - swiss army buzzsaw of DeDupe :)
thanks very much,
cheers
steve
> On 22 Jan 2020, at 16:42, Brett Worth <brett.worth at gmail.com> wrote:
>
> On 22/1/20 2:51 pm, steve jenkin via linux wrote:
>> - Anyone have a better algorithm?
>> E.g. using ‘git’ or another Version Control System
>> This is One Big File (done in sections), not many files in a directory, though I could try that next time.
>
> Better? Probably not.
>
> Here's my 5 minute solution:
>
> #!/bin/bash
>
> INFILE="$1"
> WORKDIR=$(mktemp -d)
>
> split --suffix-length=8 --lines=1 -d "${INFILE}" "${WORKDIR}/"
> fdupes -q -d -N "${WORKDIR}" >/dev/null
> cat "${WORKDIR}"/* > "${INFILE}.deduped"
> rm -rf "${WORKDIR}"
>
>
> Does use a lot of files. :-)
>
> Brett
>
> --
> -- /) _ _ _/_/ / / /__ _ _//
> -- /_)/</= / / (_(_//_//< ///
>
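PS: for anyone who'd rather skip the temp files entirely, the same effect as the split/fdupes pipeline above (per-line dedupe, first occurrence kept, original order preserved) can be had with a classic awk one-liner. The filenames here are just illustrative:

```shell
# Create a small sample file (hypothetical, for demonstration)
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' > sample.txt

# Keep only the first occurrence of each line, preserving order:
# seen[$0]++ is 0 (false) the first time a line appears, so
# !seen[$0]++ is true once per distinct line and awk prints it.
awk '!seen[$0]++' sample.txt > sample.txt.deduped

cat sample.txt.deduped
# prints:
# alpha
# beta
# gamma
```

It won't handle multi-line sections as they stand - for that you'd need a record separator or a hash-per-section approach - but for line-granularity dedupe it does the whole job in one pass.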
--
Steve Jenkin, IT Systems and Design
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA
mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin