[clug] finding duplicate sections in a text file

Elena Williams ele.wil at gmail.com
Wed Jan 22 09:17:38 UTC 2020


Hmm, I am a programmer and (seeing as you mentioned Perl was on the
table) I'd immediately go to Python's `collections` module, which is part
of the standard library. Its `Counter` class is designed for exactly this
kind of counting problem.

I can't say it better than it's been said here:
https://stackoverflow.com/a/16348348
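
For what it's worth, here is a minimal sketch of the `collections.Counter`
approach. The function name `find_duplicate_lines` and the `is_ignored`
hook are made up for illustration, and I'm assuming blank lines are the
only thing to skip; your section separators would need their own test
passed in via `is_ignored`. No checksums are needed, because the lines
themselves serve as the dictionary keys.

```python
from collections import Counter


def find_duplicate_lines(lines, is_ignored=lambda s: not s.strip()):
    """Return {line: count} for every non-ignored line seen more than once.

    `is_ignored` is a hypothetical hook: by default it skips blank lines;
    pass your own predicate to also skip section separators.
    """
    counts = Counter(
        line.rstrip("\n") for line in lines if not is_ignored(line)
    )
    return {line: n for line, n in counts.items() if n > 1}


# Small in-memory sample standing in for the real file.
sample = ["alpha\n", "beta\n", "\n", "alpha\n", "\n", "gamma\n", "beta\n", "alpha\n"]
duplicates = find_duplicate_lines(sample)

for line, n in duplicates.items():
    print(n, line)
```

For a real file you would pass the open file object instead of `sample`,
since iterating over a file yields its lines one at a time.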

The question is what you want to do with the dups. Do you just want to
eyeball them, or do you want to cull them?
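
Both options are only a few lines each. This is a rough sketch, not a
finished tool: `duplicate_locations` and `cull_duplicates` are names I've
invented, and I'm assuming "cull" means "keep the first occurrence and
drop later ones" and that blank lines should always be left alone.

```python
from collections import defaultdict


def duplicate_locations(lines):
    """Eyeball mode: map each repeated line to its 1-based line numbers."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.strip():  # skip blank lines
            seen[text].append(number)
    return {text: nums for text, nums in seen.items() if len(nums) > 1}


def cull_duplicates(lines):
    """Cull mode: keep the first occurrence of each non-blank line."""
    seen = set()
    kept = []
    for line in lines:
        text = line.rstrip("\n")
        if text.strip() and text in seen:
            continue  # a repeat of an earlier non-blank line: drop it
        seen.add(text)
        kept.append(line)
    return kept


# Small in-memory sample standing in for the real file.
sample = ["alpha\n", "beta\n", "\n", "alpha\n", "\n", "beta\n"]
locations = duplicate_locations(sample)
culled = cull_duplicates(sample)
```

The line numbers from `duplicate_locations` give you the same "where is it
in the original file" information as the hand-built
{line number, crc, line} listing, without a separate checksum step.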

Github: elena <http://github.com/elena/>


On Wed, 22 Jan 2020 at 17:21, Kim Holburn via linux <linux at lists.samba.org>
wrote:

> I'm not providing a solution here, but this reminds me very much of how
> some compression algorithms work.
>
> > On 2020/Jan/22, at 2:51 pm, steve jenkin via linux <linux at lists.samba.org> wrote:
> >
> > I’ve a file ~50,000 lines that I want to check for repeated content.
> >
> > Decided to use a checksum, chose crc-32 because the lines are short,
> > don’t need 128-bits of MD5 or 160-bits of SHA1 and don’t like the simple
> > sum/cksum utils.
> >
> > script calculates checksum per line, then sorts them & count uniques (uniq -c),
> > looks for any counts >1, ignore blank lines and section separators.
> > Created a file of {line number in orig file, crc, line from original
> > file} which I edited (starting at the bottom) together with original file,
> > removing duplicates.
> >
> > Had a script to do this calling crc32 once per line - worked but took a
> > while to grind through file.
> > Approach wouldn’t work for larger files (eg 1M lines).
> >
> > POSIX system I happen to be using has ‘crc32’ implemented as a PERL
> > script.
> > Took this and modified to do some extra things (per line, hex, octal,
> > decimal & binary + STDIN).
> > Much faster :)
> >
> >
> > I’m not a PERL programmer, so code works, but can’t say it’s great
> > [uploaded as .txt so browsers display it]
> > <http://members.tip.net.au/~sjenkin/crc32.txt>
> >
> > Three questions:
> >
> >       - Anyone have a better algorithm?
> >               E.g.  using ‘git’ or another Version Control System
> >               This is One Big File (done in sections), not many files in
> >               a directory, though I could try that next time.
> >
> >       - Is there a ‘line-by-line’ CRC or other checksum tool out there?
> >               I couldn’t find one STFW.
> >
> >       - Could my PERL solution be improved?
> >               Not a language I’m good  in.
> >
> > cheers
> > steve
> >
> > --
> > Steve Jenkin, IT Systems and Design
> > 0412 786 915 (+61 412 786 915)
> > PO Box 38, Kippax ACT 2615, AUSTRALIA
> >
> > mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin
> >
> >
> > --
> > linux mailing list
> > linux at lists.samba.org
> > https://lists.samba.org/mailman/listinfo/linux
>
> --
> Kim Holburn
> IT Network & Security Consultant
> T: +61 2 61402408  M: +61 404072753
> mailto:kim at holburn.net  aim://kimholburn
> skype://kholburn - PGP Public Key on request
>

