[clug] finding duplicate sections in a text file
steve jenkin
sjenkin at canb.auug.org.au
Wed Jan 22 03:51:16 UTC 2020
I’ve a file of ~50,000 lines that I want to check for repeated content.
Decided to use a checksum and chose CRC-32: the lines are short, I don’t need the 128 bits of MD5 or the 160 bits of SHA-1, and I don’t like the simple sum/cksum utils.
The script calculates a checksum per line, sorts them and counts uniques (uniq -c),
then looks for any counts >1, ignoring blank lines and section separators.
Created a file of {line number in the original file, crc, line from the original file}, which I edited (starting at the bottom) alongside the original file, removing duplicates.
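To illustrate the workflow, a one-pass sketch of that listing might look like the following. (Just a sketch, not my actual script: it assumes Compress::Raw::Zlib, which ships with modern Perl, for the CRC; the separator pattern is only a placeholder; and the hash stands in for the sort | uniq -c step.)

#!/usr/bin/perl
# Sketch: one pass over the file, emitting {line number, crc, line}
# for every line whose CRC-32 occurs more than once.
# %count replaces the sort | uniq -c counting step.
use strict;
use warnings;
use Compress::Raw::Zlib ();   # crc32() - bundled with modern Perl

my ( %count, @records );

while ( my $line = <> ) {
    chomp $line;
    next if $line =~ /^\s*$/;      # ignore blank lines
    next if $line =~ /^-{3,}$/;    # ignore section separators (placeholder pattern)
    my $crc = Compress::Raw::Zlib::crc32($line);
    $count{$crc}++;
    push @records, [ $., $crc, $line ];   # $. is the input line number
}

# print {line number, crc, line} for editing against the original file
for my $r (@records) {
    printf "%6d %08x %s\n", @$r if $count{ $r->[1] } > 1;
}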
Had a script to do this that called crc32 once per line - it worked, but took a while to grind through the file.
That approach wouldn’t work for larger files (e.g. 1M lines).
The POSIX system I happen to be using has ‘crc32’ implemented as a Perl script.
Took this and modified it to do some extra things (checksum per line; hex, octal, decimal and binary output; reading STDIN).
Much faster :)
I’m not a Perl programmer, so the code works, but I can’t say it’s great.
[uploaded as .txt so browsers display it]
<http://members.tip.net.au/~sjenkin/crc32.txt>
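The shape of it is roughly this (again only a sketch of a per-line filter, not the uploaded script; same Compress::Raw::Zlib assumption):

#!/usr/bin/perl
# Sketch of a per-line CRC-32 filter: reads STDIN (or files named on
# the command line) and prints one checksum per input line, in
# decimal, hex, octal and binary.
use strict;
use warnings;
use Compress::Raw::Zlib ();

while ( my $line = <> ) {
    chomp $line;
    my $crc = Compress::Raw::Zlib::crc32($line);
    printf "%10u %08x %011o %032b\n", ($crc) x 4;
}

Piping that output through sort | uniq -c and looking for counts >1 picks out the repeated checksums, as described above.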
Three questions:
- Anyone have a better algorithm?
E.g. using ‘git’ or another Version Control System
This is One Big File (done in sections), not many files in a directory, though I could try that next time.
- Is there a ‘line-by-line’ CRC or other checksum tool out there?
I couldn’t find one STFW.
- Could my Perl solution be improved?
Not a language I’m good in.
cheers
steve
--
Steve Jenkin, IT Systems and Design
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA
mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin