[clug] finding duplicate sections in a text file

Brenton Ross rossb at fwi.net.au
Wed Jan 22 04:27:35 UTC 2020


The sort program on a modern machine should be able to handle a file
of that size almost instantly. I would skip the CRC stuff and just sort
the file [unless I am missing something?].
I would ask, though: are the line breaks guaranteed to be the same in
the repeated sections?
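
A minimal sketch of the sort-only approach, in Python rather than with
sort/uniq so that the original line numbers are kept. It compares the
line text directly, with no checksum. The names find_duplicate_lines
and separators are made up for illustration, and it only finds repeats
if the repeated sections are wrapped identically.

```python
from itertools import groupby


def find_duplicate_lines(lines, separators=()):
    """Return {line text: [1-based line numbers]} for lines seen more than once.

    Blank lines and any exact matches in `separators` (a hypothetical
    parameter for section-separator lines) are ignored.
    """
    numbered = [(line.rstrip("\n"), n) for n, line in enumerate(lines, start=1)]
    kept = [(text, n) for text, n in numbered
            if text.strip() and text not in separators]
    # Sorting puts identical lines next to each other, as `sort` would.
    kept.sort()
    duplicates = {}
    for text, group in groupby(kept, key=lambda pair: pair[0]):
        numbers = [n for _, n in group]
        if len(numbers) > 1:  # same idea as `uniq -c` with count > 1
            duplicates[text] = numbers
    return duplicates
```

Passing an open file handle as `lines` works, since iterating over a
file yields one line at a time.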
Brenton

On Wed, 2020-01-22 at 14:51 +1100, steve jenkin via linux wrote:
> I’ve a file ~50,000 lines that I want to check for repeated content.
> Decided to use a checksum, chose crc-32 because the lines are short,
> don’t need 128-bits of MD5 or 160-bits of SHA1 and don’t like the
> simple sum/cksum utils.
> My script calculates a checksum per line, then sorts them & counts
> uniques (uniq -c), looks for any counts >1, and ignores blank lines and
> section separators. Created a file of {line number in orig file, crc,
> line from original file} which I edited (starting at the bottom)
> together with the original file, removing duplicates.
> Had a script to do this calling crc32 once per line - worked but took
> a while to grind through the file. Approach wouldn’t work for larger
> files (eg 1M lines).
> POSIX system I happen to be using has ‘crc32’ implemented as a Perl
> script. Took this and modified it to do some extra things (per line,
> hex, octal, decimal & binary + STDIN). Much faster :)
> 
> I’m not a Perl programmer, so the code works, but I can’t say it’s
> great [uploaded as .txt so browsers display it]:
> <http://members.tip.net.au/~sjenkin/crc32.txt>
> Three questions:
>   - Anyone have a better algorithm? E.g. using ‘git’ or another
>     Version Control System. This is One Big File (done in sections),
>     not many files in a directory, though I could try that next time.
>   - Is there a ‘line-by-line’ CRC or other checksum tool out there?
>     I couldn’t find one STFW.
>   - Could my Perl solution be improved? Not a language I’m good in.
> cheers
> steve
> --
> Steve Jenkin, IT Systems and Design
> 0412 786 915 (+61 412 786 915)
> PO Box 38, Kippax ACT 2615, AUSTRALIA
> mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin
> 
> 
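
On the question above about a line-by-line checksum tool: a minimal
sketch, assuming Python is available, that computes CRC-32 for every
line in a single process instead of calling crc32 once per line. It
uses zlib.crc32 from the standard library. The function name
crc32_per_line is made up for illustration. The checksum covers the
line without its trailing newline, so the values will not match those
from running a crc32 command on a file holding that line plus a
newline.

```python
import zlib


def crc32_per_line(lines, encoding="utf-8"):
    """Yield (line number, CRC-32 as 8 hex digits, line text) for each line."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")  # checksum excludes the newline (an assumption)
        checksum = zlib.crc32(text.encode(encoding))
        yield number, f"{checksum:08x}", text
```

The output has the same shape as the {line number, crc, line} file
described in the original message, so it could be fed to the same
sort and count step.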

