[clug] finding duplicate sections in a text file

Kim Holburn kim.holburn at gmail.com
Wed Jan 22 06:05:01 UTC 2020


I'm not providing a solution here, but this reminds me very much of how some compression algorithms work.  
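
Purely to illustrate the analogy, and not as a worked solution: many dictionary-style compressors find repeats by hashing a short run of input and looking it up in a table of earlier positions. The same idea can be applied to runs of lines rather than bytes. The sketch below is in Python; the function name `repeated_runs` and the `window` parameter are made up for this example, and it assumes the whole file fits in memory as a list of lines.

```python
import zlib


def repeated_runs(lines, window=3):
    """Yield (earlier_start, later_start) for each run of `window`
    consecutive lines that already appeared earlier (1-based numbers)."""
    first_seen = {}  # CRC-32 of a window -> index where it first started
    for i in range(len(lines) - window + 1):
        chunk = lines[i:i + window]
        key = zlib.crc32("\n".join(chunk).encode("utf-8"))
        j = first_seen.setdefault(key, i)
        # Compare the text itself so a CRC collision is never reported.
        if j != i and lines[j:j + window] == chunk:
            yield (j + 1, i + 1)
```

Each hit says that the `window` lines starting at the second number repeat the lines starting at the first. Overlapping windows are reported separately, so a long repeated section shows up as a series of consecutive hits.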

> On 2020/Jan/22, at 2:51 pm, steve jenkin via linux <linux at lists.samba.org> wrote:
> 
> I’ve a file of ~50,000 lines that I want to check for repeated content.
> 
> Decided to use a checksum and chose CRC-32: the lines are short, so I don’t need the 128 bits of MD5 or the 160 bits of SHA-1, and I don’t like the simple sum/cksum utils.
> 
> The script calculates a checksum per line, then sorts them and counts uniques (uniq -c),
> looks for any counts >1, and ignores blank lines and section separators.
> I created a file of {line number in orig file, crc, line from original file}, which I edited (starting at the bottom) together with the original file, removing duplicates.
> 
> Had a script to do this by calling crc32 once per line. It worked but took a while to grind through the file.
> That approach wouldn’t work for larger files (e.g. 1M lines).
> 
> The POSIX system I happen to be using has ‘crc32’ implemented as a Perl script.
> Took this and modified it to do some extra things (per line, hex, octal, decimal & binary + STDIN).
> Much faster :)
> 
> 
> I’m not a Perl programmer, so the code works, but I can’t say it’s great
> [uploaded as .txt so browsers display it]
> <http://members.tip.net.au/~sjenkin/crc32.txt>
> 
> Three questions:
> 
> 	- Anyone have a better algorithm? 
> 		E.g. using ‘git’ or another version control system.
> 		This is One Big File (done in sections), not many files in a directory, though I could try that next time.
> 
> 	- Is there a ‘line-by-line’ CRC or other checksum tool out there?
> 		I couldn’t find one by searching the web.
> 
> 	- Could my Perl solution be improved?
> 		Not a language I’m good at.
> 
> cheers
> steve
> 
> --
> Steve Jenkin, IT Systems and Design 
> 0412 786 915 (+61 412 786 915)
> PO Box 38, Kippax ACT 2615, AUSTRALIA
> 
> mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin
> 
> 
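
For what it's worth, the per-line approach described above (checksum every line, then report any checksum seen more than once, skipping blank lines and section separators) can be done in a single pass with one process. This is only a rough sketch in Python, not a rewrite of the linked Perl script. The function name `find_duplicate_lines` and the `separators` default are placeholders, since the real section markers aren't given in the post.

```python
import sys
import zlib
from collections import defaultdict


def find_duplicate_lines(lines, separators=("--",)):
    """Group 1-based line numbers by the CRC-32 of each line's text.

    Blank lines and anything in `separators` are skipped. `separators`
    is a hypothetical stand-in for the file's real section markers.
    """
    seen = defaultdict(list)  # (crc, text) -> [line numbers]
    for lineno, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip() or text.strip() in separators:
            continue
        crc = zlib.crc32(text.encode("utf-8"))
        # Key on the text as well, so a CRC collision is never reported
        # as a duplicate.
        seen[(crc, text)].append(lineno)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        dups = find_duplicate_lines(fh)
    # Print {crc, line numbers, text}, ordered by first occurrence.
    for (crc, text), nums in sorted(dups.items(), key=lambda kv: kv[1][0]):
        print(f"{crc:08x}  {','.join(map(str, nums))}  {text}")
```

Because the lines are held in a dictionary anyway, the CRC is not strictly needed here and the text alone would do as the key. It is kept so the output resembles the {line number, crc, line} file described above.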

-- 
Kim Holburn
IT Network & Security Consultant
T: +61 2 61402408  M: +61 404072753
mailto:kim at holburn.net  aim://kimholburn
skype://kholburn - PGP Public Key on request 




