[clug] Substring repetition detection

James McNeill James at heague.com.au
Wed Jun 25 16:32:37 EST 2003


I wrote something like this ages ago. I'd never be able to find the code,
and I think it was in some disgusting visual language anyway.

The trouble was with things like:

abcdefghdefghdefghijk,abcdefghdefghdefghijk,abcdefghdefghdefghijk

should this come out as

abcdefghdefghdefghijk * 3    or

(a,b,c,defgh*3,i,j,k) * 3

you get the idea.

Is there a maximum length of the pattern?

Kim's right, all gzip does is find patterns in data. It'll find allot more
than just the linear repartitions that you want though.

you could try to cannibalize some of it's algorithm, but good luck to you.

>
> [ Before anyone asks, this isn't my school home work ]
>
> Imagine I have a string:
>
> abcdefghdefghdefghijk
>
> I would like an algorithm to turn this into
>
> a
> b
> c
> defgh * 3
> i
> j
> k
>
> In other words, I'm interested in an algorithm which can detect repetitive
> substrings. The repetitions are always consecutive (otherwise I don't
> care). The repetitions can however be of varying length.
>
> Anyways, I can't think of a good answer. Any hints?
>
> Thanks,
> Mikal
>
> --
>
> Michael Still (mikal at stillhq.com) | Stage 1: Steal underpants
> http://www.stillhq.com            | Stage 2: ????
> UTC + 10                          | Stage 3: Profit
>




More information about the linux mailing list