Verifying uniform distribution of random numbers

Gordon Deane Gordon.Deane at wintermute.anu.edu.au
Tue Apr 9 16:19:41 EST 2002


Brad Hards wrote:
> Problem:
> I am writing unit tests for the zeroconf IP tool (zcip), and there is a spec
> requirement that says "When a host wishes to configure a link-local address,
> it selects an address using a pseudo-random number generator with a uniform
> distribution in the range from 169.254.1.0 to 169.254.254.255."
> 
> In addition to not paying much notice this time round, I was a lousy student
> of stats, and while I think that there is some statistical test
> (chi-squared?) to test hypothesis about distributions, I know nothing about
> it. I don't even have the textbook (which might explain why I nearly failed
> stats :)

Disclaimer: I am not a statistician.  Have a look at "man 3 rand" for further
references as well.

The uniform distribution on a finite space is exceptionally simple.  I doubt you
need hypothesis testing for arbitrary distributions; if your bin counts end up
roughly even after a few million runs or so, you are probably as uniform as
anyone would reasonably expect.  A few % deviation on a particular IP is
unlikely to be an issue, is it?  At first glance, the libc random number
generator ought to be good enough to meet this spec.  Of course, your mapping
from its output onto your 65024 bins needs to preserve the uniformity: a bare
rand() % 65024 is very slightly biased whenever rand()'s range is not an exact
multiple of 65024.
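
Something like the following would do the mapping; this is only a sketch on top
of libc rand(), where the rejection loop throws away the top sliver of rand()'s
range so the final modulo doesn't favour the low bins:

#include <stdlib.h>

/* Pick one of the 65024 addresses 169.254.1.0 - 169.254.254.255
 * uniformly.  A bare rand() % 65024 slightly favours the low bins
 * whenever RAND_MAX+1 is not a multiple of 65024; rejecting the top
 * sliver of rand()'s range first removes that bias. */
static unsigned long random_linklocal(void)
{
    const unsigned long nbins = 254UL * 256UL;            /* 65024 */
    const unsigned long span  = (unsigned long)RAND_MAX + 1UL;
    const unsigned long limit = span - (span % nbins);
    unsigned long r;

    do
        r = (unsigned long)rand();
    while (r >= limit);                                   /* rejection step */
    r %= nbins;

    /* 169.254.(1 + r/256).(r % 256), as a host-order 32-bit value */
    return (169UL << 24) | (254UL << 16) | ((1UL + r / 256) << 8) | (r % 256);
}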


That raises the question of how you seed the generator.  Does it matter whether
a malicious observer could guess the IP you create?  If you do this at boot,
there may not be much genuinely new entropy in /dev/random, although if you can
count on having already restored the saved entropy pool, that would be OK.

Assuming it doesn't, the system clock would be another choice.  I wonder,
though, what happens if I have 60 identical computers and the power fails (too
long for a UPS to cope, but we'll assume an orderly shutdown).  When they come
back up, it is likely to be at very much the same time, perhaps even to the
microsecond in some cases, yet they can't all end up with the same IP.  I
suppose the Ethernet collision avoidance ultimately only allows one at a time
to make an ARP request.
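
One way out of the identical-boot-time trap is to fold something host-unique,
such as the interface's MAC address, into the seed along with the clock and
PID.  A rough sketch, assuming Linux (the SIOCGIFHWADDR ioctl is
Linux-specific, and the interface name is whatever zcip is configuring):

#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Seed rand() from the interface's MAC address mixed with the clock
 * and PID, so identical machines booting at the same instant still
 * diverge. */
static void seed_from_mac(const char *ifname)
{
    struct ifreq ifr;
    unsigned int seed;
    int fd, i;

    seed = (unsigned int)time(NULL) ^ (unsigned int)getpid();

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd >= 0) {
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        if (ioctl(fd, SIOCGIFHWADDR, &ifr) == 0)
            for (i = 0; i < 6; i++)     /* fold in all six MAC bytes */
                seed = seed * 31 + (unsigned char)ifr.ifr_hwaddr.sa_data[i];
        close(fd);
    }
    srand(seed);
}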

I had to get BOOTP working on some embedded PCs once, and it was more than
likely that a row of 60 would come on-line simultaneously.  In that case the
"identities" were determined by a set of external hardware jumpers, so the
address could be set accordingly, and BOOTP required indefinitely many retries
with a randomly perturbed exponential backoff, so they would all manage
eventually.  I'm not sure how Etherboot seeded its generator for this.


As Jeremy points out, you could get a uniform distribution just by counting
from 1 to N.  The problem is that you want to avoid conflicting with existing
devices on the network without querying first, and we have to assume they used
the same algorithm.  The spec writers have figured that a random guess is the
quickest way of doing this, assuming you can come up with a different seed from
the other devices.  If it fails, I gather you just brute-force your way through
until you succeed.
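
In outline, that loop is nothing more than this (arp_probe() is a stand-in for
whatever conflict detection zcip actually does; I haven't read its source):

/* Pick a random candidate, probe for a conflict, retry with a fresh
 * pick until one sticks. */
unsigned long addr;
do
    addr = random_linklocal();     /* uniform pick, as sketched earlier */
while (arp_probe(addr) != 0);      /* nonzero: address already in use */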


> A google search showed a lot of theoretical papers (and a surprising amount of
> code), but it all seems to be oriented to crypto, not simple stuff like this.

Yes, because in crypto a number generator that can be second-guessed, or that
isn't quite uniform, might seriously weaken the security.  That's unlikely to
be an issue here.  Only worry about such details where you need to.  Remember
that the better the RNG, the slower it is likely to be and the more code it
involves; ask yourself whether you really need to bother.

> The question is "how close is close enough for each bin?", given that there is
> no confidence levels in the IETF draft, so "something reasonable" would have
> to be assumed. If anyone has played with this, I'd be keen to hear some
> guidance. Code would be even more appreciated :) Code with comments about
> what the results mean would be ultra-appreciated.

A statistician could give you a better answer, but in the absence of confidence
levels I would have thought that "near enough is good enough", and I would
wager that was the intent of the spec writers.

If the RNG is any good, each bin count is binomial, which for this many trials
is approximately normal, and you can calculate the variance and standard
deviation.

The theoretical mean is n*p = num trials / num bins
Theoretical variance = n*p*(1-p) = trials * (1/bins) * (1 - 1/bins)
                                 ~= trials * 1.5e-5
Std. deviation = sqrt(variance) ~= sqrt(trials) * 3.9e-3
so for 100 million (1e8) trials, you would expect a mean of 1537.9 and a std.
dev of about 39.

The chance of any given bin landing outside +-6 std dev is vanishingly small,
and about 95% of bins should be inside 2 std dev, i.e. 1538 +- 78.  Bear in
mind that with 65024 bins, a few will stray past 4 std dev by pure chance, so
set a tolerance of, say, +-200 (about 5 std dev), call that good enough, and
anything outside it really would be non-uniform.

In your unit tests, one good check would simply be to verify that each IP can
actually occur (no bin is empty), hence that your RNG has full coverage; that's
more likely to be a problem than lack of uniformity where it does cover.
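
Putting the two checks together, a unit test might look something like this
(pick_bin() is a stand-in for your address-to-bin mapping, and the +-200
tolerance is just my "reasonable" guess from the numbers above):

#include <stdio.h>
#include <stdlib.h>

#define NBINS  65024UL            /* 254 * 256 possible addresses */
#define TRIALS 100000000UL        /* 1e8 draws, as in the numbers above */

extern unsigned int pick_bin(void);  /* stand-in: your mapping, 0..65023 */

int main(void)
{
    static unsigned long bins[NBINS];
    const long mean = (long)(TRIALS / NBINS);   /* ~1538 */
    const long tol  = 200;        /* ~5 std devs (std dev ~39), generous */
    unsigned long i;
    int failed = 0;

    srand(12345);                 /* fixed seed so the test is reproducible */
    for (i = 0; i < TRIALS; i++)
        bins[pick_bin()]++;

    for (i = 0; i < NBINS; i++) {
        if (bins[i] == 0) {       /* coverage: every address must be reachable */
            printf("bin %lu never occurred\n", i);
            failed = 1;
        } else if (labs((long)bins[i] - mean) > tol) {
            printf("bin %lu: count %lu outside %ld +- %ld\n",
                   i, bins[i], mean, tol);
            failed = 1;
        }
    }
    return failed;
}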


Just my 10c,

Gordon

-- 
Gordon Deane  <Gordon.Deane at maths.anu.edu.au>
Mathematics Honours
Australian National University, Canberra
Rm JD104A  (02) 612-50003



