hypergeometric distribution

In probability theory, the binomial distribution and the hypergeometric distribution are two distributions used to find probabilities over repeated trials. The binomial distribution applies when the trials are independent, i.e. the outcome of one trial does not affect the next in any way. Examples include rolling a die, flipping a coin, or taking a jellybean out of a bag, noting its colour and replacing it. The hypergeometric distribution applies when each trial depends on the ones before it, such as taking a jellybean out of a bag and not replacing it.

Like the binomial distribution, the hypergeometric distribution only works when there are two outcomes to any trial: success or failure. However, success and failure can be broadly defined: a "success" can simply mean pulling a black jellybean out of a bag of jellybeans of many other colours, and "failure" can mean pulling any other colour out.
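To get a feel for the difference between replacing and not replacing, here is a rough simulation sketch in Python. The bag contents (10 red jellybeans out of 50) and the 20 draws are illustrative numbers, and the function names are made up for this sketch; `random.sample` draws without replacement and `random.choices` draws with replacement.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Illustrative bag: 10 red jellybeans and 40 of any other colour.
bag = ["red"] * 10 + ["other"] * 40

def reds_without_replacement(draws):
    # random.sample never picks the same jellybean twice
    return random.sample(bag, draws).count("red")

def reds_with_replacement(draws):
    # random.choices puts each jellybean back before the next draw
    return random.choices(bag, k=draws).count("red")

trials = 10_000
without = [reds_without_replacement(20) for _ in range(trials)]
with_ = [reds_with_replacement(20) for _ in range(trials)]

print("mean reds, no replacement:  ", sum(without) / trials)
print("mean reds, with replacement:", sum(with_) / trials)
print("max reds, no replacement:   ", max(without))
print("max reds, with replacement: ", max(with_))
```

The averages should come out about the same, but without replacement you can never see more reds than the bag actually holds, which is the dependence between draws showing up.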

So, what do I do?

You memorise and apply a formula. The easy bit: remembering when to apply the formula; the hard bit: remembering how to apply the formula. Mind you, it could be easier or harder for you, dear reader, depending on whether you're a math geek or not. I am one myself so I managed to teach this to myself, using Wikipedia, in under an hour.

The formula:

/m\ /N-m\
\k/ \n-k/
---------
   /N\
   \n/

WTF are all those parameters?

I'll use the jellybean example again. Say you have N items - I'll say we have 50 jellybeans. Now, m of those are distinctly different to the others - "defective" - and we'll say that 10 of the jellybeans are red. (I like red. I hate black.) You draw/select, randomly, n items - or 20 jellybeans here - and do not replace them. If you want to replace them, you want the binomial distribution instead. Now, what is the probability of drawing out exactly k "defective" items? We'll say we want 5 red jellybeans. So I just whack the numbers into the formula and be done with it. ...well, it would be useful to evaluate the formula.

N = 50
m = 10
n = 20
k = 5
The formula thus becomes:
          /10\ /50-10\
          \ 5/ \20-5 /
Pr(k=5) = ------------
              /50\
              \20/
Now, /n\ is the same as saying nCr, so:
     \r/
          (10C5)(40C15)
Pr(k=5) = -------------
             (50C20)
where C means "combinations".

You'll need a calculator and/or a lot of time to figure this one out, so I'm going to be nice and figure this one out for you. The answer is 0.215085... which is 21.509%. A low percentage, but not impossible.
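If you'd rather let a computer do the legwork, a minimal Python sketch could look like this. The function name `hypergeom_pmf` is made up for this example; `math.comb` is the standard library's nCr and needs Python 3.8 or later.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """Probability of exactly k "defective" items in n draws without
    replacement, from N items of which m are defective."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

# The jellybean example: N = 50, m = 10, n = 20, k = 5
p = hypergeom_pmf(5, 50, 10, 20)
print(f"Pr(k=5) = {p:.6f}")
```

The combinations are computed as exact integers and only the final division produces a float, so the big intermediate numbers do not lose precision.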

So there you have it. As an extra note, as with all probability distributions, the sum of all the probabilities in this circumstance (i.e. as k takes on every possible value) is 1, or 100%. The probabilities of every value of k in the jellybean example are as follows (to five significant figures):

Pr(k=0) = 0.0029249
Pr(k=1) = 0.027856
Pr(k=2) = 0.10826
Pr(k=3) = 0.22593
Pr(k=4) = 0.28006
Pr(k=5) = 0.21509
Pr(k=6) = 0.10341
Pr(k=7) = 0.030639
Pr(k=8) = 0.0053344
Pr(k=9) = 0.00049052
Pr(k=10) = 0.000017986
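A list like this can be reproduced, and the sum checked, with a short Python sketch. It uses `fractions.Fraction` so the total is exact rather than a rounded float; the variable names are just for illustration.

```python
from fractions import Fraction
from math import comb

N, m, n = 50, 10, 20  # the jellybean example

total = Fraction(0)
for k in range(0, min(m, n) + 1):
    # Exact probability of exactly k red jellybeans
    p = Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))
    total += p
    print(f"Pr(k={k}) = {float(p):.5g}")

print("sum =", total)
```

Because every term is an exact fraction, the final sum comes out as exactly 1 with no rounding error creeping in.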

If you want an example with some easier-to-understand fractions, try the case of there being five pairs of socks in your sock drawer, two of which are black and the other three white. You want a pair of black socks, so what are the odds of drawing exactly one pair of black socks when you pull out two pairs at random? Hence:

N = 5
m = 2
k = 1
n = 2
          /2\ /5-2\
          \1/ \2-1/
Pr(k=1) = ---------
              /5\
              \2/
          2 * 3
Pr(k=1) = -----
           10
= 6/10 = 3/5.
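Small cases like the sock drawer are easy to check exactly. Here is a sketch using Python's `fractions.Fraction`, which keeps the answer as a fraction in lowest terms; the variable names are only for illustration.

```python
from fractions import Fraction
from math import comb

N, m, n, k = 5, 2, 2, 1  # five pairs, two black, draw two, want exactly one black

numerator = comb(m, k) * comb(N - m, n - k)  # ways to get exactly k black pairs
denominator = comb(N, n)                     # all ways to pull out n pairs

p = Fraction(numerator, denominator)
print(f"{numerator}/{denominator} = {p}")
```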

So how does it work?

For convenience, the formula is listed again:

/m\ /N-m\
\k/ \n-k/
---------
   /N\
   \n/

Basically, there are NCn possible samples overall. A probability is the number of favourable outcomes - successes - divided by the total number of outcomes, favourable or not. Hence, NCn becomes the denominator. There are mCk ways of drawing the k defective items and (N-m)C(n-k) ways of filling out the rest of the sample with non-defective items. Any choice of the first kind can be paired with any choice of the second, so multiplying the two gives the number of favourable samples, which is the numerator.
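If you don't trust the counting argument, you can list every possible sample for a small case and count by brute force. The sketch below uses made-up numbers (6 items, 3 of them defective, draw 3, want exactly 2) and arbitrarily labels items 0 to m-1 as the defective ones.

```python
from itertools import combinations
from math import comb

N, m, n, k = 6, 3, 3, 2  # hypothetical small case, easy to enumerate
items = range(N)         # items 0..m-1 play the "defective" role

# Every possible sample of n items, ignoring order
samples = list(combinations(items, n))

# Keep the samples containing exactly k defective items
favourable = [s for s in samples if sum(1 for x in s if x < m) == k]

# The brute-force counts should agree with the formula's pieces
assert len(samples) == comb(N, n)
assert len(favourable) == comb(m, k) * comb(N - m, n - k)

print(len(favourable), "favourable samples out of", len(samples))
```

This only works for tiny cases, since the number of samples blows up quickly, but it is a decent sanity check that the numerator and denominator count what they claim to.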

One final note: restrictions on each of the parameters are as follows.

N can be any natural number. For those non-math-geeks, that's every positive whole number: [1,2,3,4...]
m can be any natural number below N (or zero or N itself, but you'd be mad to calculate probabilities of drawing defective items from a population with no distinctly different items...)
n can be any natural number below N as well. (Again, you'd be mad to make this parameter equal to zero or N.)
k can be any natural number below or equal to both m and n, or zero, and n-k cannot be more than N-m. Otherwise the combinations rule doesn't work for mCk or (N-m)C(n-k).
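These restrictions can be turned into a small checking function. This is only a sketch: the name `valid_k_range` is made up, and it allows the edge cases (m or n equal to zero or N) even though they're not very interesting.

```python
def valid_k_range(N, m, n):
    """Return (lowest, highest) values of k that make sense for N, m, n."""
    if N < 1:
        raise ValueError("N must be a natural number")
    if not 0 <= m <= N:
        raise ValueError("m must be between 0 and N")
    if not 0 <= n <= N:
        raise ValueError("n must be between 0 and N")
    # k can't exceed the defectives available (m) or the sample size (n);
    # n - k can't exceed the number of non-defectives (N - m).
    lowest = max(0, n - (N - m))
    highest = min(m, n)
    return lowest, highest

print(valid_k_range(50, 10, 20))  # the jellybean example
print(valid_k_range(50, 40, 20))  # hypothetical bag that is mostly red
```

In the second call the lower limit is above zero: with only 10 non-red jellybeans in the bag, a draw of 20 has to include at least 10 red ones.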

Think of me next time you eat a jellybean.
