
Count Who?

A pseudocount is a sophisticated statistical technique or an ugly hack for probabilistic modeling, depending on how you look at it. It makes the model assign a non-zero probability to every combination of events, even those which were never observed in the data set used to construct it! The basic technique is to add some small number (typically "1" or "0.001") either to all counts or to all zero counts before normalizing the counts to get probabilities. In practice, the precise technique used doesn't matter too much, and the exact value added isn't too important, either.

Why count?

Say you decide to build some classifier. Say you have a world X, divided into the "interesting cases" I⊂X and the "boring cases" X\I. You examine cases belonging to the world X. A classifier C is a statistical algorithm which assigns some value C(x) to every x∈X, such that C(x) is high for "most" x∈I and low for "most" x∈X\I. You construct a classifier by examining some (labelled) subsets of examples I'⊂I and J'⊂X\I. That is, you learn by constructing a probabilistic model of the world X and its "interesting" subset I. A classifier is good if there is a high correlation between C(x) and the event x∈I (several definitions are possible, but all are roughly equivalent for our purposes here).

For purposes of this discussion, I shall use a particular example. Our world will consist of RNA sequences (elements of {A,C,G,U}*). We will want to distinguish between RNA sequences (which have a strong directionality) and their reverse complements, given by reversing the order of the letters and swapping A<->U and C<->G. For this purpose, X will be the set of all naturally-occurring RNA sequences and their reverse complements (which we will "approximate" by the set of all possible RNA sequences). I will be the interesting set of all RNA sequences (uninverted). To be more in line with the actual technology of molecular biology, which uses ESTs, we can consider just sequences of lengths 300-500 occurring anywhere within such an RNA sequence; this too doesn't matter too much. Thus, a classifier is a function C(x) which tells us whether the cellular machinery reads x forwards or backwards.

Surprisingly, it turns out that a good probability distribution on our world is the uniform distribution on all sequences. That is, in every position of the sequence, every letter A,C,G,U occurs with equal probability and independently of all other positions. While decidedly unrealistic, this model doesn't prefer either I or X\I (all 4 letters occur with roughly equal probabilities), and hence leads to a workable classifier.

A good model to take is a Markov chain with "memory" 6. That is, given the previous 6 letters (say ACGUGC), the model assigns a probability to each of the 4 possible "next" letters. For instance, a model might say that after ACGUGC, "A" appears with probability 0.3, "C" with probability 0.1, "G" with probability 0.25 and "U" with probability 0.35. We shall construct 2 such models: S for forward sequences (in I) and R for reverse sequences (in X\I). S(x) will be the probability that S assigns to getting x, while R(x) will be the probability that R assigns to that event. Given a sequence x∈X, the value C(x)=S(x)/R(x) turns out to be an excellent classifier for the problem! An even better classifier would take into account sequencing error. But this node isn't about computational biology.
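To make the ratio concrete, here is a minimal sketch in Python (the function names and the `floor` guard are my own invention, not from any standard library). In practice one computes log S(x) - log R(x) rather than the raw ratio, to avoid floating-point underflow on long sequences:

```python
import math

def log_score(model, seq, order=6, floor=1e-9):
    """Sum the log-probabilities the Markov model assigns to each transition.

    `model` maps an `order`-letter context string to a dict of next-letter
    probabilities; `floor` crudely guards against log(0) for transitions the
    model has never seen (pseudocounts, discussed below, are the principled fix).
    """
    total = 0.0
    for i in range(order, len(seq)):
        p = model.get(seq[i - order:i], {}).get(seq[i], floor)
        total += math.log(max(p, floor))
    return total

def classify(S, R, seq, order=6):
    """log C(x) = log S(x) - log R(x): positive suggests forward, negative reverse."""
    return log_score(S, seq, order) - log_score(R, seq, order)
```

Working in log space turns the product of per-position probabilities into a sum, which is both faster and numerically safer.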

How will we construct the model? We'll build it based on our list of examples! That is, for every statistical parameter in our model, we can compute an observed value from the examples. If our parameters are well-chosen, this immediately leads to a model!

In our case, we will scan every sequence in I. For every 6 letter subword, we will simply count how many times each letter appeared after the subword. Normalize the counts by dividing by the number of times the 6 letter subword appeared, and we have our model!
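Concretely, training by counting might look like this (a sketch with names of my own invention; note that it does no pseudocounting yet, which is exactly the problem addressed next):

```python
from collections import defaultdict

def train_markov(sequences, order=6):
    """Build an order-`order` Markov model by counting and normalizing.

    For every `order`-letter subword, count how often each letter follows it,
    then divide by the number of times the subword appeared.
    """
    counts = defaultdict(lambda: {c: 0 for c in "ACGU"})
    for seq in sequences:
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            counts[context][seq[i + order]] += 1
    model = {}
    for context, letter_counts in counts.items():
        total = sum(letter_counts.values())
        model[context] = {c: n / total for c, n in letter_counts.items()}
    return model
```

Any context never seen in training is simply absent from `model`, and any letter never seen after a context gets probability exactly 0 -- both symptoms of the problem below.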

How to count

The problem with this approach is that we will assign zero probability to any event we never saw in our training data! Indeed, our classifier may well end up dividing by zero while computing C(x); we may even divide by zero while attempting to construct the model! Clearly, something is very wrong here.

Suppose we never saw the sequence CCCCCC in our example set for I. Then we also saw 0 A's succeeding it, 0 C's, 0 G's and 0 U's. If we try to construct our Markov model, we will try such nonsense as dividing zero by zero. While there are many fanciful conventions for what zero/zero should be, none of them is justified in this situation. Or suppose we saw GGGGGG in our example set for X\I, but that it was never followed by C. Then R(x) will always be 0 for any sequence containing GGGGGGC -- no matter how much the rest of the sequence appears to be the reverse complement of a real RNA sequence! Clearly, we cannot accept a model which claims -- on the basis of never having seen any evidence for or against -- that the sequence GCCCCCC never appears in real RNA, and therefore that GGGGGGC never appears in the reverse complement of RNA.

So counting as a way to build a model seems like a really good idea, but the amounts of data we require appear to be prohibitive -- we need to have enough data to move every parameter away from 0! The solution is the pseudocount. Never write down "0" appearances of any event; instead, assume it occurred ε times. Typical values for ε are 1 or 0.001; the secret of a good probabilistic model is that its cut-offs are so sharp that it doesn't matter too much which ε you choose. More on ε -- below.

We now have 2 choices of how to count: one school of thought says to count "ε, 1, 2, ...", while the other says "ε, 1+ε, 2+ε, ...". Which should you pick? If ε≈1, then the second. Otherwise -- you probably guessed it -- it doesn't matter too much. ε will only make a difference in the absence of a "real" count.

In our example, we just add 0.01 to all counts (the second option above). Now, the case of the never-appearing "CCCCCC" is resolved: if CCCCCC ever does appear, the probability for it to be followed by A is 0.01/0.04=0.25, and similarly for C, G and U. This is nice: in the absence of any data, we "correctly" picked the intuitive guess: the uniform distribution. For GGGGGG, say we saw it 10 times followed by A, 0 times followed by C, and once by each of G and U. Then after GGGGGG the probability for A is 10.01/12.04=0.831, for C is 0.001, and for G and U 0.084. So C is extremely unlikely -- but never impossible!
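The arithmetic above is easy to check in code. A sketch (the function name is mine), using ε=0.01 added to every count:

```python
def pseudo_probs(counts, eps=0.01):
    """Add eps to every raw count, then normalize into probabilities."""
    total = sum(counts.values()) + eps * len(counts)
    return {letter: (n + eps) / total for letter, n in counts.items()}

# CCCCCC was never seen: all counts are zero, and we recover the uniform guess.
after_cccccc = pseudo_probs({"A": 0, "C": 0, "G": 0, "U": 0})
# GGGGGG was seen 12 times: 10 followed by A, 0 by C, once each by G and U.
after_gggggg = pseudo_probs({"A": 10, "C": 0, "G": 1, "U": 1})
```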

What are we counting?

Why do we do this? The simple answer is: because it works. This is actually a much better answer than it might appear: probabilistic modeling always involves wildly unrealistic assumptions to simplify the model into something remotely tractable for computation, analysis and application. Another wild assumption doesn't seem to be that bad.

Which still leaves us with the question of why it works. Frequentists probably have a harder time here than Bayesians, so I'll pretend to be a Bayesian. Bayesians always hold some statistical belief about the world. Even before they've examined a single measurement, they "know" that the best probabilistic model -- based on the evidence seen so far, of course -- is the uniform distribution. Since this isn't an all-out attack on Bayesians, I'll refrain from noting how silly this belief is -- or indeed that what the uniform distribution is depends mostly on the type of model chosen, and that that too is done before examining evidence. There is always a prior distribution.

Before a Bayesian examines her first RNA sequence, she knows that the distribution of both RNA sequences and their reverse complements is the same: the uniform random distribution of the world. Obviously, she cannot use this model to decide whether a given sequence x is regular or reversed. But think about it: this is exactly in line with our empirical leanings. Before we've seen any evidence regarding the world X, we cannot tell anything about x∈X!

Whenever a Bayesian receives a set of experimental results, she updates her prior distribution to get a posterior distribution. The experimental results have some relative degree of certainty p associated with them. If the a priori distribution is D_prior and the new experimental distribution is D_experiment, then D_posterior = (1-p)⋅D_prior + p⋅D_experiment. This process can continue for several rounds, but that's usually less important for pseudocounting. The important thing is the Bayesian's behaviour when the first experimental results are seen. So experimental results are added to the first prior distribution. When pseudocounting by adding ε to all counts, the chosen value of ε can easily be determined from p.
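Indeed, with a uniform prior the correspondence is exact: adding ε to each of k counts before normalizing yields the same distribution as mixing the uniform prior with the observed frequencies using p = N/(N + kε), where N is the total count. A small check (notation and function names are my own):

```python
def by_pseudocount(counts, eps):
    """(n + eps) / (N + k*eps): add eps to every count, then normalize."""
    N, k = sum(counts.values()), len(counts)
    return {c: (n + eps) / (N + k * eps) for c, n in counts.items()}

def by_mixture(counts, eps):
    """(1-p)*uniform + p*observed, with p = N / (N + k*eps)."""
    N, k = sum(counts.values()), len(counts)
    p = N / (N + k * eps)
    return {c: (1 - p) / k + p * n / N for c, n in counts.items()}
```

The two agree term by term: (1-p)/k = ε/(N+kε) and p⋅n/N = n/(N+kε), which sum to (n+ε)/(N+kε).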

So a Bayesian will always choose pseudocounting by adding ε (the second method above), not pseudocounting by replacing zero by ε. Given that (for small ε) it doesn't really matter which you choose, we've probably justified the other pseudocounting method too.

In fact, the method described in this example really does work, and with similar parameters. The best way to determine ε is empirically (i.e. try 1, 0.1, 0.01, 0.001 and 0.0001, and see which one gives the best results). Without pseudocounting, the missing data really screw things up for sequences sufficiently different from the norm to have been completely excluded from the training set.
