As Sam Clemens once wrote (crediting Disraeli), "There are three kinds of lies: lies, damned lies, and statistics." In this HOWTO, we'll teach you how to make him roll over in his grave by abusing statistics.

There are a number of techniques to consider in lying with statistics:

Use a small, biased sample
4 out of 5 doctors surveyed recommend using our product!
Of course, there are 4 doctors on our board. We asked them first. Then we called one at random from the AMA's member directory. Any correlation between our survey method and the results is pure serendipity.
Use a self-selecting population
4 out of 5 of our female subscribers who responded to our survey cheated on their husbands!
Sure, but you have two self-selections. First, you only surveyed your subscribers, not a random sample of all women. Second, since people had to choose to respond, the ones most likely to respond were the ones who had cheated.
Tie one data set to an earlier one, implying causality
100% of all crack addicts drank water before becoming addicted to crack. Water kills!
The more interesting (and likely truthful) statistic is "How many water drinkers become addicted to crack?" By tying together two unrelated (or even semi-related) groups, nearly anything can be proved. Try these out for size:
Use a lower confidence to gain a higher probability
Over 80% of America watches our show!
A sample only estimates the population at large. If 80% of a sample has some attribute, then there is a confidence level attached to saying, "at least 80% of the population has this attribute." The higher the percentage you claim, the lower the confidence you can have in the claim; the lower the percentage, the higher the confidence.
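To see the trade-off in numbers, here is a rough sketch. It assumes a hypothetical survey of 100 viewers in which 80 say they watch the show, and it uses the normal approximation to a proportion; the function names are made up for illustration.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def confidence_at_least(sample_share, n, claimed_share):
    """Approximate one-sided confidence that the population share is
    at least claimed_share, given the share seen in a sample of size n.
    Uses the normal approximation, so it is only a rough guide."""
    std_error = math.sqrt(sample_share * (1.0 - sample_share) / n)
    return normal_cdf((sample_share - claimed_share) / std_error)

# Hypothetical survey: 80 of 100 viewers watch the show.
for claim in (0.70, 0.75, 0.80, 0.85):
    conf = confidence_at_least(0.80, 100, claim)
    print(f"'at least {claim:.0%}' -> confidence about {conf:.1%}")
```

The bolder the claim, the less confidence backs it. "Over 80%" from a sample that shows exactly 80% is close to a coin flip.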
Use obscure definitions and data sets
50% of Yankees are let go from their jobs at least once a year. The Yankee work ethic makes it hard to keep a job.
First, what's a Yankee? An American, a New Englander, a Vermonter, a woodchuck? What does "let go" mean? Perhaps this statistic started as "50% of rural Vermonters have a second job in a seasonal industry, which supplements their annual income."
Compare a statistic that affects most of the population to one that affects a small portion of the population
You are more likely to be hit by lightning thrice than attacked by a shark.
Well, let's see. Who is vulnerable to lightning strikes? Just about everyone in the world. Who is vulnerable to shark attack? Only those who swim in shark-infested waters. If you have a population in which 100% of the members have a 0.05% chance of event A happening to them, and only 5% of the members are exposed to event B at all, with a 2% chance of it happening to them, then the following three facts are true:
  1. Event B happens with greater frequency than event A.
  2. People in the 5% group are more likely to have event B happen than event A.
  3. People in the other 95% are more likely to have event A happen than event B.
Unless you know which group someone is in, you can't really predict which is more likely to happen.
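The arithmetic behind those three facts can be checked in a few lines. The population size below is an arbitrary assumption; the percentages are the ones from the example.

```python
population = 1_000_000   # arbitrary population size, for illustration only

p_a = 0.0005             # event A: 0.05% chance, everyone is exposed
share_exposed_b = 0.05   # event B: only 5% of people are exposed at all
p_b_exposed = 0.02       # ...and they have a 2% chance
p_b_unexposed = 0.0      # the other 95% cannot have event B happen

expected_a = population * p_a
expected_b = population * share_exposed_b * p_b_exposed

print(f"expected cases of A: {expected_a:.0f}")
print(f"expected cases of B: {expected_b:.0f}")
print("exposed group, B more likely than A:", p_b_exposed > p_a)
print("everyone else, A more likely than B:", p_a > p_b_unexposed)
```

Overall, B turns up twice as often as A, yet for 95% of the population A is the only one of the two that can happen at all.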
This is just a primer. There is a statistics textbook by the same name as this node* for some more suggestions. Or, next time you hear an implausible statistic, try to figure out the fact behind the fancy.

*This is a hint that maybe the next wu here should be a book review.

This node is meant to be a somewhat more technical retelling of the above, with a focus on how to avoid being duped.

The first way in which statistics can be distorted is through the type of sample used. Rarely is an entire population surveyed in a study. More often, a sample is taken and the data from that sample is extrapolated onto the rest of the population. It is thus vitally important that the sample be judiciously chosen.

Let’s imagine we are looking for the average height of Canadians. We choose three Canadians at random and measure their heights. The mean we compute from that sample is a random variable (the term has a precise meaning here): it could fall anywhere within a particular range, but not every value in that range is equally probable. Now let’s imagine that we take 50 samples of three people each and plot the 50 sample means on a histogram. The mean of this histogram (the average of the fifty averages) is the population mean as nearly as we can determine it.

Each of the fifty data sets could also be plotted on a histogram. Because each sample is so small, there is a relatively high chance of an unusually tall or short person turning up in it and making its mean dramatically different from the mean of the entire population. In a larger sample, one unusual person matters less. Therefore, as our sample size gets larger, the distribution of the sample averages will have less spread.

If we knew the true mean height of the Canadian population, we could put it on the histogram from one trial. It will usually be either too large or too small when compared with our estimated value. How far off it will be depends on the sample size of the trial. Since a larger sample represents the whole population more effectively, it makes sense that it would do a better job of estimating the true value.
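A small simulation makes the effect of sample size visible. This is only a sketch: the population mean of 170 cm and standard deviation of 10 cm are invented for illustration, not real figures for Canadians, and heights are assumed to be normally distributed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN_CM = 170.0  # assumed population mean (illustrative only)
TRUE_SD_CM = 10.0     # assumed population standard deviation

def spread_of_sample_means(sample_size, trials=1000):
    """Draw many samples of the given size and return the standard
    deviation of their means, i.e. the spread of the histogram."""
    means = []
    for _ in range(trials):
        sample = [random.gauss(TRUE_MEAN_CM, TRUE_SD_CM)
                  for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

spreads = {n: spread_of_sample_means(n) for n in (3, 30, 300)}
for n, spread in spreads.items():
    print(f"sample size {n:>3}: spread of sample means = {spread:.2f} cm")
```

The spread shrinks roughly with the square root of the sample size, so a sample a hundred times larger gives a histogram about ten times narrower.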

About 95% of the time, the true value of the mean height of the Canadian population will be within two standard deviations of the estimated value. The standard deviation in question is that of the histogram of sample means (often called the standard error), and the 95% figure assumes that histogram is roughly bell-shaped. This standard deviation will become smaller as the sample becomes larger. This means that the area in which the true value almost certainly lies on a histogram becomes smaller when a larger sample size is used. This concept may be more familiar than you think.

Consider polls. When a poll result is stated, it is usually in the form: “55% of Canadians say Jean Chretien should play more golf, plus or minus 5% 19 times out of 20.” The “19 times out of 20” is the same 95% from the above paragraph. This means that 5% represents roughly twice the standard deviation of the sampling distribution from which the 55% value is drawn. The pollsters are giving you the standard deviation in disguise!
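You can run the pollsters' arithmetic yourself. The sketch below assumes a simple random sample and the usual normal approximation, with 1.96 standard deviations for 95% (the "two" used above is that number rounded); the function names are made up for illustration.

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Half-width of the approximate 95% interval for a polled share."""
    return z * math.sqrt(share * (1.0 - share) / sample_size)

def sample_size_for(share, margin, z=1.96):
    """Smallest sample size whose margin of error is at most `margin`."""
    return math.ceil(share * (1.0 - share) * (z / margin) ** 2)

n = sample_size_for(0.55, 0.05)
print(f"respondents needed for +/-5% at a 55% result: {n}")
print(f"margin of error with that many respondents: {margin_of_error(0.55, n):.3f}")
print(f"margin of error with only 50 respondents: {margin_of_error(0.55, 50):.3f}")
```

Under these assumptions a poll quoting plus or minus 5% needs only a few hundred respondents, and cutting the margin in half takes about four times as many.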

Another common method by which statistics are fudged is conditioning. This is the process of selecting specific subsamples within a data set for comparison. An example is the average male wage compared with the average female wage. The manner in which this is done affects the results you get.

Studies have shown that kids who go to private schools earn 10% more, on average, than those who go to public schools. What does this mean? If we change the conditioning to take neighbourhood and background into account, we see the difference reduced to zero for kids from good areas. This essentially means that the marginal impact of going to private school if you already live in a good area (high average income, low unemployment) is quite small. By contrast, students coming from a poor area stand to gain 10% in their average income by going to private school. Such statistical evidence (keep in mind that this is just an example) can lead to government policy decisions. The above conclusion would support a proposal for vouchers allowing poor kids to go to private school, for example.
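Here is what conditioning looks like on a toy data set. Every record below is invented, chosen only so that the numbers reproduce the 10%, 0% and 10% figures of the example; nothing here is real survey data.

```python
# (school type, neighbourhood, income): invented records, illustration only
records = (
    [("private", "good", 60_000)] * 11 + [("private", "poor", 33_000)] * 9 +
    [("public", "good", 60_000)] * 9 + [("public", "poor", 30_000)] * 11
)

def mean_income(rows):
    return sum(income for _, _, income in rows) / len(rows)

def private_premium(rows):
    """Relative income gap between private and public school students."""
    private = mean_income([r for r in rows if r[0] == "private"])
    public = mean_income([r for r in rows if r[0] == "public"])
    return private / public - 1.0

print(f"everyone together: {private_premium(records):+.1%}")
for area in ("good", "poor"):
    subset = [r for r in records if r[1] == area]  # condition on neighbourhood
    print(f"{area} neighbourhoods only: {private_premium(subset):+.1%}")
```

The same records give a different story depending on which subsample you look at, which is why it pays to ask what a quoted average was conditioned on.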

One final statistical trick I shall examine is that of scale. Somebody can call an increase in inflation from 2% to 3% (as calculated by the Consumer Price Index, for example) a “50%” jump. In actuality, the change was a single percentage point. Whenever percentage changes are used to examine changes in small values, alarmingly large percentage changes can result. For this reason, if you are presented with very large percentage changes you ought to keep in mind that they may simply represent small variations in small quantities. For the GDP of Luxembourg to grow by 10 or even 50% represents very little actual growth compared with the GDP of the United States growing even 1%.
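The difference between a relative change and an absolute one takes only a few lines to show. The two economy sizes below are hypothetical round numbers in arbitrary units, not actual GDP figures.

```python
def relative_change(old, new):
    """Change expressed as a fraction of the starting value."""
    return (new - old) / old

def absolute_change(old, new):
    """Change expressed in the original units."""
    return new - old

# Inflation going from 2% to 3%
print(f"relative change: {relative_change(2.0, 3.0):.0%}")
print(f"absolute change: {absolute_change(2.0, 3.0):.0f} percentage point")

# Hypothetical economies, arbitrary units (not real GDP data)
small_economy, large_economy = 50.0, 20_000.0
small_growth = small_economy * 0.50   # small economy grows by 50%
large_growth = large_economy * 0.01   # large economy grows by 1%
print(f"small economy +50%: {small_growth:.0f} units")
print(f"large economy  +1%: {large_growth:.0f} units")
```

A headline percentage says nothing until you know the size of the thing it is a percentage of.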

Remember, people can only lie to you with statistics if you let them! Be aware of how they work and you will be a less gullible member of society.
