Measures of central tendency are computable values that describe the behavior of the center of a distribution. One basic rule of thumb for a measure of central tendency is that if all datapoints are pushed away from the measured point by an equal proportion, the measured point will not change significantly.
An analysis of various measures of central tendency in a skewed distribution
Many of the standard measures of central tendency are usually introduced on a symmetric distribution, such as the normal (Gaussian) distribution, or bell curve. These measures, such as arithmetic mean, median, and mode, are easy to grasp - in fact, in a bell curve, all three of these are equal. But when they are applied to a skewed distribution, we discover some interesting facts about the way they interact. First, a sample data set:
Number of datapoints per score:
       58
        X
        X
    41  X 42
     X  X  X
     X  X  X 32 29
     X  X  X  X  x
     X  X  X  X  X
     X  X  X  X  X 16    16
     X  X  X  X  X  X 12  X
  5  X  X  X  *  X  X  X  X  8  4     8  5
  X  X  X  X  X  X  X  X  X  X  x  2  X  X  1  2  1  3  2  2  1  1  1  1  1  1  1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 24 25 31 32 33 39 46 47 50
     |  \-Interquartile/              |
     \----Interdecile-----------------/
Datapoints: 301
arithmetic mean: 5.76
median: 4 (bolded asterisk)
mode: 2
interquartile mean: 3.80
interdecile mean: 4.33
sampled mean (deciles): 4.77
sampled mean (quartiles): 4.33
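The first few summary figures can be checked directly from the per-score counts. The following is a minimal Python sketch, assuming the counts are read off the chart as transcribed in `COUNTS` (the names `COUNTS` and `data` are illustrative, not part of the original analysis):

```python
from statistics import mean, median, multimode

# Score -> number of datapoints, transcribed from the chart (an assumption
# of this sketch; the counts should add up to 301).
COUNTS = {
    0: 5, 1: 41, 2: 58, 3: 42, 4: 32, 5: 29, 6: 16, 7: 12, 8: 16, 9: 8,
    10: 4, 11: 2, 12: 8, 13: 5, 14: 1, 15: 2, 16: 1, 17: 3, 18: 2, 19: 2,
    20: 1, 21: 1, 22: 1, 24: 1, 25: 1, 31: 1, 32: 1, 33: 1, 39: 1, 46: 1,
    47: 1, 50: 1,
}

# Expand the histogram into a sorted list of individual datapoints.
data = sorted(score for score, n in COUNTS.items() for _ in range(n))

print(len(data))             # 301
print(round(mean(data), 2))  # 5.76
print(median(data))          # 4
print(multimode(data))       # [2]
```

Expanding the histogram into a flat list is wasteful for large datasets, but it keeps the sketch short and lets the standard library do the work.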
This right-skewed (or right-tailed) distribution shifts the relationship between mode, median, and arithmetic mean in a predictable fashion. The mode moves to the left of the median, and the arithmetic mean moves to the right. This is expected.
The arithmetic mean, in addition to being "the average of all points", is the point where the sum of the squared distances between each datapoint and that point is minimized (the sum of the plain distances is minimized by the median instead). The arithmetic mean is fairly susceptible to the influence of outliers - weird values that fall far outside a regular pattern. For example, shifting the single datapoint at 50 in the above distribution to 100 changes the arithmetic mean to 5.93 (+0.17). As a measure of central tendency, then, it may be a poor choice (unless you are an insurance company - you need to know exactly what the average settlement is).
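The outlier example can be replayed in a few lines. This is a small sketch, assuming the per-score counts are transcribed from the chart as in `COUNTS` below (the variable names are illustrative):

```python
from statistics import mean, median

# Score -> number of datapoints, transcribed from the chart (assumption).
COUNTS = {
    0: 5, 1: 41, 2: 58, 3: 42, 4: 32, 5: 29, 6: 16, 7: 12, 8: 16, 9: 8,
    10: 4, 11: 2, 12: 8, 13: 5, 14: 1, 15: 2, 16: 1, 17: 3, 18: 2, 19: 2,
    20: 1, 21: 1, 22: 1, 24: 1, 25: 1, 31: 1, 32: 1, 33: 1, 39: 1, 46: 1,
    47: 1, 50: 1,
}
data = sorted(score for score, n in COUNTS.items() for _ in range(n))

# Move the single datapoint at 50 out to 100 and leave everything else alone.
shifted = sorted(100 if x == 50 else x for x in data)

print(round(mean(data), 2), round(mean(shifted), 2))  # 5.76 5.93
print(median(data), median(shifted))                  # 4 4
```

One datapoint out of 301 moves the mean by 50/301, or about 0.17, while the median does not notice the change at all.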
Median, as a measure of central tendency, is also susceptible to small changes. First, it is tied rigidly to a single point (the asterisk on the chart above). Any of the other 300 values in this dataset could change and, as long as each stayed on its own side of that point, the median would be invariant, thus not capturing the effect of the change. Also, it is a discrete value - with an odd number of datapoints it can never assume a value outside the values in the dataset (in this case, integers). For an image processor, however, this may be the important feature.
The mode matches the peak of this curve, and is useful for capturing the usual behavior of the population. The mode is useful for targeting; if you're advertising, this may be your initial target market.
To find measures that aren't as affected by small changes, we can look to sample the central section, and take the mean of that area. Two obvious choices arise - the first is the Interquartile Mean (IQM), and the second is the Interdecile Mean (IDM). The IQM cuts off 25% from each end of the distribution (for a total of 50%), and the IDM cuts off 10% from each end (for a total of 20%). What do we see about the relationship between these two numbers?
The IDM captures the effect of the tail of the skew more than the IQM does.
By limiting itself to the central 50%, the IQM is largely unaffected by the right tail, and is more affected by the "hump" at the mode, 2.
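Both trimmed means can be sketched with one helper. This is a rough Python illustration, assuming the counts transcribed from the chart and assuming that the number of points cut from each end is rounded down to a whole datapoint. The function name `trimmed_mean` is hypothetical, and other definitions weight the boundary points fractionally:

```python
# Score -> number of datapoints, transcribed from the chart (assumption).
COUNTS = {
    0: 5, 1: 41, 2: 58, 3: 42, 4: 32, 5: 29, 6: 16, 7: 12, 8: 16, 9: 8,
    10: 4, 11: 2, 12: 8, 13: 5, 14: 1, 15: 2, 16: 1, 17: 3, 18: 2, 19: 2,
    20: 1, 21: 1, 22: 1, 24: 1, 25: 1, 31: 1, 32: 1, 33: 1, 39: 1, 46: 1,
    47: 1, 50: 1,
}
data = sorted(score for score, n in COUNTS.items() for _ in range(n))


def trimmed_mean(values, fraction):
    """Mean of what is left after dropping `fraction` of the points per end.

    Simplification: the number of points dropped from each end is rounded
    down to a whole datapoint.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)       # points to drop from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


iqm = trimmed_mean(data, 0.25)  # central 50% of the datapoints
idm = trimmed_mean(data, 0.10)  # central 80% of the datapoints
print(iqm, idm)
```

Under these assumptions the two results land close to the 3.80 and 4.33 quoted above; the exact second decimal depends on how the points at the cut boundaries are treated.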
Another way to measure the central tendency is to take a sampled mean - that is, sample the points at a regular interval, and average those points. The median is, in fact, a single-point sampled mean. If we sample at each quartile (3 points), we get a sampled mean of 4.33, while the decile sampling (9 points) gives us a sampled mean of 4.77 - this is consistent, as the more we sample, the closer we should be to the arithmetic mean (which is derived by sampling the entire dataset). Like the median, sampled means rely on a small number of values, making them more susceptible to manipulation.
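A sampled mean is easy to sketch once the data is sorted. The helper below is a hypothetical illustration (`sampled_mean` is not a standard function), assuming the counts transcribed from the chart and a simple rank-based lookup at each cut point; other quantile conventions interpolate between neighbouring values:

```python
# Score -> number of datapoints, transcribed from the chart (assumption).
COUNTS = {
    0: 5, 1: 41, 2: 58, 3: 42, 4: 32, 5: 29, 6: 16, 7: 12, 8: 16, 9: 8,
    10: 4, 11: 2, 12: 8, 13: 5, 14: 1, 15: 2, 16: 1, 17: 3, 18: 2, 19: 2,
    20: 1, 21: 1, 22: 1, 24: 1, 25: 1, 31: 1, 32: 1, 33: 1, 39: 1, 46: 1,
    47: 1, 50: 1,
}
data = sorted(score for score, n in COUNTS.items() for _ in range(n))


def sampled_mean(values, parts):
    """Average the values at the parts-1 evenly spaced interior cut points."""
    ordered = sorted(values)
    n = len(ordered)
    # Rank-based lookup without interpolation (an assumption of this sketch).
    picks = [ordered[(n - 1) * i // parts] for i in range(1, parts)]
    return sum(picks) / len(picks)


print(sampled_mean(data, 2))   # one point: the median
print(sampled_mean(data, 4))   # three points: the quartiles
print(sampled_mean(data, 10))  # nine points: the deciles
```

For this dataset the quartile and decile results work out to 13/3 and 43/9, which agree with the 4.33 and 4.77 quoted above up to rounding of the last digit.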
Coming soon! A multivariate skew distribution!