Take a list of statistical data: measurements, quantities, something like that. Pretty much anything that's a real number with units. An atlas is a good source of these: populations, areas, river lengths, rainfall, you name it. Do not use percentages (which are bounded), phone numbers (which are digit strings) or number-theoretic sequences (which may be weirdly distributed).

Now look at all the first (leftmost) digits. (If numbers are less than 1, take the first significant figure, that is, the first digit of the mantissa.) With a sufficiently large sample, what proportion of those first digits will be 1?
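If you'd rather let a computer do the looking, here is a minimal Python sketch of the extraction step. The helper name `leading_digit` and the sample values are made up for illustration; it simply scans the number's decimal representation for the first nonzero digit, which also handles values less than 1.

```python
def leading_digit(x):
    """Return the first significant (leftmost nonzero) digit of x as an int."""
    if x == 0:
        raise ValueError("zero has no significant digits")
    # repr() gives strings such as '0.0034', '12345.6' or '1e-05'. The mantissa
    # always comes first, so the first nonzero digit we meet is the one we want.
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"no digits found in {x!r}")


# Illustrative values only, not real atlas data.
samples = [6650, 0.0034, 83.2, 1e-05, 149_600_000]
print([leading_digit(v) for v in samples])  # → [6, 3, 8, 1, 1]
```

Working from the printed representation sidesteps the rounding trouble you can get from dividing by powers of ten.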

If you said 1/10, guess again.

If you said 1/9, because no number begins with zero, you're nearly as far off. The correct answer is over 30%. Don't believe me? Get an atlas and try. Or read on. Or both.

The reason this works is a simple mathematical consequence of the way the numbers are distributed. Because they are calculated relative to an arbitrary measuring unit, the limiting distribution (for which all statistics would be precisely equal to their theoretical values) must have the property of invariance under multiplication by constants: if you multiply every sample value by the same constant c, you get the same data measured relative to a unit c times smaller, so it has the same distribution. (Your naïve idea that there would be as many values between 300 and 400 as between 400 and 500 was based upon the notion of invariance under addition of constants, which doesn't hold here.)
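To see the invariance claim in action, here is a rough sketch using synthetic data rather than anything from an atlas. It assumes an idealised sample spread evenly on a log scale over six decades, rescales it by an arbitrary constant (2.54, as if converting inches to centimetres), and for contrast adds a constant instead. All names and constants here are illustrative choices.

```python
from collections import Counter


def leading_digit(x):
    """First nonzero digit in the decimal representation of x."""
    return int(next(ch for ch in repr(abs(x)) if ch in "123456789"))


def digit_shares(values):
    """Proportion of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


# Synthetic sample: evenly spaced on a log scale from 1 up to 10**6.
N = 60_000
data = [10 ** (6 * (k + 0.5) / N) for k in range(N)]

original = digit_shares(data)
rescaled = digit_shares([2.54 * v for v in data])  # change of unit
shifted = digit_shares([v + 500 for v in data])    # change of origin

for d in range(1, 10):
    print(d, f"{original[d]:.3f}", f"{rescaled[d]:.3f}", f"{shifted[d]:.3f}")
```

The first two columns should agree closely, while the third does not: multiplying by a constant leaves the first-digit proportions alone, but adding one scrambles them.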

For the data to be invariant under multiplication means precisely that the logs of the data are invariant under addition. What's more, if our logs are to base 10 then adding 1 to the log simply corresponds to multiplying by 10, which doesn't affect the first digit. So we need only look at the fractional parts of the logs: additive invariance then requires that these fractional parts be uniformly distributed over the interval [0, 1). The part of that range corresponding to numbers beginning with 1 is the subinterval [log 1, log 2) = [0, 0.30103...).
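The correspondence between first digits and fractional parts of logs can be checked directly. This is only a sketch, with hypothetical helper names `log_fraction` and `starts_with_one`, and it assumes positive inputs; values sitting exactly on a digit boundary may be nudged either way by floating-point rounding.

```python
import math


def log_fraction(x):
    """Fractional part of log10(x) for x > 0, in [0, 1) up to rounding."""
    return math.log10(x) % 1.0


def starts_with_one(x):
    """True when the fractional part of the log falls in [log 1, log 2)."""
    return log_fraction(x) < math.log10(2)


# Multiplying by 10 adds 1 to the log and leaves the fractional part alone.
print(log_fraction(14.2), log_fraction(142.0))

print([starts_with_one(x) for x in (0.0017, 1.9, 14.2, 150000)])  # all True
print([starts_with_one(x) for x in (0.0034, 2.5, 99)])            # all False
```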

So the proportion of numbers that begin with 1 is log 2, or about 30.1%.

Similarly, the proportion of numbers that begin with the digit n (0 < n < 10) is p(n) = log (n+1) - log n. This number gets steadily smaller as n goes from 1 up to 9, and the sums telescope neatly: for example p(1) = p(2) + p(3) = p(4) + p(5) + p(6) + p(7), each of which equals log 2.
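The formula is easy to tabulate. The following sketch evaluates p(n) for each digit and checks the identities mentioned above; the variable name `p` mirrors the text, and everything else is illustrative.

```python
import math

# p(n) = log(n+1) - log(n), with logs to base 10.
p = {n: math.log10(n + 1) - math.log10(n) for n in range(1, 10)}

for n, share in p.items():
    print(f"{n}: {share:.1%}")

# The nine proportions cover every possible first digit, so they sum to 1.
print(math.isclose(sum(p.values()), 1.0))
# The sums quoted in the text telescope to log 2.
print(math.isclose(p[1], p[2] + p[3]))
print(math.isclose(p[1], p[4] + p[5] + p[6] + p[7]))
```

The printed table runs from about 30.1% for a leading 1 down to about 4.6% for a leading 9.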

For numbers expressed in a different base, just change the base of your logarithms to match.
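As a sketch of what changing the base looks like in practice, the hypothetical function `first_digit_shares` below applies the same formula with logs to an arbitrary integer base, letting the digit run from 1 up to one less than the base.

```python
import math


def first_digit_shares(base):
    """Proportion of numbers starting with each nonzero digit in the given base."""
    if base < 2:
        raise ValueError("base must be an integer of at least 2")
    return {d: math.log(d + 1, base) - math.log(d, base) for d in range(1, base)}


for b in (2, 8, 10, 16):
    print(b, f"{first_digit_shares(b)[1]:.4f}")
```

Two sanity checks fall out of this: in base 2 every nonzero number starts with 1, so the proportion is 1, and in base 16 the proportion for a leading 1 is log 2 to base 16, which is exactly 1/4.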
