display | more...

In many different classes of problems, probabilistic analysis is a useful technique to discover properties and solutions. One such set of problems is the analysis of the running times of algorithms. However, to use this analysis technique (really, this class of techniques), you must know or assume something about your input distribution. Even when you do know something about the distribution of the inputs, it may or may not be effectively modeled. Also, the assumptions you make may be incorrect or misleading.

An assumption about distribution that is often made is that there is a uniform distribution of inputs. The following simple problem shows how such an assumption might be reached, and shown valid.

Start with the set of positive integers between 1 and n. During each round, pick some number out of that set that has not already been picked. If the number you pick is higher than any number yet picked, add the number to a 'chosen' set. Otherwise throw it away. Do this until all numbers are exhausted. In the average case, how many numbers will have been 'chosen'?

The smallest final set would contain just the integer n. This would happen with any sequence of choices where n was the first to be picked. The largest set contains all of the integers between 1 and n, occuring only when you pick the numbers in the order {1, 2, 3, ..., n-1, n}. The probability that you will pick the number n at any ith pick (1st pick, 2nd pick, 3rd pick, etc) is 1/n*. Note the use of 'any'. Restated, the chance that you will choose n at a specific pick is uniform across all of the possible rounds to pick. Also, see that the chance of picking a specific element (number) p between 1 and n in the first round is also 1/n*. In fact, the chance that you pick any specific element during any specific round is the same: 1/n. Thus when you pick the numbers yourself, the input distribution is uniform.

This problem, perhaps, seems trivial and contrived. In the specific application, it is. However, many algorithms keep a 'running sum' or 'running set' to determine the solution.

Suppose, instead of picking the numbers between 1 and n yourself, someone else gives you a list of numbers. Instead of picking abritrarily from the set between 1 and n, you must pick starting at the first number in the list given you, and on towards the last. You cannot make any assumptions about the order the numbers were given you, and so cannot make any assumption about the distribution of the input. However, if you could blindly change the order of the list given you so that there was an even chance that any ordering could occur, the average number of numbers in the final 'set' after you have processed the list given you would be the same as the original problem.

Two common methods to 'randomly' reorder lists in such a fashion are as follows:

• You could randomly assign each number between 1 and n a priority (that is, the priority is not necessarily the same as the number, and in fact is most likely different), and sort the numbers according to the priority assigned. Choose a large enough range of priorities such that there is only a low possibility there will be non-unique priorities (speaking of assigning possibilities and probabilities in solving a problem of probabilities is perhaps unwise. Deal with it).
• You could, starting at the begining of the list, swap each number with a number appearing later in the list. On the ith iteration through the list, the element i spots from the first element will be swapped with some element between itself and the end of the list.

Many algorithms use these or similar methods of reordering inputs to achieve a desired probability distribution of inputs. Others use randomness by choosing an arbitrary position in a list to be a pivot (the randomized versions of quicksort use this), or by randomly choosing how many links will be outgoing from a specific element in the list (skip lists achieve the good qualities of both linked lists and binary search trees using this).

Ah, and the average number of elements in the final set, as described by the problem earlier? It is the ceiling of the binary logarithm of n - ceil(log2(n)). The proof of this is left as an exercise for the student.

* To see this, note that the probability that the number will be picked at some point is unity, or 1. As you are picking blindly (assuming you have no bias or preference towards specific numbers), the chance that you picked n first is 1 divided by the number of elements you are choosing from -- n. Thus, 1/n.

Log in or register to write something here or to contact authors.