An important consideration in the design of parallel algorithms is cost optimality.

The amount of work a parallel algorithm performs is its run time multiplied by the number of processors it uses. A conventional (sequential) algorithm may be thought of as a parallel algorithm designed for one processor. A parallel algorithm is said to be cost optimal if the amount of work it does is (asymptotically) the same as the run time of the best known sequential algorithm.
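
As a minimal numeric sketch of this bookkeeping (not taken from the text), the Python below simply evaluates the work formula for one input size; the figures used (an n*log(n)-step sequential sort, and a hypothetical parallel sort using n/log(n) processors for (log n)^2 time) are illustrative assumptions:

    import math

    def work(run_time, processors):
        # Work = run time of the algorithm multiplied by the number of processors it uses.
        return run_time * processors

    n = 1_000_000
    sequential_time = n * math.log2(n)               # best known sequential sort: ~n*log(n) steps

    # Hypothetical cost-optimal parallel sort: n/log(n) processors running for (log n)^2 time.
    cost_optimal_work = work(math.log2(n) ** 2, n / math.log2(n))
    print(cost_optimal_work / sequential_time)       # ~1.0: the same work as the sequential algorithm

The ratio comes out at 1.0 because the hypothetical parallel sort performs exactly as much work as the sequential one, which is what cost optimality asks for (up to constant factors).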

It is easy to adapt a cost optimal algorithm designed to run on P processors in time N so that it runs on P/n processors in time Nn (the most basic way being to simulate the running of the P processors, assigning each real processor an n-sized pool of virtual processors to simulate). Pathologically, it may even be run on a single-processor machine with the same asymptotic run time as the best sequential algorithm.
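
A minimal sketch of this simulation argument, assuming each virtual processor's work for one parallel step can be captured as a step function (the names run_on_fewer_processors and make_step are illustrative, and a real implementation would also have to synchronize shared memory between parallel steps):

    def run_on_fewer_processors(virtual_steps, real_processors):
        # virtual_steps holds one step function per virtual processor (P of them).
        # Each real processor simulates a pool of P/real_processors virtual
        # processors, so a single parallel step costs that many real steps.
        P = len(virtual_steps)
        pool_size = P // real_processors
        for worker in range(real_processors):
            pool = virtual_steps[worker * pool_size:(worker + 1) * pool_size]
            for step in pool:                 # run one after another: the factor-n slowdown
                step()

    # Example: P = 8 virtual processors simulated on 2 real processors (a pool of 4 each).
    results = [0] * 8

    def make_step(i):
        def step():
            results[i] = i * i                # stand-in for one virtual processor's work
        return step

    run_on_fewer_processors([make_step(i) for i in range(8)], real_processors=2)
    print(results)                            # [0, 1, 4, 9, 16, 25, 36, 49]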

It is possible for an algorithm to be faster than the best sequential algorithm without being cost optimal. For example, we can sort a sequence of length n by using n! processors, each of which permutes the sequence according to its processor id (constant time) and then checks whether the result is sorted (O(n)), for an asymptotic run time of O(n). The best possible sequential comparison-based sorting algorithm has a run time of O(n log(n)), making this algorithm a factor of log(n) faster than the best sequential algorithm. Unfortunately, this parallel algorithm requires n! processors [1] (as opposed to one for the sequential algorithm), meaning that it does a factor of n!/log(n) more work than the sequential algorithm. If n! processors are not available (which is pretty much guaranteed for all but tiny values of n), this algorithm will run significantly more slowly than the conventional one.
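
A sequential Python simulation of this permutation sort may make the example more concrete; since n! processors are not actually available, each loop iteration below plays the role of one virtual processor:

    from itertools import permutations

    def permutation_sort(seq):
        # One virtual processor per permutation of the input (n! of them).
        for perm in permutations(seq):
            # Each processor checks in O(n) whether its permutation is sorted;
            # the processor holding the sorted permutation reports it.
            if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):
                return list(perm)

    print(permutation_sort([3, 1, 2]))        # [1, 2, 3]

Run this way on one machine, the loop visits up to n! permutations and each sortedness check costs O(n), so the run time is O(n * n!), the full work count, which is exactly why a non-cost-optimal algorithm degrades so badly when its processors are missing.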

[1] There exist algorithms that sort in O(n) time on n² processors, but they are much harder to explain. Rather than blindly permuting, they compare each entry to every other entry (all n² comparisons in parallel), store the results, and use them to build a sorted list in constant time.
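
A sequential Python simulation of the footnote's idea (often called enumeration or rank sort); it assumes all keys are distinct, since ties would need a tie-breaking rule such as comparing indices:

    def rank_sort(seq):
        # One virtual processor per (i, j) pair performs a single comparison,
        # so all n^2 comparisons could happen in one parallel step.
        n = len(seq)
        comparisons = [[seq[j] < seq[i] for j in range(n)] for i in range(n)]
        out = [None] * n
        for i in range(n):
            rank = sum(comparisons[i])        # number of entries smaller than seq[i]
            out[rank] = seq[i]                # write seq[i] straight to its final position
        return out

    print(rank_sort([4, 1, 3, 2]))            # [1, 2, 3, 4]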