Stragglers are one of the main reasons why the actual speed-up of a parallelized computation is always worse than the theoretical speed-up predicted by Amdahl's Law (this writeup assumes that the reader is familiar with Amdahl's Law at least to the extent covered by the various Amdahl's Law writeups). A straggler is a processor which is still working on its share of the parallelized computation when most or all of the other processors have completed their shares.

There are a number of different types of straggler problems:

  • the nature of the computation is such that it can't be distributed perfectly fairly across all of the processors.

    A classic example of this type of straggler problem is when there are 100 units of work of equal size which must be distributed across 8 processors. Some processors will do 12 units of work and others will do 13 units of work. The processors which end up having to do 13 units of work will be stragglers.

  • the nature of the computation is such that it isn't possible to predict in advance how much effort will be required to compute each unit of work and, as a consequence, the original distribution of work across the processors is potentially unbalanced.

    An example of this type of straggler problem is when each unit of work involves searching through a data structure for a value. The ease with which a target value is found for a given unit of work determines the effort required to perform (i.e. compute) the unit of work. Since the effort can't be known until the computation has been completed, it may not possible to distribute the work fairly across some number of parallel processors.

    The classic solution to this type of straggler problem is to redistribute the unstarted units of work across the processors once it becomes apparent that the current work distribution is unfair. This technique is called load leveling or load balancing. It can be very effective but it can also be quite complicated to implement in such a way that the computational cost of the load leveling isn't prohibitive. In addition, the presence of the load leveling code might disrupt the flow of the program (i.e. it can be a maintenance problem).

    Load leveling isn't always required even if the cost of a unit of work can't be predicted in advance. For example, suppose that there are one million of units of work to be performed in parallel and the cost of each unit of work is reasonably random (i.e. no hidden patterns like the first 100,000 units of work happen to be much more expensive than the last 100,000 units of work). Stragglers can probably be avoided by simply assigning each processor the same number of units of work and relying on good old fashioned statistics theory to ensure that each processor does the same amount of computation (e.g. if one million units of work are spread across 100 processors then each processor will be assigned 10,000 units of work and the computation required to perform each of these sets of 10,000 units of work will be roughly the same).

  • A different but related type of straggler problem is when the cost of each unit of work varies but is predictable in advance. A simple yet classic example of this type of straggler problem is an algorithm working on a triangular matrix:
    DO I = 1, N
       DO J = 1, I
          computation dependent upon I and J
       ENDDO
    ENDDO
    
    In this example, the amount of work for iteration I = 1 is considerably less than the amount of work for iteration I = N. With this sort of distribution of unit of work costs, it may be fairly easy to distribute the units of work such that each processor has a different number of units of work but the total effort required for each processor's units of work is roughly the same.

    If the cost of each unit of work is predictable but doesn't follow any useful pattern then things can get rather ugly. The programmer is typically faced with deciding between implementing some sort of load leveler (i.e. treat this as a situation in which the cost of a unit of work is unpredictable) or figuring out how to distribute the work fairly. It's quite possible that neither alternative ends up being particularily attractive (one additional possibility worth considering is to deliberately distribute the work randomly across the processors in the hope that the random distribution will flatten out the "lumps"; the catch with this approach is that it can be computationally expensive to implement).

  • The final type of straggler problem occurs when one or more of the processors is delayed during the period of time that the computation is being performed in parallel with the result that an initially fair distribution of work results in stragglers because some of the processors aren't able to "concentrate" on their assigned work as much as other processors.

    This type of straggler problem is a distinct possibility if the system is running a conventional operating system like, for example, Unix, Linux or MS Windows. Some of the processors might simply be unlucky in the sense that they happen to be "bothered" by unrelated activity (e.g. network based system load queries) more than other processors. Alternatively, some of the processors might have duties not shared with other processors (e.g. one of the processors might be running a parallel cluster-support daemon which doesn't operate on the other processors). Even in the absence of a "troublesome" or "distracted" operating system, one of the processors may have responsibilities within the application (e.g. distributing the work to other processors and/or gathering results from other processors) which cause it to fall behind other processors.

    Fortunately, most "interruptions" tend to be fairly random and tend to delay each of the processors by roughly the same amount. In the event that this type of straggler becomes a problem, about the only solution is to implement some sort of load leveler as described above (or move to quieter computational nodes).

Cost of the straggler problem

Naturally, it is the cost of the straggler problem for a given application which determines whether the application's developer needs to concern themselves with the straggler problem. For example, the cost of the straggler problem in the case where 100 equal sized units of work are being distributed across 8 processors is almost certainly so small as to be irrelevant. On the other hand, the cost of the straggler problem in the case where 100 units of work are being distributed across 64 processors or 80 processors is another matter entirely as, in these two cases, some of the processors will end up doing twice as much work as the other processors. The net result is likely to be that using anything more than about 50 processors but less than 100 processors is a waste of hardware.

Things can start to get really ugly in the other types of straggler problems. For example, if the developer doesn't realize that the cost of each unit of work varies considerably then they might naively distribute the work evenly across the processors. If one of the processors ends up with a significant amount of work still to be completed when all of the other processors are finished their work then the end result is that the program is essentially running serially until the last processor is finished. It is quite conceivable that the amount of elapsed time (i.e. seconds on your wristwatch) that are used by the straggler is roughly similar to the amount of time that was actually spent doing parallel computation. In such a situation, the computation is being performed much less effectively than the programmer expected.

An example is in order:

Assume that the programmer believes that they've parallelized 99% of an application's work. The application, which runs for 100 seconds on a single processor, is to be run on 40 processors. Unknown to the programmer, there's a straggler that consumes 1 second of CPU time after all the other processors have completed their parallel work (i.e. this straggler effectively adds one second to the execution time of the program).

Amdahl's Law states that the maximum theoretical speed-up of a program which has been 99% parallelized on 40 processors is 1/(0.01+(1-0.01)/40) or 29 times faster than the serial version. Expressed differently, if the program takes 100 seconds to run serially then it will take a minimum of 100/29 or 3.4 seconds to run in parallel. The straggler adds 1 second to the theoretical minimum execution time of 3.4 seconds for a new minimum execution time of 4.4 seconds (i.e. the cost of the straggler is almost 25% of the parallel execution time of the program even though it's only 1% of the program's serial execution time).

Sidebar: Note that the actual execution time is likely to be longer than 4.4 seconds as the straggler problem is unlikely to be the only reason why the program's execution time is worse than the theoretical minimum predicted by Amdahl's Law.
Things get much worse if the program is run on 100 processors - Amdahl's Law predicts a maximum theoretical speedup of 1/(0.01+(1-0.01)/100) or a speed-up of about 50 (i.e. a theoretical minimum execution time of about 2 seconds). Unfortunately, the straggler increases this to 3 seconds for a cost of 50% over the theoretical peek speed of the program.

Conclusions

The potential impact of a straggler on the speed of a parallelized application can be quite significant (the example given above is far from a "worst case scenario"). A thorough understanding of the application and/or careful measurement of how the application performs in parallel is essential if a parallel software developer is to avoid the potential costs associated with stragglers.

N.B. This writeup should not be interpreted to suggest that the straggler problem is the most important reason why parallelized applications fail to perform as well as an inadequate understanding of Amdahl's Law might suggest that they should. For example, the cost of communicating between the parallel processors can and often is a much more important factor.


Sources