Reader alert: this w/u assumes a fairly deep background in operating system performance in general and virtual memory related performance in particular.

A system is suffering from a "convoy effect" if it is experiencing increased system load as a result of the time that it is taking to handle (i.e. act upon) some class of requests. i.e. a "convoy effect" is a second order effect: the delays caused by an existing queueing problem generate additional load, and hence additional delays, of their own.

At least in this context, the term "convoy" derives from the notion that a process waiting in a request queue, like the disk i/o request queue, is somewhat like a truck sitting in a long convoy of trucks. i.e. a process in a request queue is "in a convoy". A system which is suffering because processes are in a convoy is said to be suffering from a "convoy effect".

Just to really drive the point home: a system which is slow because of a long queue of requests isn't necessarily suffering a "convoy effect". A system is suffering a "convoy effect" if the fact that it is experiencing long delays due to a long queue of requests is itself causing other performance related problems (i.e. the delays are causing more delays).

Gather 'round the fire everyone as I'm going to tell you an old war story . . .

I can recall a situation which I encountered on a mainframe computer system in the early 1980s. It was a dark and stormy night and the system was experiencing quite heavy paging and disk i/o loads. A process which took a page-fault or which issued a (non-paging related) disk i/o request would suffer quite a long delay before the page-fault or the disk i/o operation could be completed.

A process waiting for a disk i/o request to complete would often lose memory-resident pages (i.e. suffer page-outs). Once the disk i/o operation completed, the process might then almost immediately have to take a page-fault to bring the just-paged-out-page back in. i.e. the process was suffering extra page-faults because it was taking too long to process the disk i/o operations.

If processes were suffering extra page-faults then the paging subsystem was having to handle extra page-faults and the extra page-outs that corresponded to them. i.e. the heavy disk i/o load on the system was increasing the already fairly heavy paging load of the system.

The system was on the verge of thrashing if not already thrashing and things were definitely getting ugly.

The system was suffering from a classic "convoy effect" as the time that it took processes to complete disk i/o operations was increasing the paging load.

The solution which was implemented (this was on a system for which we had full operating system source code) was to make a very minor change to how disk i/o requests were processed. Instead of maintaining a strict FIFO queue of requests (i.e. always processing the requests in the order in which they were received), new disk i/o requests were randomly inserted at either the front or the back of the disk i/o request queue.
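The change itself is tiny. Here's a minimal Python sketch of the idea (the original fix was, of course, implemented inside the operating system's disk driver, not in Python, and the class and method names here are my own invention for illustration):

```python
import random
from collections import deque

class RandomEndQueue:
    """A request queue that inserts new requests at a random end.

    Illustrative sketch of the kernel change described above:
    instead of a strict FIFO append, each new disk i/o request
    goes to either the front or the back of the queue, chosen
    at random with equal probability.
    """

    def __init__(self, rng=None):
        self._q = deque()
        self._rng = rng or random.Random()

    def submit(self, request):
        # The one-line change: flip a coin. Roughly half of all
        # requests jump to the front and get near-instant service;
        # the rest queue up at the back as before.
        if self._rng.random() < 0.5:
            self._q.appendleft(request)
        else:
            self._q.append(request)

    def next_request(self):
        # The driver still services requests from the front.
        return self._q.popleft()

    def __len__(self):
        return len(self._q)
```

Note that no request is ever dropped; every request is still serviced, just not in arrival order. That's the whole "grossly unfair" trick.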

Now, obviously, this was grossly unfair as some requests would get "instant response" while others would languish on the request queue. "Grossly unfair" though it may have been, the net result of this change was a very noticeable improvement in system performance.

The reason for the improvement in system performance was quite simple: by ensuring that at least some of the disk i/o requests got processed quickly, the number of "unnecessary" page-faults dropped (quite substantially as it turned out). The drop in paging load meant that page-faults could be processed more quickly. It even had a second order impact that improved performance even more . . .

There was a second convoy effect tied into the paging and disk i/o load - processes would often be holding kernel locks when they made a disk i/o request. The delays caused by the long disk i/o request queue, coupled with the additional delays caused by the "unnecessary" page-faults, meant that the average time spent holding these kernel locks increased. This meant that processes waiting for one of these kernel locks would have to wait longer, which meant that they were losing memory-resident pages (i.e. suffering page-outs) which they would often have to recover with extra page-faults once they held the kernel lock. This further increased the average time spent holding these kernel locks, which increased the probability of suffering a page-fault, which . . . (you get the picture).

By randomly placing disk i/o requests at the front or the back of the disk i/o request queue, the paging load on the system dropped. The drop in the paging load meant that the time spent holding these kernel locks while doing disk i/o dropped which, in turn, meant that fewer page-outs were suffered while holding the kernel locks. This not only reduced the load on the paging subsystem but also decreased the time spent waiting for these kernel locks, which further reduced the number of "wasted" page-outs.

One fairly simple kernel change ended up having a pretty noticeable impact on performance. Even better yet, nobody complained about the "grossly unfair" algorithm for handling disk i/o requests (possibly because practically nobody knew about the change and those that did understood that it worked in their favour even when it treated them unfairly).


Notes

  • this w/u is entirely based on personal experience. That said, please don't conclude that I was the "hero" of this story as I was actually just a bystander "on the inside".

  • those of you who can actually remember what you learned in your queuing theory course can probably supply a "more correct" term for what I'm calling the "convoy effect". Please feel free to tell me what the "more correct" term is but please don't hold your breath waiting for this dinosaur to change his vocabulary (grin).

  • one could argue that more conventional mechanisms for dealing with thrashing would have solved the problem. The catch with the "more conventional mechanisms for dealing with thrashing" is that they're generally at least as "grossly unfair" as the solution described above AND they tend to be much more disruptive to the system as a whole.

  • the "war story" in this w/u is a classic example of a situation in which understanding what's going on is the hard part. Actually implementing "the fix" was a pretty trivial exercise.
