(I feel I need to write a rebuttal to Rollo's node here)

Branch prediction is hardly based on a "lucky guess" -- most programs spend the bulk of their time in loops that iterate many times. A loop's backward branch will usually show a strong bias toward taken or not-taken, depending on the loop's sense, which means that even a simple 2-bit saturating counter captures most of the branch's behavior and yields very high prediction rates (typically over 90%).
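
To make that concrete, here's a quick toy simulation of a 2-bit saturating counter predictor (a standalone sketch, not any particular CPU's design) replaying the back edge of a loop that runs 100 iterations and is re-entered 10 times:

/* Minimal sketch of a 2-bit saturating counter branch predictor.
 * All parameters (trip count, number of loop entries) are made up
 * for illustration. */
#include <stdio.h>

int main(void) {
    int counter = 2;            /* 0-1 = predict not-taken, 2-3 = predict taken */
    int correct = 0, total = 0;
    const int N = 100;          /* assumed loop trip count per entry */
    const int ENTRIES = 10;     /* assumed number of times the loop is re-entered */

    for (int e = 0; e < ENTRIES; e++) {
        for (int i = 0; i < N; i++) {
            int taken = (i < N - 1);            /* back edge taken except on loop exit */
            int predict_taken = (counter >= 2);
            if (predict_taken == taken) correct++;
            total++;
            /* saturating update toward the actual outcome */
            if (taken && counter < 3) counter++;
            if (!taken && counter > 0) counter--;
        }
    }
    printf("accuracy: %.1f%%\n", 100.0 * correct / total);
    return 0;
}

On this pattern the counter mispredicts only the final, not-taken iteration of each loop entry, so it scores 99% -- which is the kind of bias-capturing behavior real loop branches exhibit.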

Branch prediction in any modern CPU yields extraordinary performance benefits. As pipelines get deeper, the misprediction penalty grows, because more speculative work has to be flushed and refetched. Furthermore, a modern out-of-order issue CPU needs a steady supply of instructions to dispatch to its functional units -- if the scheduler were allowed to stall at every branch, no instruction-level parallelism could ever be exploited. In newer processors, branch prediction is being folded directly into the caches -- that's essentially what a trace cache is: it stores instructions along the predicted path of execution.
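
To put rough numbers on the pipeline-depth point (every figure here is an assumption for illustration: 20% of instructions are branches, base CPI of 1, and the flush cost roughly tracks the number of stages in front of branch resolution):

/* Back-of-the-envelope sketch: effective CPI = base CPI +
 * (branch fraction) * (miss rate) * (misprediction penalty).
 * All numbers are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    const double branch_frac = 0.20;          /* assumed fraction of branches */
    const double base_cpi = 1.0;
    const int penalties[] = {5, 10, 20};      /* assumed flush costs, in cycles */
    const double accuracies[] = {0.90, 0.99}; /* predictor hit rates */

    for (int p = 0; p < 3; p++) {
        for (int a = 0; a < 2; a++) {
            double miss = 1.0 - accuracies[a];
            double cpi = base_cpi + branch_frac * miss * penalties[p];
            printf("penalty %2d cycles, accuracy %.0f%% -> effective CPI %.3f\n",
                   penalties[p], accuracies[a] * 100.0, cpi);
        }
    }
    return 0;
}

Under these assumed numbers, a 90% predictor in front of a 20-cycle flush costs about 0.4 extra cycles per instruction, while at 99% accuracy the loss nearly disappears -- which is exactly why deeper-pipelined designs lean so hard on accurate prediction.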

Finally, as far as multitasking and branch prediction go: the cycles "risked" on speculative work that turns out to be useless would otherwise simply be wasted, unless you are talking about cooperative multitasking -- under preemptive multitasking it's pretty rare for a thread of execution to voluntarily yield back to the scheduler. It makes no sense to claim that branch prediction is only helpful for single-process systems, since it was originally deployed on the large timesharing machines of the 1960s (IBM's System/360, to be specific).

(It sure feels good to flex some almost-atrophied computer architecture mental muscles...)