loop unrolling - Everything2.com

Loop unrolling is a speed optimization. It falls within the size vs speed trade off that you often encounter in programming. It has its most dramatic effect on very tight loops that are doing something extremely simple thousands of times.

The best way to describe loop unrolling is with an example. Lets say you have an array of integers a million in length, and want to initialize it to 0 to 999,999 (I know its fairly useless, but will illustrate the point)

int *randoms;
int *ptr;
int i;
int len = 1000000;

randoms = (int *)malloc(sizeof(int) * len);
ptr = randoms;

for(i = 0; i < len; i++)
{
    *ptr++ = i;
}

(For those that are confused about the '*ptr++', we are simply walking the pointer.)

Now if we think about what this loop is doing on the assembly level, things become a bit more interesting. Paraphrasing the assembly a bit (note that this is fairly generic and could be significantly different on different CPU architectures due to their specific instruction sets. Also how a specific compiler generates this code would have an effect):

jump past end of loop if(i >= len)
assign i to *ptr
increment ptr
increment i
jump to step 1

Of the 5 steps in the loop, 3 of them are the work that we want to do, and 2 are just the loop doing its thing. That means that approximately 40% of the CPU during this particular loop is just checking bounds and jumping, not doing our work.

Well, how can we get the CPU to do more of our work and less of the loop? We can unroll the loop!

for(i = 0; i < len; )
{
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
    *ptr++ = i++;
}

This puts more work in each iteration of the loop, which causes the jumps and bounds checks that the loop performs to be a smaller percentage of the overall work. Here we have done 10 iterations of our work per loop. Each iteration has 3 steps (assign i to *ptr, increment ptr, and increment i). So our work has 30 steps per loop. The loop still has 2 (bounds check and jump). So, the loop has a total of 32 steps now, but now our work is 93.75% (30/32) for each loop iteration. We are only wasting 6% on loop operations. Not bad huh? Now granted, my numbers are quite generalized and theoretical, but you can see what is going on here.

Things to note about loop unrolling is that my example works because I know the length of my array and that it is evenly divisible by 10. That is not always the case. You may end up with an array length that is not divisible by how many iterations you have hard coded into your loop. You will need to detect this and handle the remainder in a special case to avoid accessing memory beyond the length of the array.

I personally have used this technique to significantly increase speed performance when dealing with graphics and working with bitmaps. I'm sure there are other situations where this could be applied with superb results.

walking the pointer	Duff's device	Schwartzian Transform	Comba multiplication
Dupont Circle	Blowfish	reference counting	compile time
Outbleat	generational trauma	Crown prince	hoop-rolling
blitz++	Unroll	Superb	Plump
cache	Ruby	PIC