In mathematics, optimisation is the study of how to find the highest or lowest value of a function, typically one defined over the real numbers.

In practice, the function is constructed as a measure of "goodness" for some procedure, and we want to find out the best way to perform that procedure. In this case we are performing global optimisation, which is much harder than local optimisation. Local optimisation is the procedure of finding a local optimum, which can easily be accomplished iteratively by a greedy algorithm. (The construction of such an algorithm is left as an exercise).

Optimisation can be performed analytically or numerically, and numerical methods are often stochastic. A simple analytic optimisation of a polynomial can be carried out by first finding the roots of its first derivative: these are the critical points x_i, which include the turning points. One can then evaluate the polynomial at each of these points and compare the values to find the largest. (If the polynomial is unbounded above, as any odd-degree polynomial is, this yields only the highest turning point rather than a global maximum.)
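
For example, here is a minimal C sketch of that procedure for a cubic (the particular coefficients are made up for illustration): it solves f'(x) = 3ax^2 + 2bx + c = 0 with the quadratic formula and compares the value of f at the two critical points.

    #include <math.h>
    #include <stdio.h>

    /* f(x) = a*x^3 + b*x^2 + c*x + d, evaluated by Horner's rule. */
    static double f(double a, double b, double c, double d, double x)
    {
        return ((a * x + b) * x + c) * x + d;
    }

    int main(void)
    {
        double a = 1, b = 0, c = -3, d = 0;   /* f(x) = x^3 - 3x, for example */
        double A = 3 * a, B = 2 * b, C = c;   /* coefficients of f'(x) */
        double disc = B * B - 4 * A * C;

        if (disc < 0) {
            printf("no turning points: f is monotonic\n");
        } else {
            double x1 = (-B + sqrt(disc)) / (2 * A);
            double x2 = (-B - sqrt(disc)) / (2 * A);
            double best = f(a, b, c, d, x1) > f(a, b, c, d, x2) ? x1 : x2;
            printf("highest turning point: f(%g) = %g\n", best, f(a, b, c, d, best));
        }
        return 0;   /* for x^3 - 3x this prints f(-1) = 2 */
    }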

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

"The first principle of optimization is don't." -- Brian Kernighan/Rob Pike, The Practice of Programming

When talking about computer programming, optimization refers to improving a program. This is actually a misnomer, because generating an optimal solution is far too complex a problem. Instead, we have to settle for a solution that is better than the old one. In terms of code generation, better means code that executes faster, although there are cases where code that occupies less space is more valuable (e.g. embedded platforms with limited memory available).

Optimization occurs at two levels. The programmer can manually optimize a program, or a compiler can automatically optimize it. As Knuth, Kernighan, and Pike suggest above, though, manual optimization is best left until a performance problem actually occurs; after all, there is no sense in spending days making your program run seconds faster if nobody is waiting for it. Making the code go faster also has a tendency to make it more complicated, so premature optimization is likely to leave you with buggier, less maintainable code.

When it is necessary to make a program faster, the best results usually come from the programmer finding the key performance bottlenecks and fixing those areas. The programmer should first locate those bottlenecks: with simple programs it is often quite obvious where to start, while with larger programs a profiler may need to be used. Once the key areas are identified, here are some techniques programmers can use:

  • Faster algorithms The classic examples of this are for sorting and searching. Replacing a bubble sort, or any O(n^2) sort for that matter, with a speedy quicksort, which runs in O(n log n) time on average, or a specialized radix sort will make a big difference if many items are being sorted. Likewise, binary searches, O(log n), leave linear searches, O(n), in the dust.
  • Lookup tables This technique precalculates time-consuming values and stores them ahead of time in a table that can be consulted later. An example: if the sine and cosine of angles are calculated repeatedly in a loop, but the angles are always rounded to the nearest degree, you could precalculate the sine and cosine of every whole degree up to 360. A variation of this technique skips the precalculation and instead caches values as they are first used. (A sketch of the precalculated version appears after this list.)
  • Better data structures This is related to faster algorithms, as data structures and the algorithms used with them are often two sides of the same coin. Examples here are replacing lists or arrays with hash tables or trees to enable faster lookups.
  • Specialized instructions Intel introduced the MMX instructions to speed up cases where the same arithmetic operation is being performed over and over on large quantities of data. A programmer can often speed up certain types of operations by simply using the specialized instructions. Unfortunately, compilers are not very good at detecting the situations where these instructions would be useful, so it is up to the programmer to figure it out.
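
As an illustration of the lookup-table technique described above, here is a minimal C sketch; the one-degree granularity, the table size, and the function names are assumptions made for the example:

    #include <math.h>

    #define PI      3.14159265358979323846
    #define DEGREES 360

    static double sine_table[DEGREES];

    /* Done once, up front: precalculate sin() for every whole degree. */
    void init_sine_table(void)
    {
        int deg;
        for (deg = 0; deg < DEGREES; deg++)
            sine_table[deg] = sin(deg * PI / 180.0);
    }

    /* Later, in the hot loop, a cheap table lookup replaces the sin() call. */
    double fast_sin_deg(int deg)
    {
        return sine_table[((deg % DEGREES) + DEGREES) % DEGREES];
    }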

While it might be a waste of time to hand-optimize the code for smaller gains, compilers can automatically do some simple optimizations. This is usually worthwhile, since the main trade-off is typically a slower compile. Here are some examples of common compiler optimizations:

  • Constant folding This is one of the simplest forms of optimization. It finds expressions that can be calculated at compile time, usually because all the operands are themselves constant, and puts the constant result into the program rather than the code to generate it. An example:
    int len = strlen("some constant string");
    In this case, the compiler knows the string is 20 characters long and it knows what strlen does, so it can just assign 20 to len without calling strlen. Of course, it can only do this for functions that have no side effects and that always return the same value at compile time and at runtime.
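    After folding, the generated code behaves as if the source had simply read:
    int len = 20;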
  • Instruction ordering Modern processors can execute more than one instruction per clock cycle, but usually only if the instructions use different parts of the CPU. For example, a multiplication might be able to execute at the same time as an address calculation, but two multiplications would have to be done one after the other. Code can be optimized by rearranging instructions so that neighbouring instructions use different parts of the CPU. Of course, instructions that depend on each other's results can only be rearranged so much.

    Another aspect of instruction ordering is that of branch timings. Most CPUs execute code faster if a branch is not taken. Thus, it makes sense to make the most common case fall through the branch, and the exceptional cases take the branch. Some more modern compilers can take runtime profile data as input, and use this to best order the branches. In particular, just-in-time (JIT) compilers like Sun's Hotspot are ideally suited to this, as they have complete access to runtime profile data.
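
    A programmer can pass the same kind of branch hint by hand on GCC-compatible compilers. The sketch below is a made-up example using the real __builtin_expect() builtin, which tells the compiler which way a branch usually goes so that the common case can fall straight through:
    /* Hypothetical routine: the error check is predicted not taken. */
    int copy_block(const char *src, char *dst, int n)
    {
        int i;
        if (__builtin_expect(src == 0 || dst == 0, 0))
            return -1;              /* rare error path, placed out of the way */
        for (i = 0; i < n; i++)     /* common path falls through the branch   */
            dst[i] = src[i];
        return 0;
    }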
  • Register Allocation A CPU will generally have a small amount of very fast internal memory called the register set. Most operations are faster when performed on registers than when performed directly on memory, and there are usually certain operations that can only be performed on registers. Loading important variables into registers can have a huge effect on performance, but since register space is very limited, trade-offs are necessary. This is especially true on CPU architectures with small register sets, such as the Intel x86.
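    A hand-written illustration of the same idea (the names are invented for this sketch): accumulating into a local variable gives the compiler a better chance of keeping the value in a register, whereas a global that might alias other memory may be reloaded and stored on every iteration.
    int total;

    /* Before: total is read from and written to memory inside the loop. */
    void sum_before(const int *data, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            total += data[i];
    }

    /* After: sum almost certainly lives in a register; memory is touched once. */
    void sum_after(const int *data, int n)
    {
        int i, sum = 0;
        for (i = 0; i < n; i++)
            sum += data[i];
        total += sum;
    }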
  • Aligning data Another quirk of most CPUs is that they access aligned data faster. For example, a variable with an even address might be accessed faster than one with an odd address. Most often, it is related to the word size of the processor: if the CPU has a 32-bit data bus, it can load a whole 32-bit variable at once only if the address is a multiple of four. Some CPUs even require aligned access and throw exceptions if asked to look at unaligned data. Since few languages allow explicit alignment of data, the compiler usually has complete control over the data alignment.
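    For example, on a typical 32-bit platform a compiler will quietly pad this structure so that the int member starts on a 4-byte boundary (the exact layout is platform-dependent; this is only a sketch):
    #include <stdio.h>

    struct record {
        char flag;      /* 1 byte                               */
                        /* usually 3 bytes of invisible padding */
        int  value;     /* 4 bytes, now aligned                 */
    };

    int main(void)
    {
        printf("sizeof(struct record) = %zu\n", sizeof(struct record));
        return 0;       /* commonly prints 8 rather than 5 */
    }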
  • Strength reduction This is the process of taking a complicated operation and replacing it with a less complicated one. A common example is replacing a multiplication with an addition: most computers can do an addition faster than a multiplication, so addition is said to be the less strong operation. The strength reduction compilers can do automatically is usually pretty simple. The following example shows a strength reduction where a multiplication in a loop is replaced by an addition, by introducing a temporary variable to store intermediate results:
    unoptimized version:
    x = someval();
    for (i = 0; i < 100; i++) {
        foo[i] = i * x;
    }
    
    optimized version:
    x = someval();
    tmp = 0;
    for (i = 0; i < 100; i++) {
        foo[i] = tmp;
        tmp += x;
    }
    
  • Loop unrolling The typical way to execute a statement a given number of times is to enclose it in a for loop rather than simply repeating the statement. While this may be more understandable, flexible, and readable, it has a few disadvantages, too. On each iteration, a counter variable must be accessed and tested against the termination condition. In addition, the computer must perform a branch: when it reaches the bottom of the loop, it has to stop running instructions in a nice sequence and go back to the top of the loop. Branches are one of the worst offenders for slowing down CPUs: on older processors, the instructions that had been pre-fetched from memory had to be thrown away, while newer processors suffer stalled pipelines and cache misses.

    Loop unrolling simply gets rid of the loop construct and repeats the statement. Compilers are pretty good at unrolling loops. Here is an example where the loop is partially unrolled; every fourth iteration, it must check timeToExit(). The unoptimized version computes (i % 4) and branches once on every pass through the loop. The unrolled version avoids the (i % 4) test altogether and branches back only once for every four original iterations.
    unoptimized version:
    for (i = 0; i < 1000; i++) {
        if ((i % 4) == 0 && timeToExit())
            break;
        doSomething();
    }
    
    optimized version:
    for (i = 0; i < 1000; i+=4) {
        if (timeToExit())
            break;
        doSomething();
        doSomething();
        doSomething();
        doSomething();
    }
    
  • Inlining functions Just as there is overhead with loops that can be optimized away by unrolling, functions have overhead that can be optimized away by inlining. Function inlining is where the call to the function is replaced by the body of the function itself; it saves the cost of setting up the stack frame, the function call itself, the destruction of the stack frame, and the return from the function. It can also enable further optimization, such as constant folding and better instruction ordering.

    Inlining is usually very much a speed versus size trade-off, because each time a function is inlined another copy of its body is added to the final program. Exceptions are very small functions whose bodies take up less space than the calling code, and functions that are called only once, so that the inlined copy is the only one needed.
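
    A minimal C sketch of the idea (the function names are invented for the example); with static inline, a C99 compiler is free to substitute the body at each call site:
    static inline int square(int x)
    {
        return x * x;               /* tiny body: inlining costs almost nothing */
    }

    int sum_of_squares(const int *v, int n)
    {
        int i, total = 0;
        for (i = 0; i < n; i++)
            total += square(v[i]);  /* no call overhead once inlined */
        return total;
    }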

There's an example from back in the '80s that probably still serves as a good engineering reference for people working on hardware/software driver issues.

In those days of yore (only in the computer industry can one refer to something 20 years ago as "yore"...) there was the Commodore 64. It retains its place as a pioneering home computer in that it offered very good (for the time) graphics and sound capability in an inexpensive unit. But then came its bastard son...

The 1541 floppy disk drive. It became the storage option for home users once they became infuriated enough with the capabilities of cassette-tape backup to pony up for storage on a real medium. Unfortunately, the 1541 was slow. Unbelievably slow. Slow enough to think, just maybe, there were little dwarven people in your serial interface cable running your bits back and forth by hand.

Now, a unique attribute of the 1541 drive was that it had its own 6502 processor and firmware. Plausibly, having in effect a "disk-drive coprocessor" would accelerate your data transfer. It did not. Not remotely. Running through a disassembly of the 6502 firmware revealed endless, meandering code to provide what would appear, on the surface, to be a pretty straightforward piece of functionality: send data bits over the data pin and handshake them over the handshake signal pin.

As the market forces of installed base and demand for faster speed imposed themselves, solutions to the 1541 speed problem were found by third-party companies. Software was released which performed such functions as loading from disk and backing up floppies at speeds many, many times faster than the 1541's base hardware and firmware could offer.

The top of this particular speed-enhancement heap was a nice strategy involving utilizing both the Commodore 64's and the 1541's processors, and the serial connection, optimally. Literally optimally. Assembly routines were written to run on the 64 side and the 1541 side to exactly synchronize the sending and receiving of bits on a clock-cycle by clock-cycle basis. Taking advantage of the fact that both 6502s were running at 1 MHz, the 1541's code would start blasting the data across the serial line to the corresponding 64 code, which would pull it off the serial bus within a 3-clock-cycle window (you could not write the two routines to be any more in sync than a couple of 6502 instructions). This method used no handshaking whatsoever for large blocks of data being sent from the drive to the computer, and so, in an added speed coup, the handshaking line was also used for data, doubling the effective speed.

The 1541 still seems pertinent as an example of a computer function that one would probably think would best be done primarily on a software level (running on the Commodore 64), but was engineered instead to utilize a more-hardware approach (on the 1541), only to be rescued by better software to utilize the hardware (on both).

There are probably still a few design lessons to be taken from the "ancient" 1541, for both the hardware and the software guys.

(Adapted for e2 from a Slashdot posting of mine)
