Limited-range MAX without branches

An alternative to the traditional definition of MAX, in C or C++, which avoids the use of a branch instruction, potentially useful in eg. DSP applications where performance is important:

 
#define MYMAX(a,b) ( (a) - ( (((a)-(b))>>(8*sizeof(a)-1)) \
  & ((a)-(b))) )

This generally compiles, with a little CSE, to something similar to:

sub t1, a, b
shr t2, t1, #31
and t2, t2, t1
sub result, a, t2

Restrictions

The major caveat with this is that the range of inputs must be limited to ensure that the difference between a and b doesn't overflow the arithmetic range. Also it's slower if your target architecture has conditional move or conditional execution that your compiler can exploit effectively.

Also, this only works for fixed point or integer data types. However, any performance increase a similar method might yield for floating point numbers would be far smaller, since the saving of a branch instruction is less significant in comparison with the cost of floating point intructions than is in comparison with the simple integer instructions.

Explanation and proof

This method first computes the difference between the two numbers, and creates a bitmask by extracting the sign bit of the difference. The mask will be either all bits clear (a-b is positive <=> a>=b) or all bits set (a-b is negative <=> a<b). Using this to predicate the difference gives a-b when a < b, 0 when a >= b.

a>=b => max == a && a < b => max == b (defines max)
a>=b => max == a +0 && a < b => max == a +(b-a)
a>=b => max == a -0 && a < b => max == a -(a-b)
a>=b => bitmask == 0 && a < b => bitmask == (a-b)
max == a-bitmask

Performance Profile

This algorithm is not universally faster than the traditional way of calculating MAX; it will vary depending on your target processor and your data sets. Since the main cost of a branch instruction in modern processors is in branch misprediction penalty, the conventional MAX algorithm performs poorly when the first and second arguments are not evenly balanced: where P(a > b) is close to neither 0 nor 1.

Where it works best is on dumb architectures with long branch latency, eg. AMD K6, and unrolled loops working with independent data sets: there's no parallelism in the calculation of an individual MYMAX, but by removing the need for a branch, a compiler's scheduler can have a better crack at finding enough parallelism to keep a superscalar busy. Pentium 4's large branch mispredict penalty would make this a big performance win, but the crippled barrel shifter structure of the P4 might make this more expensive. YMMV.

Another advantage of this algorithm is that the performance is entirely independent of the data set. As an example, in an audio signal processing application, a sample value may be saturated to some maximum value using either algorithm:

 
short samples[N];
int i;
for (i=0; i<N; i++)
  samples[i] = MAX(samples[i], MAX_OUTPUT);

In general, this may be quicker than using MYMAX, however when the data set in samples is 'noisy' and saturation occurs frequently, the execution time will be longer than the 'clean' signal case. Since DSP applications are inherently real-time systems, the constant execution time of MYMAX is desirable to ensure that processing time requirements can always be met, even though in the general case it may be slower for most data sets.

Performance Analysis

To profile the performance of the two algorithms over a random data set, I compiled and timed the following:

 
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N_INTS 1024

#define ITERATIONS 128

#define MAX(a,b) ((a)>(b)?(a):(b))

#define MYMAX(a,b) ( (a) - ( (((a)-(b))>>(8*sizeof(a)-1)) \
           & ((a)-(b))) )

int data[N_INTS];
int max_data[N_INTS-1];


int main (int argc, char *argv[]) {
  int i, j, k;

  assert(argc==2);      /* be user friendly */
  srand(0);             /* ensure the same sequence each time */
  for (i=0; i<N_INTS; i++)
    data[i]=rand()/2;   /* random data within range */

  /* Define a looping construct and re-use for brevity. */
  #define LOOP for (k=0; k<ITERATIONS; k++) \
    for (j=0; j<N_INTS; j++) for (i=0; i<N_INTS-1; i++)

  switch (argv[1][0]) {
  case 'm':             /* MYMAX */
      LOOP max_data[i] = MYMAX(data[i], data[i+1]);
      break;
  case 't':             /* Traditional max */
      LOOP max_data[i] = MAX(data[i], data[i+1]);
      break;
  case 'r':             /* Addition for reference */
      LOOP max_data[i] = data[i] + data[i+1];
      break;
  }
  return EXIT_SUCCESS;
}

This was timed on the following platforms:

Pentium Pro 166MHz, Linux 2.2
UltraSPARC 450Mhz, Solaris 6
Pentium 4, 1.7GHz, Linux 2.4
PowerPC G3 (7500), 600MHz, Mac OS X

Just for fairness, different levels of optimisation were tried too. All execution times in seconds.

 
arch/opt       MAX   (-ref)   MYMAX (-ref)   reference
P6 -O4          8.45 (4.64)   5.33  (1.52)   3.81 (0.00)
P6 -O           8.79 (5.70)   4.58  (1.49)   3.09 (0.00)    
P6             20.22 (8.62)  33.93 (22.31)  11.62 (0.00)
Ultra -O4       3.5  (2.0)    2.4   (0.9)    1.5  (0.0)
Ultra -O        3.8  (2.0)    2.7   (0.9)    1.8  (0.0)
Ultra          10.7  (1.6)   15.1   (6.0)    9.1  (0.0)
P7 -O4          1.44 (1.09)   0.50  (0.15)   0.35 (0.00)
P7 -O           1.53 (1.21)   0.53  (0.21)   0.32 (0.00)
P7              2.79 (0.87)   5.43  (3.51)   1.92 (0.00)
PPC -O4         2.13 (1.26)   1.78  (0.91)   0.87 (0.00)
PPC -O          2.36 (1.02)   2.01  (0.67)   1.34 (0.00)
PPC             6.67 (1.50)  10.43  (5.26)   5.17 (0.00)

On P6, with all optimisations enabled, execution time is only 63% of the traditional MAX. Factoring out the loop overhead etc (neglecting the single addition of the reference loop) gives timing of 1.52s vs 6.46s, MYMAX using only 24% of the CPU time. Similar reasoning for UltraSPARC gives 45% of the execution time (note: large error margin due to low-resolution timer on Solaris) and Pentium 4 gives 14% of the execution time.

Significant performance gains in this case. Pentium 4's massive saving in execution time can be attributed to the extremely large branch misprediction penalty associated with the extremely long pipeline of the P7 microarchitecture.

Note that without any optimisations enabled, MYMAX yields significantly poorer results than the conventional algorithm on all architectures.

Thanks have to go to C-Dawg for pointing out an error in the original writeup and reviving my original interest enough to put some more info in.

Proof by handwaving	CSE	Barrel shifter	Multiple Sleep Latency Test
Pentium 4	Smallest number greater than 0	UltraSparc	MAX
Predicate	DSP	SIMD	branch
overflow	Pentium Pro	Linux	conditional execution
Pioneer Courthouse Square	sizeof	galoshes	Doomsday Cult
TriMet	caveat	K6	QED