Let's say b is of some unsigned integer type, and we want to count how many of b's bits are set. This is just plain boring:
  for(n=0; b; b >>= 1)
    if (b & 1) n++;

Or in Pascal (yuck):

  n := 0;
  while b <> 0 do begin
    if odd(b) then n := n + 1;
    b := b div 2
  end;

It's really just converting b to binary, and counting how many 1's it would need for that. Disgusting, really. Here's a nicer way:

  for(n=0; b; n++)
    b &= b-1;

Simple, elegant and concise, this hack is guaranteed to make the uninitiated want to take up gardening.

This hack is extremely useful on job interviews, but you'll need to know how it works, over on the counting 1 bits SPOILER.

Late addition:

A friend showed me a source for this amazing idea. It appears in almost exactly the same form in K&R2 (The C Programming Language, 2nd edition, aka the New Testament). And it's exercise 2-9 (on page 47) of K&R (The C Programming Language, aka the Old Testament).

I once gave the b &= b-1 solution in an interview, but my interviewer wanted a faster solution (which is not nearly as slick):

First, do some preprocessing. Create a lookup table (an array) that maps a number between 0 and 255 (the index) to the number of bits in the binary representation of that number (you can use b &= b-1). Then, for each byte in the unsigned int, use the byte as an index into the table, and sum the results. This runs in constant time, which is especially good if you have lots of bytes to count.

Now, I have nothing against lookup tables, except for the fact that you actually have to do a lookup, which will make use of the computer's cache, or even (*gasp*) main memory. And why do that, when there is a perfectly reasonable, constant-time (more accurately, log(word_length)-time; it's constant relative to the number of 1-bits in the word) method that uses only registers! (Well, on a RISCy CPU that has loads of them lying around, anyhow... Almost nothing runs entirely in registers on CISC (Intel's, anyway), alas.)

Anyway, onward.. the code, in C, of course, for machines with 32-bit unsigneds:

MASK1  = 01010101010101010101010101010101
MASK2  = 00110011001100110011001100110011
MASK4  = 00001111000011110000111100001111
MASK8  = 00000000111111110000000011111111
MASK16 = 00000000000000001111111111111111

unsigned numbits(unsigned i)
{
    const unsigned MASK1  = 0x55555555;
    const unsigned MASK2  = 0x33333333;
    const unsigned MASK4  = 0x0f0f0f0f;
    const unsigned MASK8  = 0x00ff00ff;
    const unsigned MASK16 = 0x0000ffff;

    i = (i&MASK1 ) + (i>>1 &MASK1 );
    i = (i&MASK2 ) + (i>>2 &MASK2 );
    i = (i&MASK4 ) + (i>>4 &MASK4 );
    i = (i&MASK8 ) + (i>>8 &MASK8 );
    i = (i&MASK16) + (i>>16&MASK16);

    return i;
}
Guaranteed to amaze...

How does it work?


It starts off treating the word as 32 1-bit counters (i.e., 32 entities that can each count from 0 to 1). It then pairs off adjacent counters and adds each pair, the result being 16 2-bit counters. (This is OK, because the maximum value in each counter is 1, 1+1=2, and 2 fits comfortably in the range of a 2-bit counter, 0 to 3.) Continuing, it takes these 16 2-bit counters, pairs them again, and again adds each pair, to produce 8 4-bit counters; again we are OK, because each 2-bit counter has a max of 2, 2+2=4, and 4 fits in the range of a 4-bit counter, 0 to 15. We keep pairing counters and adding until there is only a single, 32-bit counter remaining, and it, of course, contains the result.

An example:

(My apologies to those of you using a console-mode browser like Lynx; the first step's columns are wider than 80 characters.. not much, but wider: 82 characters)

Find the number of bits in 10010111011111010101101110101111. (the answer's 22, for those of you keeping score.)

  1. break the word into 32 1-bit counters, pair, and add, to get 2-bit counters
     1 0 0 1 0 1 1 0 0 0 1 1 1 1 1 1 =  1  0  0  1  0  1  1  0  0  0  1  1  1  1  1  1
    +0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 =  0  1  1  1  1  1  1  1  1  1  0  1  0  0  1  1
     1 1 1 2 1 2 2 1 1 1 1 2 1 1 2 2 = 01 01 01 10 01 10 10 01 01 01 01 10 01 01 10 10
  2. break the word into 16 2-bit counters, pair, and add, to get 4-bit counters
     1 1 1 2 1 1 1 2 =   01   01   01   10   01   01   01   10
    +1 2 2 1 1 2 1 2 =   01   10   10   01   01   10   01   10
     ---------------   ---------------------------------------
     2 3 3 3 2 3 2 4 = 0010 0011 0011 0011 0010 0011 0010 0100
  3. break the word into 8 4-bit counters, pair, and add, to get 8-bit counters
     2 3 2 2 =     0010     0011     0010     0010
    +3 3 3 4 =     0011     0011     0011     0100
     -------   -----------------------------------
     5 6 5 6 = 00000101 00000110 00000101 00000110
  4. break the word into 4 8-bit counters, pair, and add, to get 16-bit counters
      5  5 =         00000101         00000101
    + 6  6 =         00000110         00000110
     -----   ---------------------------------
     11 11 = 0000000000001011 0000000000001011
  5. And finally, break the word into 2 16-bit counters, pair, and add, to get the answer, 22, in a single 32-bit counter.
     11 =                 0000000000001011
    +11 =                 0000000000001011
     --   --------------------------------
     22 = 00000000000000000000000000010110

I hope you had as much fun reading this, as I had writing it!

The algorithm in bis' writeup above can be further optimised as the result of a couple of observations:

  1. The number of bits needed to hold the running sum grows only as the log2 of the number of bits being counted, so the final count for a 32-bit word always fits in 6 bits.
  2. On most architectures simple 8-bit constants are easier to construct than 32- or 16-bit ones.
On this basis, the final stage in bis's code:
i = (i&MASK16) + (i>>16&MASK16);
Can be reduced to:
i = (i + (i>>16))&MASK6;

Where MASK6 is (1<<6)-1, or 0x3F (6 bits are enough, since the count can be at most 32), and eliminates the need for MASK16.

Typically this will save two operations: one to set up the larger constant, and one AND operation.

A small optimisation, but useful if you're relying on your compiler to generate your code, under tight timing constraints, and with no hardware population count instruction.

Yet Another Way to count the bits is to use a language for which that is a goal. INTERCAL is such a language (although "goal" might be a bit strong here). With the select operator, which looks like a ~, this operation becomes simple.
The way the operator works is to extract from the first operand just the bits that correspond to high bits in the second operand.

So, if we do this ("#" signifies a constant):

#185 ~ #237

in binary we have this:

10111001 ~ 11101101

So, the second operand will select bits 8,7,6,4,3 and 1 (the 1s). In the first operand those bits are 101101, or 45.

This means that if you select a number against itself, you end up with a result composed of just the ones, packed together at the low end. For our first operand, 10111001 ~ 10111001 gives 11111.

This is 31, or 11111 in binary.

If our result is N, then log2(N+1) tells us the number of 1s. In this case, log2(31+1) = log2(32) = 5.
Of course, solving for N can be pretty darn difficult. Factoring large numbers isn't easy.. yet...

#define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x) ((x) - (((x)>>1)&0x77777777) - (((x)>>2)&0x33333333) - (((x)>>3)&0x11111111))

From the fortune files.
About Riddle 7 from hard interview questions.

This riddle is about the order of an algorithm. As posed it is actually either a joke or a trick question, and a detail is unintentionally wrong. Maybe you can score extra points with some interviewers if you can point out the problems.

The riddle is posed like this:

How do you quickly count the number of set bits in a 32-bit integer in linear time (with respect to the number of set bits)? In constant time?

Now, you can always do it in constant time: since the integer never has more than 32 bits, you need at most 32 operations, and all the different answers above just change the number of operations, not the fact that it is always constant time. Actually, every algorithm that finishes does so in constant time if you keep the problem small (< 32 bits!). The order O of an algorithm only makes sense when the size of the problem (usually given by integers named n and m) does vary. For instance, look at Compute Fibonacci numbers FAST!.

Another example (this is a bit inaccurate): breaking a 128-bit RSA key takes constant time, maybe a month, but approximately every 4th bit you add to the key doubles the time needed to break it. This makes the order roughly the fourth root of the size of the key space (whose logarithm is the key length), as you need only twice the effort for 16 times as many keys.

This means the riddle at best can pass as a trick question, but not as it was intended. To fix this, suppose the riddle had been posed like this:

How do you quickly count the number of set bits in an N-bit integer in linear time (with respect to the number of set bits)? In constant time?

This is better, but the answers given above still do not achieve complexity linear in the number of set bits, as the algorithm doesn't finish any faster if only one bit is set (e.g. the highest bit).

So, the riddle should read:

How do you quickly count the number of set bits in an N-bit integer in linear time? Can you give a fast implementation?

This makes sense.

You can speed up the process to apparently linear in the number of bits set only if you have a CPU that can check lots of bits for zero-ness practically at once (an FPGA, for instance). The problem is as difficult as adding up all N bits, which gives an addition tree with a number of nodes (each node stands for an addition) equal to the number of additions needed. So you can solve it in logarithmic time (log N) by using N logic gates, or in linear time (N) using a CPU which does the adding. This means you trade software loops for hardware gates.

Using a lookup table is very smart, but does not reduce the complexity once N gets greater than the lookup table size. If you follow the excellent suggestion of bis (which is to start with a field size of one bit, then repeatedly mask every 2nd field with a bitmask, shift, and add to the adjacent field all in one CPU operation, doubling the field width each time until the field is as wide as the number), you reduce the number of CPU adding operations, but increase the number of copying and shifting operations. This gives a speed increase, since you have the hardware to handle several bits quickly. However, in theory, the complexity is unchanged, since eventually you will run out of hardware that can do masking/shifting/adding operations in parallel. Still, at least bis' suggestion makes full use of the CPU, because in every CPU adding step every bit is significant and used.

You can optimise bis's solution even further. The first two lines are the same:

i = (i & 0x55555555) + (i>>1 & 0x55555555);
i = (i & 0x33333333) + (i>>2 & 0x33333333);
Now at this point we have 8 4-bit counters and wish to add them in pairs to get 4 8-bit counters. The maximum number in each counter will be 8, because it's the sum of 8 bits. The trick is that the number 8 fits into 4 bits, which means we can do the mask operation after the add. This is the same as call's optimization, except we're doing it earlier.
i = (i + (i >> 4)) & 0x0F0F0F0F;
We now have our 4 8-bit counters. Here's the clever part: We can add these together in a single operation.
i = i * 0x01010101;
How does this work? Imagine the multiplication done by hand, writing the four bytes of i as the digits a, b, c, d (a the most significant):

                a  b  c  d
    x          01 01 01 01
    ----------------------
                a  b  c  d
             a  b  c  d
          a  b  c  d
       a  b  c  d
    ----------------------
     .  .  .   A  .  .  .      (A = a+b+c+d)

Notice that the A is the sum of the four digits; the columns to its left overflow the 32-bit word and are dropped. Actually, I lied a bit, we still have to shift.
i = i >> 24;
the whole thing:
i = (i & 0x55555555) + (i>>1 & 0x55555555);
i = (i & 0x33333333) + (i>>2 & 0x33333333);
i = (i + (i >> 4)) & 0x0F0F0F0F;
i = (i * 0x01010101) >> 24;
64-bit version:
i = (i & 0x5555555555555555) + (i>>1 & 0x5555555555555555);
i = (i & 0x3333333333333333) + (i>>2 & 0x3333333333333333);
i = (i & 0x0F0F0F0F0F0F0F0F) + (i>>4 & 0x0F0F0F0F0F0F0F0F);
i = (i * 0x0101010101010101) >> 56;
I didn't invent this - search for bit population count on Google.

I was confused by Neodymion's explanation until I realized that the multiplication relies on the C language (and I presume others) ignoring overflow on unsigned multiplication. That is, everything to the left of the 'a' doesn't fit in a word and, in effect, is simply dropped. Right shifting gets rid of everything to the right of the 'a', so what you're left with is simply 'a'.

I proved it to myself using algebra, and to make it easy, tried it just on a 16 bit word-size, which means we multiply by 0x0101 = 2^8 + 1 (where '^' means 'to the power of'), and we shift by 8 to the right, and we have only two eight-bit counters, call them 'a' and 'b'. So the word we are counting is represented as an integer by 2^8*a + b, so multiplying we get:

(2^8*a + b) * (2^8 + 1) =
    2^8*2^8*a + 2^8*a + 2^8*b + b =
    2^16*a + 2^8*(a+b) + b
However 2^16*a doesn't fit in a 16-bit word so it gets dropped, so we are left with
2^8*(a+b) + b.
We then shift right by 8 bits, which gets rid of the lone 'b' and divides by 2^8, and that leaves the result as a+b

Other places on the web say that some compilers will implement this multiply by shifting and adding, in which case it may be faster to not use the multiply, at least on 32-bit words.
