The algorithm in bis' writeup above can be further optimised as the result of a couple of observations

  1. The size of the sum is proportional to the log2 of the number of bits being counted. This sum is continuously stored in an n-bit space within the 32-bit integer range.
  2. On most architectures simple 8-bit constants are easier to construct than 32- or 16-bit ones.
On this basis, the final stage in bis's code:
i = (i&MASK16) + (i>>16&MASK16);
Can be reduced to:
i = (i + (i>>16))&MASK6;

Where MASK6 is (1<<6)-1 or 0x1F, and eliminates the need for MASK16.

Typically this will save two operations: one to set up the larger constant, and one AND operation.

A small optimsation, but useful if you're relying on your compiler to generate your code, in tight timing constraints, and with no hardware population count instruction.