I recently was using wrapping counters to implement a circular buffer. While doing this, I found a doc on circular buffers in the Linux kernel which stated:
Calculation of the occupancy or the remaining capacity of an arbitrarily sized circular buffer would normally be a slow operation, requiring the use of a modulus (divide) instruction. However, if the buffer is of a power-of-2 size, then a much quicker bitwise-AND instruction can be used instead.
Although that is talking about capacity, you can also use increment/bitwise-AND instead of increment/mod or increment/compare. That made me wonder how much faster bitwise-AND is than other methods I've used.
Increment a counter x, using mod to set it to 0 when it reaches n.
x = (x + 1) % n;
Increment a counter x, resetting it to 0 when it reaches n.
x = x + 1;
if (x == n) {
x = 0;
}
Increment a counter x, mod n by using n-1 as a mask for the bitwise-AND. This only works if n is a power of 2.
x = (x + 1) & (n - 1);
To compare these methods, I made a test. I built
these tests using clang++ (Apple LLVM version 7.0.2) and g++ (5.3.0). Both compilers gave
similar runtime results with optimization level set to -O3
, even though the generated assembly
was a bit different. For example, clang++ used partial loop unrolling while g++ did not and clang++
used the INC instruction while g++ used ADD + 1.
The Intel 64 assembly listing below show the increment/wrap loops generated by g++, which was a bit simpler than clang++ since it didn't unroll loops.
The C++ loop:
for(i = end; i != 0; --i) {
j = (j+1) % divisor;
}
The generated assembly using ADD and IDIV instructions:
100000ac9: xor %eax,%eax
100000acb: nopl 0x0(%rax,%rax,1)
100000ad0: add $0x1,%eax
100000ad3: cltd
100000ad4: idiv %ebx
100000ad6: mov %edx,%r14d
100000ad9: mov %edx,%eax
100000adb: sub $0x1,%ecx
100000ade: jne 100000ad0 <_main+0x50>
The C++ loop:
for(i = end; i != 0; --i) {
if (++l == divisor) {
l = 0;
}
}
The generated assembly using LEA (load effective address) as a way to increment the counter and then CMP and CMOVE
to implement the if
branch without a jump:
100000ae5: xor %ebp,%ebp
100000ae7: xor %edx,%edx
100000ae9: mov %rax,%r12
100000aec: mov $0x3b9aca00,%eax
100000af1: data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
100000b00: lea 0x1(%rbp),%esi
100000b03: cmp %esi,%ebx
100000b05: cmove %edx,%esi
100000b08: mov %esi,%ebp
100000b0a: sub $0x1,%eax
100000b0d: jne 100000b00 <_main+0x80>
The C++ loop:
const int and_divisor = divisor-1;
for(i = end; i != 0; --i) {
k = (k+1) & and_divisor;
}
The generated assembly using ADD and AND instructions:
100000b14: xor %ebx,%ebx
100000b16: mov $0x3b9aca00,%edx
100000b1b: mov %rax,%r13
100000b1e: xchg %ax,%ax
100000b20: add $0x1,%ebx
100000b23: and %r15d,%ebx
100000b26: sub $0x1,%edx
100000b29: jne 100000b20 <_main+0xa0>
Running the test on my machine produced:
./wrapping_counters_test 128
mod : 7659829us
compare: 1476831us
and : 728178us
So, Method 3, bitwise-AND is easily the fastest. Using it's runtime as the baseline index, we can measure the relative performance of the other methods.
Method | Relative Performance |
---|---|
Mod | 10.5 |
Compare/Reset | 2.0 |
Bitwise-AND | 1 |
Don't use mod to implement simple incrementing/wrapping counters. Use bitwise-AND, if you are wrapping around a power of 2, and use compare/reset otherwise.
Last updated April 18, 2016