I recently was using wrapping counters to implement a circular buffer. While doing this, I found a doc on circular buffers in the Linux kernel which stated:

Calculation of the occupancy or the remaining capacity of an arbitrarily sized circular buffer would normally be a slow operation, requiring the use of a modulus (divide) instruction. However, if the buffer is of a power-of-2 size, then a much quicker bitwise-AND instruction can be used instead.

Although that is talking about capacity, you can also use increment/bitwise-AND instead of increment/mod or increment/compare. That made me wonder how much faster bitwise-AND is than other methods I’ve used.

## Method 1: Increment and Mod

Increment a counter *x*, using mod to set it to *0* when it reaches *n*.

```
x = (x + 1) % n;
```

## Method 2: Increment, Compare, and Reset

Increment a counter *x*, resetting it to *0* when it reaches *n*.

```
x = x + 1;
if (x == n) {
x = 0;
}
```

## Method 3: Increment and bitwise-AND

Increment a counter *x*, mod *n* by using *n-1* as a mask for the bitwise-AND.
This only works if *n* is a power of 2.

```
x = (x + 1) & (n - 1);
```

# Assembly Level Comparison

To compare these methods, I made a test. I built
these tests using clang++ (Apple LLVM version 7.0.2) and g++ (5.3.0). Both compilers gave
similar runtime results with optimization level set to `-O3`

, even though the generated assembly
was a bit different. For example, clang++ used partial loop unrolling while g++ did not and clang++
used the INC instruction while g++ used ADD + 1.

The Intel 64 assembly listing below show the increment/wrap loops generated by g++, which was a bit simpler than clang++ since it didn’t unroll loops.

## Method 1: Increment and Mod

The C++ loop:

```
for(i = end; i != 0; --i) {
j = (j+1) % divisor;
}
```

The generated assembly using ADD and IDIV instructions:

```
100000ac9: xor %eax,%eax
100000acb: nopl 0x0(%rax,%rax,1)
100000ad0: add $0x1,%eax
100000ad3: cltd
100000ad4: idiv %ebx
100000ad6: mov %edx,%r14d
100000ad9: mov %edx,%eax
100000adb: sub $0x1,%ecx
100000ade: jne 100000ad0 <_main+0x50>
```

## Method 2: Increment, Compare, and Reset

The C++ loop:

```
for(i = end; i != 0; --i) {
if (++l == divisor) {
l = 0;
}
}
```

The generated assembly using LEA (load effective address) as a way to increment the counter and then CMP and CMOVE
to implement the `if`

branch without a jump:

```
100000ae5: xor %ebp,%ebp
100000ae7: xor %edx,%edx
100000ae9: mov %rax,%r12
100000aec: mov $0x3b9aca00,%eax
100000af1: data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
100000b00: lea 0x1(%rbp),%esi
100000b03: cmp %esi,%ebx
100000b05: cmove %edx,%esi
100000b08: mov %esi,%ebp
100000b0a: sub $0x1,%eax
100000b0d: jne 100000b00 <_main+0x80>
```

## Method 3: Increment and bitwise-AND

The C++ loop:

```
const int and_divisor = divisor-1;
for(i = end; i != 0; --i) {
k = (k+1) & and_divisor;
}
```

The generated assembly using ADD and AND instructions:

```
100000b14: xor %ebx,%ebx
100000b16: mov $0x3b9aca00,%edx
100000b1b: mov %rax,%r13
100000b1e: xchg %ax,%ax
100000b20: add $0x1,%ebx
100000b23: and %r15d,%ebx
100000b26: sub $0x1,%edx
100000b29: jne 100000b20 <_main+0xa0>
```

# Runtime Comparison

Running the test on my machine produced:

```
./wrapping_counters_test 128
mod : 7659829us
compare: 1476831us
and : 728178us
```

So, Method 3, bitwise-AND is easily the fastest. Using it’s runtime as the baseline index, we can measure the relative performance of the other methods.

Method | Relative Performance |
---|---|

Mod | 10.5 |

Compare/Reset | 2.0 |

Bitwise-AND | 1 |

# Summary

Don’t use mod to implement simple incrementing/wrapping counters. Use bitwise-AND, if you are wrapping around a power of 2, and use compare/reset otherwise.