Strength Reduction in Critical Paths

2025-08-10 • ~4 min read

Strength reduction swaps expensive ops for cheap ones. Division becomes multiplication. Modulo becomes bitwise AND. Your compiler does this automatically—when it can.

The Basics

Check whether a number is odd: x % 2 becomes x & 1. One single-cycle AND instead of a divide that can burn tens of cycles. Free performance.

Divide by 8: x / 8 becomes x >> 3, at least for unsigned or known non-negative values (signed division rounds toward zero, so the compiler adds a small fixup). Bit shift vs. the division unit. No contest.
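
In code, assuming unsigned operands (helper names are mine; for signed values the compiler inserts a small correction, and x % 2 == 1 misbehaves for negative x):

#include <stdint.h>

// Equivalent to x % 2 and x / 8 for unsigned x:
// a single AND and a single shift, no divide unit involved.
static inline uint32_t odd_bit(uint32_t x) { return x & 1;  }
static inline uint32_t div8(uint32_t x)    { return x >> 3; }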

Constant vs Variable Divisors

Compiler sees x % 1245? Optimizes to multiply-shift sequence. Sees x % divisor? Can't help you—emits idiv.

// This gets optimized: % 16 == 0 becomes a test of the low four bits
for (int i = 0; i < n; i++) {
    if (data[i] % 16 == 0) process(data[i]);
}

// This doesn't: the divisor is unknown at compile time,
// so every iteration pays for a real idiv
void func(int divisor) {
    for (int i = 0; i < n; i++) {
        if (data[i] % divisor == 0) process(data[i]);
    }
}

Real Numbers

Benchmark: 1M modulo operations
Constant divisor (1245): 4.6μs
Variable divisor (1245): 36μs
Speedup: 7.8x

Same input, same divisor value. Only difference: compiler knowledge at build time.
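
For reference, a sketch of the kind of pair such a benchmark compares (function names are mine, not the harness behind the numbers above). The variable divisor has to reach the function at runtime, otherwise the optimizer just propagates the constant:

#include <stdint.h>

// Constant divisor: the compiler turns % 1245 into a multiply-shift sequence.
uint64_t sum_mod_const(const uint32_t *a, int n) {
    uint64_t s = 0;
    for (int i = 0; i < n; i++) s += a[i] % 1245;
    return s;
}

// Variable divisor: d arrives at runtime (e.g. read from input),
// so the loop pays for a real div on every element.
uint64_t sum_mod_var(const uint32_t *a, int n, uint32_t d) {
    uint64_t s = 0;
    for (int i = 0; i < n; i++) s += a[i] % d;
    return s;
}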

Manual Strength Reduction

Sometimes you know better than the compiler:

// Instead of x % (1 << k)  (x unsigned or known non-negative)
result = x & ((1 << k) - 1);

// Instead of x / (1 << k)
result = x >> k;

// Replace expensive modulo in loops
// when you know the pattern
int mask = size - 1;  // size must be a power of 2
for (int i = 0; i < n; i++) {
    buffer[i & mask] = data[i];  // same as i % size
}

Division by Multiplication

Classic trick: replace x / d with (x * magic_number) >> shift. The compiler does this for constant divisors. You can do it yourself for a divisor that's only known at runtime but reused many times:

#include <stdint.h>

// Precompute a magic constant for fast unsigned division.
// Variant with a fixed shift of 64: multiplier = ceil(2^64 / d),
// then x / d == (x * multiplier) >> 64 exactly for every 32-bit x.
typedef struct {
    uint64_t multiplier;
} FastDiv;

FastDiv make_fast_div(uint32_t d) {
    // Assumes d >= 2 (d == 1 would overflow the multiplier)
    FastDiv fd = { UINT64_MAX / d + 1 };  // == ceil(2^64 / d)
    return fd;
}

uint32_t fast_divide(uint32_t x, FastDiv fd) {
    // Needs a 128-bit multiply (GCC/Clang __int128 extension)
    return (uint32_t)(((unsigned __int128)x * fd.multiplier) >> 64);
}
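
A usage sketch, reworking the variable-divisor loop from earlier under the assumption that the values are unsigned:

// Hoist the precompute out of the hot loop, then rebuild the remainder
FastDiv fd = make_fast_div(divisor);        // once per call
for (int i = 0; i < n; i++) {
    uint32_t q = fast_divide(data[i], fd);  // data[i] / divisor
    if (data[i] - q * divisor == 0)         // data[i] % divisor == 0
        process(data[i]);
}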

Branch-Free Conditionals

Replace branches with arithmetic when possible:

// Instead of
if (x < 0) x = -x;

// Use (assumes arithmetic right shift of signed ints, which
// mainstream compilers provide; x == INT_MIN overflows either way)
int mask = x >> 31;  // -1 if negative, 0 otherwise
x = (x ^ mask) - mask;
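
The same sign-mask idea handles other conditionals. A sketch of a branch-free minimum (my example, with the same arithmetic-shift assumption):

int branchless_min(int a, int b) {
    int diff = a - b;          // careful: may overflow if a and b are far apart
    int mask = diff >> 31;     // -1 if a < b, 0 otherwise
    return b + (diff & mask);  // a when a < b, else b
}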

SIMD Considerations

Vector units hate division. Design algorithms around this:

// Bad: integer division in the inner loop. AVX2 has no integer divide;
// _mm256_div_epi32 is an SVML library helper (ICC/MSVC), not one instruction.
for (int i = 0; i < n; i += 8) {
    __m256i v = _mm256_loadu_si256((const __m256i *)&data[i]);  // data: int32_t[]
    v = _mm256_div_epi32(v, _mm256_set1_epi32(divisor));        // Slow
    _mm256_storeu_si256((__m256i *)&data[i], v);
}

// Good: multiply by the reciprocal (float data; trades exactness for speed)
__m256 recip = _mm256_set1_ps(1.0f / divisor);
for (int i = 0; i < n; i += 8) {
    __m256 v = _mm256_loadu_ps(&fdata[i]);                      // fdata: float[]
    v = _mm256_mul_ps(v, recip);
    _mm256_storeu_ps(&fdata[i], v);
}

Profile Everything

Strength reduction isn't always worth it. Measure:

# Check what compiler actually emits
gcc -O3 -S your_code.c
objdump -d your_binary

# Profile hotspots
perf record -g ./your_program
perf report

Sometimes the "slow" operation isn't the bottleneck. Sometimes your clever optimization makes code unreadable for 2% gain. Pick your battles.

Bottom Line

Compiler handles obvious cases. You handle the subtle ones. Know your tools, profile your assumptions, and remember—premature optimization kills more projects than slow division ever will.
