Strength reduction swaps expensive ops for cheap ones. Division becomes multiplication. Modulo becomes bitwise AND. Your compiler does this automatically—when it can.
Check if a number is odd: x % 2 becomes x & 1. One single-cycle AND instead of a divide that burns tens of cycles. Free performance.
Divide by 8: x / 8 becomes x >> 3. Exact for unsigned values; for signed ones the compiler adds a small fixup so negatives still round toward zero. Bit shift vs the division unit. No contest.
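A minimal sketch of both cases (function names are mine):

// Unsigned: the shift is exactly x / 8
unsigned div8_unsigned(unsigned x) { return x >> 3; }

// Signed: compiles to a shift plus a fixup, because C rounds toward zero
int div8_signed(int x) { return x / 8; }

// Odd test: same result as x % 2 != 0 for any int
int is_odd(int x) { return (x & 1) != 0; }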
Compiler sees x % 1245? It optimizes to a multiply-shift sequence. Sees x % divisor, where divisor is a runtime variable? It can't help you: it emits idiv.
// This gets optimized: 16 is a compile-time constant
void process_multiples_of_16(const int *data, int n) {
    for (int i = 0; i < n; i++) {
        if (data[i] % 16 == 0) process(data[i]);
    }
}

// This doesn't: divisor is only known at runtime
void process_multiples(const int *data, int n, int divisor) {
    for (int i = 0; i < n; i++) {
        if (data[i] % divisor == 0) process(data[i]);
    }
}
Same input, same divisor value. Only difference: compiler knowledge at build time.
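One way to hand that knowledge back: specialize hot divisors at compile time. A sketch; which divisors deserve a case is up to your profile:

// Each instantiation sees a constant divisor, so the compiler strength-reduces it
template <int D>
void process_multiples_fixed(const int *data, int n) {
    for (int i = 0; i < n; i++) {
        if (data[i] % D == 0) process(data[i]);
    }
}

void process_multiples_dispatch(const int *data, int n, int divisor) {
    switch (divisor) {
        case 10: process_multiples_fixed<10>(data, n); return;
        case 16: process_multiples_fixed<16>(data, n); return;
        default: // cold path: runtime divisor, idiv and all
            for (int i = 0; i < n; i++)
                if (data[i] % divisor == 0) process(data[i]);
    }
}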
Sometimes you know better than the compiler:
// Instead of x % (1 << k)  (x unsigned)
result = x & ((1 << k) - 1);

// Instead of x / (1 << k)  (x unsigned)
result = x >> k;

// Replace expensive modulo in loops
// when you know the pattern
int mask = size - 1; // size must be a power of 2
for (int i = 0; i < n; i++) {
    buffer[i & mask] = data[i];
}
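If size isn't guaranteed to be a power of two, round it up first. The helper name is mine; the trick is classic bit-twiddling, valid for v in [1, 2^31]:

#include <stdint.h>

// Smear the top set bit downward, then add 1 to land on the next power of 2
uint32_t next_pow2(uint32_t v) {
    v--;
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16;
    return v + 1;
}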
Classic trick: replace x / d with (x * magic) >> shift. The compiler does this for constant divisors. You can do it yourself for divisors that are fixed at runtime:
#include <cstdint>

// Precompute magic constants for fast division (round-up method; requires d >= 2)
struct FastDiv {
    uint32_t multiplier; // low 32 bits of the 33-bit magic number
    uint8_t shift;       // ceil(log2(d))
};

FastDiv make_fast_div(uint32_t d) {
    uint8_t L = 32 - __builtin_clz(d - 1);             // ceil(log2(d))
    uint64_t magic = (UINT64_MAX >> (32 - L)) / d + 1; // ceil(2^(32+L) / d), fits in 33 bits
    return {(uint32_t)magic, L};
}

uint32_t fast_divide(uint32_t x, FastDiv fd) {
    uint32_t t = (uint32_t)(((uint64_t)x * fd.multiplier) >> 32);
    // Fold the magic's implicit 33rd bit back in: (x + t) / 2 without overflow
    return (t + ((x - t) >> 1)) >> (fd.shift - 1);
}
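Worth verifying against hardware division before you trust it. A quick self-test sketch:

#include <cassert>

void test_fast_div() {
    for (uint32_t d = 2; d < 1000; d++) {
        FastDiv fd = make_fast_div(d);
        // Stride through the 32-bit range with a prime step to sample it widely
        for (uint64_t x = 0; x <= 0xFFFFFFFFULL; x += 104729)
            assert(fast_divide((uint32_t)x, fd) == (uint32_t)x / d);
    }
}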
Replace branches with arithmetic when possible:
// Instead of: if (x < 0) x = -x;
int mask = x >> 31;    // -1 if negative, 0 if non-negative (arithmetic shift on most compilers)
x = (x ^ mask) - mask; // branchless absolute value
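Same pattern covers other branches. A branchless min, as one more sketch:

// Branchless min: mask is all ones exactly when a < b
int min_branchless(int a, int b) {
    int mask = -(a < b);            // -1 if a < b, else 0
    return (a & mask) | (b & ~mask);
}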
Vector units hate division. Design algorithms around this:
// Bad: division in the inner loop (data is int32_t, 32-byte aligned, n a multiple of 8)
for (int i = 0; i < n; i += 8) {
    __m256i v = _mm256_load_si256((const __m256i *)&data[i]);
    v = _mm256_div_epi32(v, _mm256_set1_epi32(divisor)); // SVML routine, not a single instruction: slow
    _mm256_store_si256((__m256i *)&data[i], v);
}

// Good: reciprocal multiplication (float data; trades a little exactness for speed)
__m256 reciprocal = _mm256_set1_ps(1.0f / divisor);
for (int i = 0; i < n; i += 8) {
    __m256 v = _mm256_load_ps(&data[i]);
    v = _mm256_mul_ps(v, reciprocal);
    _mm256_store_ps(&data[i], v);
}
Strength reduction isn't always worth it. Measure:
# Check what the compiler actually emits
gcc -O3 -S your_code.c
objdump -d your_binary

# Profile hotspots
perf record -g ./your_program
perf report
Sometimes the "slow" operation isn't the bottleneck. Sometimes your clever optimization makes code unreadable for 2% gain. Pick your battles.
Compiler handles obvious cases. You handle the subtle ones. Know your tools, profile your assumptions, and remember—premature optimization kills more projects than slow division ever will.