Threading, Parallelism and Concurrency

2025-05-12 • ~9 min read

Threading is not free. Context switches cost cycles. False sharing kills performance. Too many threads create more overhead than parallelism. Understanding when and how to parallelize separates fast code from slow code.

Threading Overhead Reality

Every thread has costs—memory, context switching, synchronization:

// Thread creation overhead
Thread stack: ~8MB default on Linux (glibc), ~1MB on Windows
Thread control block: ~8KB kernel structures  
Context switch: 1000-10000 cycles typical

// Small task parallelization - usually wrong
void bad_parallel_sum(std::vector<int>& data) {
    const int num_threads = std::thread::hardware_concurrency();
    const int chunk_size = data.size() / num_threads;
    
    std::vector<std::thread> threads;
    std::vector<int> partial_sums(num_threads);
    
    for (int i = 0; i < num_threads; i++) {
        threads.emplace_back([&, i]() {
            int sum = 0;
            int start = i * chunk_size;
            int end = (i == num_threads - 1) ? static_cast<int>(data.size()) : start + chunk_size;
            for (int j = start; j < end; j++) {
                sum += data[j];  // Simple operation - overhead exceeds benefit
            }
            partial_sums[i] = sum;
        });
    }
    
    for (auto& t : threads) t.join();  // Context switch overhead
}

// Rule of thumb: Task should take >10,000 cycles to benefit from threading
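
Below that threshold the serial path wins, so it helps to make the cut explicit. A minimal sketch of a guarded parallel sum; the per-thread element count is picked arbitrarily for illustration:

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical threshold: below this, thread overhead likely exceeds the benefit
constexpr size_t kMinElementsPerThread = 100000;

long sum_maybe_parallel(const std::vector<int>& data) {
    const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    const size_t num_threads = std::min<size_t>(hw, data.size() / kMinElementsPerThread);
    
    if (num_threads < 2) {
        return std::accumulate(data.begin(), data.end(), 0L);  // Serial is cheaper
    }
    
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> threads;
    const size_t chunk = data.size() / num_threads;
    
    for (size_t i = 0; i < num_threads; i++) {
        threads.emplace_back([&, i] {
            size_t begin = i * chunk;
            size_t end = (i == num_threads - 1) ? data.size() : begin + chunk;
            partial[i] = std::accumulate(data.begin() + begin, data.begin() + end, 0L);
        });
    }
    for (auto& t : threads) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}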

Context Switching Costs

Frequent context switches can make parallel code slower than serial:

// Context switch overhead
Direct costs:
- Save/restore registers: ~100 cycles
- TLB flush: ~1000 cycles  
- Pipeline flush: ~500 cycles
- Cache pollution: Varies greatly

Indirect costs:
- Cache misses after switch: 10,000+ cycles
- Branch predictor state lost: 1000+ cycles
- Prefetcher state lost: Variable

// Pathological case: too many threads
void context_switch_thrashing() {
    const int too_many_threads = 100;  // On 8-core machine
    std::vector<std::thread> threads;
    
    for (int i = 0; i < too_many_threads; i++) {
        threads.emplace_back([i]() {
            // Short-lived computation
            volatile int sum = 0;
            for (int j = 0; j < 1000; j++) {
                sum += j * i;
            }
        });
    }
    
    for (auto& t : threads) t.join();
    // Result: OS scheduler thrashes switching between 100 threads
    // Serial version would be faster
}

Thread Pool Patterns

Reuse threads to amortize creation costs:

class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop = false;

public:
    ThreadPool(size_t threads = std::thread::hardware_concurrency()) {
        for (size_t i = 0; i < threads; ++i) {
            workers.emplace_back([this] {
                while (true) {
                    std::function<void()> task;
                    {
                        std::unique_lock lock(queue_mutex);
                        condition.wait(lock, [this] { return stop || !tasks.empty(); });
                        
                        if (stop && tasks.empty()) return;
                        
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();  // Execute without holding lock
                }
            });
        }
    }
    
    ~ThreadPool() {
        {
            std::unique_lock lock(queue_mutex);
            stop = true;
        }
        condition.notify_all();
        for (auto& worker : workers) worker.join();  // Drain remaining tasks, then exit
    }
    
    template<typename F>
    void enqueue(F&& f) {
        {
            std::unique_lock lock(queue_mutex);
            tasks.emplace(std::forward<F>(f));
        }
        condition.notify_one();
    }
};

// Usage - no thread creation overhead per task
ThreadPool pool(8);
for (int i = 0; i < 1000; i++) {
    pool.enqueue([i] { process_chunk(i); });
}
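
If callers need the result of a task, or just a way to wait for it, a thin wrapper can hand back a std::future. A sketch built on the pool above; enqueue_with_future is a made-up helper name, not part of the class:

#include <future>
#include <memory>

// Hypothetical helper: wraps the callable in a packaged_task so the caller
// can block on (or retrieve a value from) the returned future
template<typename F>
auto enqueue_with_future(ThreadPool& pool, F f) -> std::future<decltype(f())> {
    auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
    std::future<decltype(f())> result = task->get_future();
    pool.enqueue([task] { (*task)(); });
    return result;
}

int example() {
    ThreadPool pool(4);
    auto result = enqueue_with_future(pool, [] { return 6 * 7; });
    return result.get();  // Blocks until a pooled thread has run the task
}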

False Sharing

Threads accessing different variables in same cache line cause performance disaster:

// False sharing example
struct BadCounters {
    volatile int counter0;  // Same cache line (64 bytes)
    volatile int counter1;  // Threads ping-pong cache line
    volatile int counter2;
    volatile int counter3;
};

void false_sharing_demo() {
    BadCounters counters = {0, 0, 0, 0};
    
    std::thread t0([&] { for (int i = 0; i < 1000000; i++) counters.counter0++; });
    std::thread t1([&] { for (int i = 0; i < 1000000; i++) counters.counter1++; });
    
    t0.join();
    t1.join();
    // Cache line bounces between cores on every increment
    // Performance worse than single threaded
}

// Solution: cache line padding
struct alignas(64) GoodCounters {  // Force 64-byte alignment
    volatile int counter;
    char padding[60];              // Fill rest of cache line
};

void no_false_sharing() {
    GoodCounters counters[4] = {};  // Each counter in a separate cache line
    
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++) {
        threads.emplace_back([&, i] {
            for (int j = 0; j < 1000000; j++) {
                counters[i].counter++;  // No cache line contention
            }
        });
    }
    for (auto& t : threads) t.join();
}
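
Hard-coding 64 bytes works on mainstream x86-64, but since C++17 the <new> header exposes std::hardware_destructive_interference_size for exactly this purpose (compiler and standard-library support still varies). A minimal sketch:

#include <atomic>
#include <new>

// Pad each per-thread counter out to the destructive-interference boundary
// instead of assuming the cache line is exactly 64 bytes
struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
    std::atomic<int> value{0};
};

static_assert(sizeof(PaddedCounter) >= std::hardware_destructive_interference_size);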

Lock Overhead and Contention

Locks have significant costs even when uncontended:

// Lock overhead (typical x86-64)
Uncontended mutex lock/unlock: ~25 cycles
Contended mutex (no waiting): ~1000 cycles  
Contended mutex (with waiting): ~10,000+ cycles

// High contention example
std::mutex global_mutex;
int shared_counter = 0;

void high_contention() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; i++) {
        threads.emplace_back([&] {
            for (int j = 0; j < 1000000; j++) {
                std::lock_guard lock(global_mutex);
                shared_counter++;  // Tiny critical section, huge contention
            }
        });
    }
    for (auto& t : threads) t.join();
    // All threads serialize on single mutex - no parallelism
}

// Better: per-thread counters + final reduction
void reduced_contention() {
    const int num_threads = 8;
    std::vector<int> thread_counters(num_threads, 0);
    std::vector<std::thread> threads;
    
    for (int i = 0; i < num_threads; i++) {
        threads.emplace_back([&, i] {
            for (int j = 0; j < 1000000; j++) {
                thread_counters[i]++;  // No synchronization needed
            }
        });
    }
    
    for (auto& t : threads) t.join();
    
    int total = 0;
    for (int count : thread_counters) total += count;  // Single-threaded reduction
}
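
A single shared std::atomic sits between these two extremes: it drops the mutex, but every increment still bounces one cache line between cores, which is why the benchmark at the end of this post shows it far behind per-thread counters. For comparison:

#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> atomic_counter{0};

void atomic_contention() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; i++) {
        threads.emplace_back([] {
            for (int j = 0; j < 1000000; j++) {
                // No mutex, but all cores still serialize on this one cache line
                atomic_counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) t.join();
}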

Work-Stealing and Load Balancing

Uneven work distribution kills parallel efficiency:

// Bad: uneven work distribution
void uneven_workload() {
    std::vector<int> work_sizes = {1000, 5000, 100, 8000, 200};  // Highly variable
    std::vector<std::thread> threads;
    
    for (size_t i = 0; i < work_sizes.size(); i++) {
        threads.emplace_back([&, i] {
            for (int j = 0; j < work_sizes[i]; j++) {
                expensive_computation();
            }
        });
    }
    for (auto& t : threads) t.join();
    // Thread with 8000 units becomes bottleneck - others wait idle
}
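
Before reaching for work stealing, dynamic self-scheduling often fixes the imbalance: threads claim the next item from a shared atomic index, so whoever finishes early simply grabs more work. A minimal sketch reusing the placeholder names from the example above:

#include <atomic>
#include <thread>
#include <vector>

void expensive_computation();  // Placeholder, as in the example above

void dynamic_scheduling(const std::vector<int>& work_items) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> threads;
    const unsigned num_threads = std::thread::hardware_concurrency();
    
    for (unsigned t = 0; t < num_threads; t++) {
        threads.emplace_back([&] {
            // Each thread grabs the next unclaimed item until none remain
            for (size_t i = next.fetch_add(1); i < work_items.size();
                 i = next.fetch_add(1)) {
                for (int j = 0; j < work_items[i]; j++) {
                    expensive_computation();
                }
            }
        });
    }
    for (auto& t : threads) t.join();
}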

// Better: work-stealing queue
class WorkStealingQueue {
    std::deque<std::function<void()>> queue;
    mutable std::mutex mutex;
    
public:
    void push(std::function<void()> task) {
        std::lock_guard lock(mutex);
        queue.push_back(std::move(task));
    }
    
    bool try_pop(std::function<void()>& task) {
        std::lock_guard lock(mutex);
        if (queue.empty()) return false;
        task = std::move(queue.front());
        queue.pop_front();
        return true;
    }
    
    bool try_steal(std::function<void()>& task) {
        std::lock_guard lock(mutex);
        if (queue.empty()) return false;
        task = std::move(queue.back());  // Steal from back
        queue.pop_back();
        return true;
    }
};

// Work-stealing scheduler
void work_stealing_example() {
    const int num_threads = std::thread::hardware_concurrency();
    std::vector<WorkStealingQueue> queues(num_threads);
    std::vector<std::thread> threads;
    
    // Populate work queues
    for (int i = 0; i < 10000; i++) {
        queues[i % num_threads].push([i] { process_work_item(i); });
    }
    
    // Worker threads with stealing
    for (int thread_id = 0; thread_id < num_threads; thread_id++) {
        threads.emplace_back([&, thread_id] {
            std::function<void()> task;
            while (true) {
                // Try own queue first
                if (queues[thread_id].try_pop(task)) {
                    task();
                    continue;
                }
                
                // Try stealing from others
                bool found = false;
                for (int i = 0; i < num_threads; i++) {
                    if (i != thread_id && queues[i].try_steal(task)) {
                        task();
                        found = true;
                        break;
                    }
                }
                
                if (!found) break;  // No more work
            }
        });
    }
    for (auto& t : threads) t.join();
}

Lock-Free Data Structures

Avoid locks entirely with compare-and-swap primitives:

// Lock-free stack using CAS
template<typename T>
class LockFreeStack {
    struct Node {
        T data;
        Node* next;
    };
    
    std::atomic<Node*> head{nullptr};
    
public:
    void push(const T& data) {
        Node* new_node = new Node{data, nullptr};
        new_node->next = head.load();
        
        while (!head.compare_exchange_weak(new_node->next, new_node)) {
            // CAS failed, retry with updated head value
            // new_node->next already updated by compare_exchange_weak
        }
    }
    
    bool pop(T& result) {
        Node* old_head = head.load();
        while (old_head && !head.compare_exchange_weak(old_head, old_head->next)) {
            // CAS failed, retry with updated head value
        }
        
        if (old_head) {
            result = old_head->data;
            delete old_head;  // Unsafe with concurrent pops (use-after-free/ABA) - see below
            return true;
        }
        return false;
    }
};
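
A quick exercise of the stack: concurrent pushes are fine as written, and draining single-threaded after the producers join sidesteps the reclamation hazard flagged in pop. A sketch, not a stress test:

#include <thread>

long lock_free_stack_demo() {
    LockFreeStack<int> stack;
    
    std::thread producer1([&] { for (int i = 0; i < 1000; i++) stack.push(i); });
    std::thread producer2([&] { for (int i = 1000; i < 2000; i++) stack.push(i); });
    producer1.join();
    producer2.join();
    
    // Single-threaded drain: pop() is only risky when pops race with each other
    long total = 0;
    int value;
    while (stack.pop(value)) total += value;
    return total;  // 0 + 1 + ... + 1999
}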

// ABA Problem solution: hazard pointers or epochs
template<typename T>
class SafeLockFreeStack {
    // Implementation requires complex memory management
    // Usually better to use proven libraries like libcds
};

Wait-Free vs Lock-Free

Different progress guarantees for different use cases:

// Progress guarantees (strongest to weakest)
Wait-free:    Every thread makes progress in finite steps
Lock-free:    At least one thread makes progress  
Obstruction-free: Thread makes progress if runs alone
Blocking:     Threads can block indefinitely

// Wait-free counter (simple case)
class WaitFreeCounter {
    std::atomic<int> count{0};
public:
    int increment() {
        return count.fetch_add(1, std::memory_order_relaxed);  // Always completes
    }
    
    int get() const {
        return count.load(std::memory_order_relaxed);  // Always completes
    }
};

// Lock-free queue (Michael & Scott algorithm)
template<typename T>
class LockFreeQueue {
    struct Node {
        std::atomic<T*> data{nullptr};
        std::atomic<Node*> next{nullptr};
    };
    
    std::atomic<Node*> head;
    std::atomic<Node*> tail;
    
public:
    LockFreeQueue() {
        Node* dummy = new Node;
        head.store(dummy);
        tail.store(dummy);
    }
    
    void enqueue(T item) {
        Node* new_node = new Node;
        T* data = new T(std::move(item));
        new_node->data.store(data);
        
        while (true) {
            Node* last = tail.load();
            Node* next = last->next.load();
            
            if (last == tail.load()) {  // Consistency check
                if (next == nullptr) {
                    if (last->next.compare_exchange_weak(next, new_node)) {
                        tail.compare_exchange_weak(last, new_node);  // Help advance tail
                        break;
                    }
                } else {
                    tail.compare_exchange_weak(last, next);  // Help advance tail
                }
            }
        }
    }
    
    // Progress guarantee: at least one thread always makes progress
    // But individual threads may retry indefinitely under high contention
};

Memory Ordering in Concurrent Code

Choose appropriate memory ordering for performance:

// Relaxed ordering for counters
std::atomic<int> stats_counter{0};
void increment_stats() {
    stats_counter.fetch_add(1, std::memory_order_relaxed);  // Fastest
}

// Acquire-release for producer-consumer
std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                    // Regular store
    ready.store(true, std::memory_order_release);    // Release barrier
}

void consumer() {
    while (!ready.load(std::memory_order_acquire));  // Acquire barrier
    assert(payload == 42);                           // Guaranteed visible
}

// Sequential consistency for complex synchronization
std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() { x = 1; r1 = y; }  // Default: memory_order_seq_cst
void thread2() { y = 1; r2 = x; }

// Guarantee: !(r1 == 0 && r2 == 0) with sequential consistency
// With relaxed or acquire-release ordering both r1 and r2 can be 0,
// because each store may be reordered after the other thread's load

NUMA Awareness

Thread and memory placement affects performance on multi-socket systems:

// NUMA topology example
Node 0: CPU 0-7,   Memory Bank 0  (local access: 70ns)
Node 1: CPU 8-15,  Memory Bank 1  (remote access: 140ns)

// Bad: cross-NUMA memory access
void numa_unfriendly() {
    // Allocate on node 0
    int* data = (int*)numa_alloc_onnode(SIZE * sizeof(int), 0);
    
    std::thread worker([&]() {
        // Thread runs on node 1 - all memory accesses are remote
        for (int i = 0; i < SIZE; i++) {
            data[i] = process(data[i]);  // 2x memory latency
        }
    });
    
    // Pin thread to node 1 CPUs (the loop may already be running by the time this applies)
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    for (int i = 8; i < 16; i++) CPU_SET(i, &cpuset);
    pthread_setaffinity_np(worker.native_handle(), sizeof(cpuset), &cpuset);
    
    worker.join();
    numa_free(data, SIZE * sizeof(int));
}

// Good: co-locate thread and memory
void numa_friendly() {
    std::thread worker([&]() {
        // Allocate on the NUMA node this thread is currently running on
        int* data = (int*)numa_alloc_local(SIZE * sizeof(int));
        
        for (int i = 0; i < SIZE; i++) {
            data[i] = process(data[i]);  // Local access - 70ns
        }
        numa_free(data, SIZE * sizeof(int));
    });
    
    worker.join();
    // Let the OS scheduler keep the thread near its memory, or pin it explicitly
}
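
libnuma isn't strictly required either: Linux's default policy is first-touch, so a page lands on the node of the thread that first writes it. A sketch that allocates untouched memory and lets each worker initialize its own chunk, assuming threads stay on their nodes and ignoring page-boundary rounding:

#include <memory>
#include <thread>
#include <vector>

// Allocate without touching, then let each worker initialize its own chunk so
// the pages are first touched (and thus placed) on that worker's NUMA node
std::unique_ptr<double[]> first_touch_alloc(size_t n, unsigned num_threads) {
    std::unique_ptr<double[]> data(new double[n]);  // Default-init: pages not yet touched
    std::vector<std::thread> threads;
    const size_t chunk = n / num_threads;
    
    for (unsigned t = 0; t < num_threads; t++) {
        threads.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = (t == num_threads - 1) ? n : begin + chunk;
            for (size_t i = begin; i < end; i++) {
                data[i] = 0.0;  // First write from this thread places the page
            }
        });
    }
    for (auto& t : threads) t.join();
    return data;
}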

Thread Affinity and CPU Topology

Pin threads to specific cores for consistent performance:

// CPU topology awareness
void set_thread_affinity(std::thread& t, int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    
    int result = pthread_setaffinity_np(t.native_handle(), 
                                       sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        throw std::runtime_error("Failed to set thread affinity");
    }
}

// Avoid hyperthreading interference for CPU-bound tasks  
void pin_to_physical_cores() {
    // Assumes 2-way SMT; actual sibling numbering is topology-dependent (see below)
    const int num_physical_cores = std::thread::hardware_concurrency() / 2;
    std::vector<std::thread> threads;
    
    for (int i = 0; i < num_physical_cores; i++) {
        threads.emplace_back([i] {
            cpu_intensive_work();
        });
        
        // Pin to physical core (skip hyperthreads, assuming siblings get adjacent IDs)
        set_thread_affinity(threads.back(), i * 2);
    }
    for (auto& t : threads) t.join();
}
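
Whether SMT siblings are numbered 0,1 / 2,3 or 0,8 / 1,9 depends on the CPU and kernel, so the stride-2 assumption above should be checked. On Linux, sysfs has the answer (the path is standard, the output format varies):

#include <fstream>
#include <iostream>
#include <string>

// Print which logical CPUs share a physical core with CPU 0 (Linux only)
int main() {
    std::ifstream f("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
    std::string siblings;
    if (std::getline(f, siblings)) {
        std::cout << "CPU 0 SMT siblings: " << siblings << "\n";  // e.g. "0,8" or "0-1"
    }
}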

Parallel summation (1M integers):
Single threaded: 2.1ms
Bad parallelization (100 threads): 15.3ms (context switching overhead)
Thread pool (8 threads): 0.3ms (7x speedup)
Lock-free atomic: 8.9ms (atomic contention)
Per-thread + reduce: 0.28ms (7.5x speedup)

When NOT to Use Threading

Threading isn't always the answer:

// Cases where threading hurts:
// 1. Task granularity too small
for (int i = 0; i < 1000; i++) {
    simple_operation(data[i]);  // 10 cycles each - threading overhead dominates
}

// 2. Memory-bound workloads
void memory_bound() {
    // Bottleneck is memory bandwidth, not CPU
    // More threads just increase cache pressure
    for (size_t i = 0; i < huge_array.size(); i++) {
        result[i] = huge_array[i] * 2;  // Limited by memory, not computation
    }
}

// 3. Heavy synchronization requirements
std::mutex critical_mutex;
void mostly_synchronized() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; i++) {
        threads.emplace_back([&] {
            for (int j = 0; j < 1000; j++) {
                std::lock_guard lock(critical_mutex);
                // 99% of time spent in critical section
                lots_of_shared_work();
            }
        });
    }
    for (auto& t : threads) t.join();
    // All threads serialize - no parallelism benefit
}

Bottom Line

Threading overhead is real. Context switches cost thousands of cycles. False sharing destroys cache performance. Lock contention serializes parallel code. Measure first—many workloads don't benefit from threading. When parallelizing: use thread pools, avoid false sharing, minimize lock contention, consider NUMA placement. Lock-free helps with contention but adds complexity. Profile with thread-aware tools, not just wall-clock time.
