733-profile-guided-patterns — Profile-Guided Patterns
Tutorial
The Problem
Writing fast code without profiling data is guesswork. Profile-guided optimization (PGO) uses runtime measurements to inform the compiler about which branches are hot, which functions should be inlined, and how data should be laid out in memory. Even without a full PGO pipeline, developers can apply the same principles manually: annotate hot/cold branches, choose Struct-of-Arrays over Array-of-Structs for SIMD-friendly access, and use black_box to ensure benchmarks measure real work.
🎯 Learning Outcomes
- Apply #[cold] and #[inline(never)] to error handlers to keep them out of the hot path
- Use std::hint::black_box correctly to prevent the compiler from optimizing away benchmark subjects
- Weigh likely/unlikely intrinsics (nightly) against structural code changes
Code Example
#![allow(clippy::all)]
// ── black_box usage ───────────────────────────────────────────────────────────
/// Without black_box, the compiler may constant-fold this entire call.
#[inline(never)] // ensures the function appears in profiler output
fn sum_squares(n: u64) -> u64 {
(0..n).map(|i| i * i).sum()
}
// ── Hot / Cold path annotation ────────────────────────────────────────────────
/// Mark rare error-handling code as cold to keep it out of the hot path.
#[cold]
#[inline(never)]
fn handle_overflow(a: u64, b: u64) -> u64 {
eprintln!("overflow: {} + {}", a, b);
u64::MAX
}
fn checked_add_hot(a: u64, b: u64) -> u64 {
// Compiler infers that the success branch is hot
a.checked_add(b).unwrap_or_else(|| handle_overflow(a, b))
}
// ── Struct-of-Arrays (SoA) vs Array-of-Structs (AoS) ─────────────────────────
/// AoS: poor cache use when accessing only one field
#[allow(dead_code)]
struct PointAoS {
x: f32,
y: f32,
z: f32,
}
/// SoA: excellent cache use — each array is contiguous
struct PointsSoA {
x: Vec<f32>,
y: Vec<f32>,
z: Vec<f32>,
}
impl PointsSoA {
fn new(n: usize) -> Self {
PointsSoA {
x: (0..n).map(|i| i as f32).collect(),
y: (0..n).map(|i| i as f32 * 2.0).collect(),
z: (0..n).map(|i| i as f32 * 3.0).collect(),
}
}
/// Only touches the `x` array — minimal cache footprint.
fn sum_x(&self) -> f32 {
self.x.iter().sum()
}
}
fn aos_sum_x(points: &[PointAoS]) -> f32 {
// Loads all 3 floats even though we only want x → cache waste
points.iter().map(|p| p.x).sum()
}
// ── Measurement discipline ────────────────────────────────────────────────────
fn measure_ns<F: FnOnce() -> R, R>(f: F) -> (R, u128) {
let t0 = std::time::Instant::now();
let result = f();
let elapsed = t0.elapsed().as_nanos();
(result, elapsed)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn sum_squares_correct() {
// 0² + 1² + 2² + 3² = 14
assert_eq!(sum_squares(4), 14);
assert_eq!(sum_squares(0), 0);
}
#[test]
fn checked_add_no_overflow() {
assert_eq!(checked_add_hot(3, 4), 7);
}
#[test]
fn checked_add_overflow_returns_max() {
assert_eq!(checked_add_hot(u64::MAX, 1), u64::MAX);
}
#[test]
fn soa_sum_x_correct() {
let soa = PointsSoA::new(5); // x = [0,1,2,3,4]
assert_eq!(soa.sum_x(), 10.0);
}
#[test]
fn aos_sum_x_correct() {
let aos: Vec<PointAoS> = (0..5u32)
.map(|i| PointAoS {
x: i as f32,
y: 0.0,
z: 0.0,
})
.collect();
assert_eq!(aos_sum_x(&aos), 10.0);
}
}
Key Differences
- #[cold] is stable; OCaml's equivalents require Flambda2 or Jane Street extensions.
- #[inline(never)] guarantees a symbol in perf/Instruments; OCaml's compiler may inline across modules unpredictably.
- Vec<f32> fields in SoA are directly SIMD-accessible via std::simd or packed_simd; OCaml requires Bigarray for equivalent alignment guarantees.
OCaml Approach
OCaml does not expose hot/cold annotations in the standard language; Flambda2 (the optimizing backend developed by OCamlPro and Jane Street) provides [@likely] and [@unlikely] hints. The SoA pattern is expressible in OCaml as parallel arrays (e.g. float array * float array * float array) but is less ergonomic than the idiomatic array of records. OCaml's Bigarray provides C-contiguous flat memory for SIMD-compatible layouts, as used in scientific computing (Owl, Lacaml).
Full Source
#![allow(clippy::all)]
// ── black_box usage ───────────────────────────────────────────────────────────
/// Without black_box, the compiler may constant-fold this entire call.
#[inline(never)] // ensures the function appears in profiler output
fn sum_squares(n: u64) -> u64 {
(0..n).map(|i| i * i).sum()
}
// ── Hot / Cold path annotation ────────────────────────────────────────────────
/// Mark rare error-handling code as cold to keep it out of the hot path.
#[cold]
#[inline(never)]
fn handle_overflow(a: u64, b: u64) -> u64 {
eprintln!("overflow: {} + {}", a, b);
u64::MAX
}
fn checked_add_hot(a: u64, b: u64) -> u64 {
// Compiler infers that the success branch is hot
a.checked_add(b).unwrap_or_else(|| handle_overflow(a, b))
}
// ── Struct-of-Arrays (SoA) vs Array-of-Structs (AoS) ─────────────────────────
/// AoS: poor cache use when accessing only one field
#[allow(dead_code)]
struct PointAoS {
x: f32,
y: f32,
z: f32,
}
/// SoA: excellent cache use — each array is contiguous
struct PointsSoA {
x: Vec<f32>,
y: Vec<f32>,
z: Vec<f32>,
}
impl PointsSoA {
fn new(n: usize) -> Self {
PointsSoA {
x: (0..n).map(|i| i as f32).collect(),
y: (0..n).map(|i| i as f32 * 2.0).collect(),
z: (0..n).map(|i| i as f32 * 3.0).collect(),
}
}
/// Only touches the `x` array — minimal cache footprint.
fn sum_x(&self) -> f32 {
self.x.iter().sum()
}
}
fn aos_sum_x(points: &[PointAoS]) -> f32 {
// Loads all 3 floats even though we only want x → cache waste
points.iter().map(|p| p.x).sum()
}
// ── Measurement discipline ────────────────────────────────────────────────────
fn measure_ns<F: FnOnce() -> R, R>(f: F) -> (R, u128) {
let t0 = std::time::Instant::now();
let result = f();
let elapsed = t0.elapsed().as_nanos();
(result, elapsed)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn sum_squares_correct() {
// 0² + 1² + 2² + 3² = 14
assert_eq!(sum_squares(4), 14);
assert_eq!(sum_squares(0), 0);
}
#[test]
fn checked_add_no_overflow() {
assert_eq!(checked_add_hot(3, 4), 7);
}
#[test]
fn checked_add_overflow_returns_max() {
assert_eq!(checked_add_hot(u64::MAX, 1), u64::MAX);
}
#[test]
fn soa_sum_x_correct() {
let soa = PointsSoA::new(5); // x = [0,1,2,3,4]
assert_eq!(soa.sum_x(), 10.0);
}
#[test]
fn aos_sum_x_correct() {
let aos: Vec<PointAoS> = (0..5u32)
.map(|i| PointAoS {
x: i as f32,
y: 0.0,
z: 0.0,
})
.collect();
assert_eq!(aos_sum_x(&aos), 10.0);
}
}
Exercises
1. Modify PointsSoA to store x: Box<[f32]> instead of Vec<f32> to eliminate the capacity word and make the layout more cache-friendly.
2. Benchmark sum_x on AoS vs. SoA for 1M points. Measure L1 cache misses using perf stat -e cache-misses.
3. Add a sum_xyz function to PointsSoA that computes x[i] + y[i] + z[i] for all i using iterator zip; compare its performance to the AoS equivalent.