733-profile-guided-patterns — Profile-Guided Patterns
Tutorial
The Problem
Writing fast code without profiling data is guesswork. Profile-guided optimization (PGO) uses runtime measurements to inform the compiler about which branches are hot, which functions should be inlined, and how data should be laid out in memory. Even without a full PGO pipeline, developers can apply the same principles manually: annotate hot/cold branches, choose Struct-of-Arrays over Array-of-Structs for SIMD-friendly access, and use black_box to ensure benchmarks measure real work.
🎯 Learning Outcomes
- Apply #[cold] and #[inline(never)] to error handlers to keep them out of the hot path
- Use std::hint::black_box correctly to prevent the compiler from optimizing away benchmark subjects
- Weigh likely/unlikely intrinsics (nightly) against structural code changes
Code Example
#![allow(clippy::all)]
// ── black_box usage ───────────────────────────────────────────────────────────
/// Without black_box, the compiler may constant-fold this entire call.
#[inline(never)] // ensures the function appears in profiler output
fn sum_squares(n: u64) -> u64 {
(0..n).map(|i| i * i).sum()
}
// ── Hot / Cold path annotation ────────────────────────────────────────────────
/// Mark rare error-handling code as cold to keep it out of the hot path.
#[cold]
#[inline(never)]
fn handle_overflow(a: u64, b: u64) -> u64 {
eprintln!("overflow: {} + {}", a, b);
u64::MAX
}
fn checked_add_hot(a: u64, b: u64) -> u64 {
// Compiler infers that the success branch is hot
a.checked_add(b).unwrap_or_else(|| handle_overflow(a, b))
}
// ── Struct-of-Arrays (SoA) vs Array-of-Structs (AoS) ─────────────────────────
/// AoS: poor cache use when accessing only one field
#[allow(dead_code)]
struct PointAoS {
x: f32,
y: f32,
z: f32,
}
/// SoA: excellent cache use — each array is contiguous
struct PointsSoA {
x: Vec<f32>,
y: Vec<f32>,
z: Vec<f32>,
}
impl PointsSoA {
fn new(n: usize) -> Self {
PointsSoA {
x: (0..n).map(|i| i as f32).collect(),
y: (0..n).map(|i| i as f32 * 2.0).collect(),
z: (0..n).map(|i| i as f32 * 3.0).collect(),
}
}
/// Only touches the `x` array — minimal cache footprint.
fn sum_x(&self) -> f32 {
self.x.iter().sum()
}
}
fn aos_sum_x(points: &[PointAoS]) -> f32 {
// Loads all 3 floats even though we only want x → cache waste
points.iter().map(|p| p.x).sum()
}
// ── Measurement discipline ────────────────────────────────────────────────────
fn measure_ns<F: FnOnce() -> R, R>(f: F) -> (R, u128) {
let t0 = std::time::Instant::now();
let result = f();
let elapsed = t0.elapsed().as_nanos();
(result, elapsed)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn sum_squares_correct() {
// 0² + 1² + 2² + 3² = 14
assert_eq!(sum_squares(4), 14);
assert_eq!(sum_squares(0), 0);
}
#[test]
fn checked_add_no_overflow() {
assert_eq!(checked_add_hot(3, 4), 7);
}
#[test]
fn checked_add_overflow_returns_max() {
assert_eq!(checked_add_hot(u64::MAX, 1), u64::MAX);
}
#[test]
fn soa_sum_x_correct() {
let soa = PointsSoA::new(5); // x = [0,1,2,3,4]
assert_eq!(soa.sum_x(), 10.0);
}
#[test]
fn aos_sum_x_correct() {
let aos: Vec<PointAoS> = (0..5u32)
.map(|i| PointAoS {
x: i as f32,
y: 0.0,
z: 0.0,
})
.collect();
assert_eq!(aos_sum_x(&aos), 10.0);
}
}
Key Differences
- #[cold] is stable; OCaml's equivalents require Flambda2 or Jane Street extensions.
- #[inline(never)] guarantees a symbol in perf/Instruments; OCaml's compiler may inline across modules unpredictably.
- Vec<f32> fields in SoA are directly SIMD-accessible via std::simd or packed_simd; OCaml requires Bigarray for equivalent alignment guarantees.
OCaml Approach
OCaml does not expose hot/cold annotations in the standard language; Flambda2 (the optimizing backend developed by OCamlPro and Jane Street) provides [@likely] and [@unlikely] hints. The SoA pattern is expressible in OCaml as parallel arrays (e.g. float array * float array * float array) but is less ergonomic than the idiomatic array of records. OCaml's Bigarray provides C-contiguous flat memory for SIMD-compatible layouts, as used in scientific computing (Owl, Lacaml).
Full Source
#![allow(clippy::all)]
// ── black_box usage ───────────────────────────────────────────────────────────
/// Without black_box, the compiler may constant-fold this entire call.
#[inline(never)] // ensures the function appears in profiler output
fn sum_squares(n: u64) -> u64 {
(0..n).map(|i| i * i).sum()
}
// ── Hot / Cold path annotation ────────────────────────────────────────────────
/// Mark rare error-handling code as cold to keep it out of the hot path.
#[cold]
#[inline(never)]
fn handle_overflow(a: u64, b: u64) -> u64 {
eprintln!("overflow: {} + {}", a, b);
u64::MAX
}
fn checked_add_hot(a: u64, b: u64) -> u64 {
// Compiler infers that the success branch is hot
a.checked_add(b).unwrap_or_else(|| handle_overflow(a, b))
}
// ── Struct-of-Arrays (SoA) vs Array-of-Structs (AoS) ─────────────────────────
/// AoS: poor cache use when accessing only one field
#[allow(dead_code)]
struct PointAoS {
x: f32,
y: f32,
z: f32,
}
/// SoA: excellent cache use — each array is contiguous
struct PointsSoA {
x: Vec<f32>,
y: Vec<f32>,
z: Vec<f32>,
}
impl PointsSoA {
fn new(n: usize) -> Self {
PointsSoA {
x: (0..n).map(|i| i as f32).collect(),
y: (0..n).map(|i| i as f32 * 2.0).collect(),
z: (0..n).map(|i| i as f32 * 3.0).collect(),
}
}
/// Only touches the `x` array — minimal cache footprint.
fn sum_x(&self) -> f32 {
self.x.iter().sum()
}
}
fn aos_sum_x(points: &[PointAoS]) -> f32 {
// Loads all 3 floats even though we only want x → cache waste
points.iter().map(|p| p.x).sum()
}
// ── Measurement discipline ────────────────────────────────────────────────────
fn measure_ns<F: FnOnce() -> R, R>(f: F) -> (R, u128) {
let t0 = std::time::Instant::now();
let result = f();
let elapsed = t0.elapsed().as_nanos();
(result, elapsed)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn sum_squares_correct() {
// 0² + 1² + 2² + 3² = 14
assert_eq!(sum_squares(4), 14);
assert_eq!(sum_squares(0), 0);
}
#[test]
fn checked_add_no_overflow() {
assert_eq!(checked_add_hot(3, 4), 7);
}
#[test]
fn checked_add_overflow_returns_max() {
assert_eq!(checked_add_hot(u64::MAX, 1), u64::MAX);
}
#[test]
fn soa_sum_x_correct() {
let soa = PointsSoA::new(5); // x = [0,1,2,3,4]
assert_eq!(soa.sum_x(), 10.0);
}
#[test]
fn aos_sum_x_correct() {
let aos: Vec<PointAoS> = (0..5u32)
.map(|i| PointAoS {
x: i as f32,
y: 0.0,
z: 0.0,
})
.collect();
assert_eq!(aos_sum_x(&aos), 10.0);
}
}
Exercises
1. Modify PointsSoA to store x: Box<[f32]> instead of Vec<f32> to eliminate the capacity word and make the layout more cache-friendly.
2. Benchmark sum_x on AoS vs. SoA for 1M points. Measure L1 cache misses using perf stat -e cache-misses.
3. Add a sum_xyz function to PointsSoA that computes x[i] + y[i] + z[i] for all i using iterator zip; compare its performance to the AoS equivalent.