Cache-Friendly Iteration
Tutorial
The Problem
A CPU cache operates on 64-byte cache lines. When a program accesses memory sequentially (stride-1 access), the hardware prefetcher loads ahead and nearly every access hits cache. When a program accesses memory with large strides (skipping N bytes between each element), the prefetcher cannot keep up and every access is a cache miss costing 100–300 cycles on modern hardware.
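The effect is easy to observe with nothing but the standard library. The sketch below (names and constants are my own; absolute timings are machine-dependent) sums the same buffer once sequentially and once with a 4 KiB stride; on most hardware the strided pass is several times slower even though it performs the same arithmetic:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: usize = 1 << 22; // 4M f32s = 16 MiB, larger than many L2 caches
    const STRIDE: usize = 1024; // 1024 f32s = 4 KiB between successive accesses

    let data = vec![1.0_f32; N];

    // Stride-1: sequential, prefetcher-friendly.
    let t = Instant::now();
    let seq: f32 = data.iter().sum();
    let seq_time = t.elapsed();

    // Strided: touches every element exactly once, but 4 KiB apart each step.
    let t = Instant::now();
    let mut strided = 0.0_f32;
    for start in 0..STRIDE {
        let mut i = start;
        while i < N {
            strided += data[i];
            i += STRIDE;
        }
    }
    let strided_time = t.elapsed();

    // Every partial sum is an integer below 2^24, so both f32 sums are exact.
    assert_eq!(black_box(seq), black_box(strided));
    println!("sequential: {seq_time:?}, strided: {strided_time:?}");
}
```

`black_box` prevents the optimizer from folding either loop away; build with `--release` to see the gap clearly.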
Matrix transposition is the canonical example: iterating row-major over a row-major
matrix is fast; iterating column-major is slow because each column element is separated
by cols * sizeof(element) bytes. For a 1024×1024 float matrix (4 MiB), column-major access
misses on nearly every element (~1M misses), while row-major misses once per 16-float
cache line (~65K misses), roughly a 16× difference. Tiled (blocked) transposition avoids
the bad miss pattern by fitting tiles into L1/L2 cache, enabling sequential reads and
writes within each tile.
🎯 Learning Outcomes
Use cargo bench / criterion to attribute performance differences to memory access.
Code Example
pub struct Matrix {
data: Vec<f32>,
pub rows: usize,
pub cols: usize,
}
impl Matrix {
// Row-major sum: iterates the backing Vec sequentially → optimal
pub fn sum_row_major(&self) -> f32 {
self.data.iter().sum()
}
// Column-major: stride = cols * sizeof(f32) bytes → cache-unfriendly
pub fn sum_col_major(&self) -> f32 {
(0..self.cols)
.flat_map(|c| (0..self.rows).map(move |r| (r, c)))
.map(|(r, c)| self.get(r, c))
.sum()
}
}
Key Differences
| Aspect | Rust | OCaml |
|---|---|---|
| Float array layout | Unboxed Vec<f64>, row-major | Flat float array or Bigarray |
| Auto-vectorization | LLVM vectorizes stride-1 loops | Limited; needs Bigarray + hints |
| Tiling strategy | Nested loops with step_by | Same approach, no stdlib support |
| Cache profiling | perf stat -e cache-misses | Same external tools |
| Abstraction cost | Zero-cost iterators | Closure calls add minor overhead |
Both languages expose identical cache-miss pathologies. Rust's LLVM backend is more likely to auto-vectorize the tiled inner loop.
OCaml Approach
OCaml arrays are flat and unboxed for float array, so the access pattern issues are
identical. The standard library offers Bigarray for C-layout (row-major) or
Fortran-layout (column-major) control:
open Bigarray
let make_matrix rows cols =
Array2.create float64 c_layout rows cols
(* Row-major: inner loop over columns → cache-friendly *)
let sum_row_major mat =
let rows = Array2.dim1 mat and cols = Array2.dim2 mat in
let acc = ref 0.0 in
for r = 0 to rows - 1 do
for c = 0 to cols - 1 do
acc := !acc +. mat.{r, c}
done
done;
!acc
OCaml's optimizer does not auto-vectorize as aggressively as LLVM/Clang; SIMD
benefits require pairing Bigarray with libraries such as Lacaml or Owl.
Full Source
#![allow(clippy::all)]
//! 720: Cache-Friendly Iteration and Data Access Patterns
//!
//! Demonstrates row-major vs column-major access, tiled transposition,
//! and how sequential iterator chaining produces cache-friendly, auto-vectorisable code.
//!
//! Key insight: a CPU cache line is 64 bytes = 16 f32s. Sequential access lets
//! the hardware prefetcher load the next cache line before you need it.
//! Column-major access on a row-major matrix jumps `cols * 4` bytes between
//! elements → each access is a potential cache miss.
// ── Flat row-major matrix ───────────────────────────────────────────────────
/// A flat, row-major 2-D matrix backed by a single `Vec<f32>`.
///
/// Element `(r, c)` lives at `data[r * cols + c]`, so iterating over a
/// row is sequential → identical to iterating over a slice.
pub struct Matrix {
data: Vec<f32>,
pub rows: usize,
pub cols: usize,
}
impl Matrix {
/// Allocate a matrix filled with `init`.
pub fn new(rows: usize, cols: usize, init: f32) -> Self {
Self {
data: vec![init; rows * cols],
rows,
cols,
}
}
/// Allocate a matrix where element `(r, c)` is computed by `f(r, c)`.
pub fn from_fn(rows: usize, cols: usize, f: impl Fn(usize, usize) -> f32) -> Self {
// Sequential nested push: identical memory access order to the flat_map
// version, but avoids closure ownership issues when `f` is not `Copy`.
let mut data = Vec::with_capacity(rows * cols);
for r in 0..rows {
for c in 0..cols {
data.push(f(r, c));
}
}
Self { data, rows, cols }
}
#[inline]
pub fn get(&self, r: usize, c: usize) -> f32 {
self.data[r * self.cols + c]
}
#[inline]
pub fn set(&mut self, r: usize, c: usize, v: f32) {
self.data[r * self.cols + c] = v;
}
/// Sum all elements: iterates the backing `Vec` sequentially → optimal.
///
/// The compiler can auto-vectorise this into SIMD because the memory
/// layout is contiguous and the access pattern is fully predictable.
pub fn sum_row_major(&self) -> f32 {
self.data.iter().sum()
}
/// Sum all elements column-by-column (cache-unfriendly reference path).
///
/// Between two successive accesses the stride is `cols * sizeof(f32)` bytes.
/// On a 1024-column matrix that is 4096 bytes → 64 cache lines skipped.
/// Every access is likely a cache miss once the matrix exceeds L1 capacity.
pub fn sum_col_major(&self) -> f32 {
(0..self.cols)
.flat_map(|c| (0..self.rows).map(move |r| (r, c)))
.map(|(r, c)| self.get(r, c))
.sum()
}
/// Row iterator: yields `&[f32]` slices → zero overhead, sequential.
pub fn rows_iter(&self) -> impl Iterator<Item = &[f32]> {
self.data.chunks(self.cols)
}
/// Transpose using tiled (blocked) access.
///
/// A naïve transpose writes column-by-column into the output (cache-unfriendly
/// on writes). Tiling processes `TILE × TILE` sub-blocks that fit in L1 cache,
/// so both reads and writes stay warm.
pub fn transpose_tiled(&self, tile: usize) -> Self {
let mut out = Self::new(self.cols, self.rows, 0.0);
for r_base in (0..self.rows).step_by(tile) {
for c_base in (0..self.cols).step_by(tile) {
let r_end = (r_base + tile).min(self.rows);
let c_end = (c_base + tile).min(self.cols);
for r in r_base..r_end {
for c in c_base..c_end {
out.set(c, r, self.get(r, c));
}
}
}
}
out
}
/// Naïve (non-tiled) transpose for comparison.
pub fn transpose_naive(&self) -> Self {
Self::from_fn(self.cols, self.rows, |r, c| self.get(c, r))
}
}
// ── Structure of Arrays (SoA) vs Array of Structures (AoS) ──────────────────
/// Array of Structures – common OOP layout, cache-unfriendly for field-only ops.
///
/// If you only need `x` values, you still load `y` and `z` into cache.
pub struct Vec3Aos {
pub x: f32,
pub y: f32,
pub z: f32,
}
/// Structure of Arrays – all `x` values are contiguous.
///
/// A loop over `xs` touches only the `x` cache lines; `ys` and `zs` are never
/// loaded. SIMD auto-vectorisation is straightforward because the data is already
/// in the right layout for `_mm256_add_ps` etc.
pub struct Vec3SoA {
pub xs: Vec<f32>,
pub ys: Vec<f32>,
pub zs: Vec<f32>,
}
impl Vec3SoA {
pub fn new(n: usize) -> Self {
Self {
xs: vec![0.0; n],
ys: vec![0.0; n],
zs: vec![0.0; n],
}
}
/// Sum only the x-components: touches one contiguous Vec → minimal cache pressure.
pub fn sum_x(&self) -> f32 {
self.xs.iter().sum()
}
/// Compute squared magnitudes for all vectors.
///
/// Three sequential passes, each over a single contiguous slice.
/// The compiler can vectorise each `.zip` independently.
pub fn magnitudes_sq(&self) -> Vec<f32> {
self.xs
.iter()
.zip(self.ys.iter())
.zip(self.zs.iter())
.map(|((x, y), z)| x * x + y * y + z * z)
.collect()
}
}
// ── Prefetch-friendly sequential scan helpers ───────────────────────────────
/// Running sum over a slice β the canonical sequential scan.
///
/// Produces a prefix-sum array. Sequential read + sequential write:
/// the prefetcher handles both streams perfectly.
pub fn prefix_sum(data: &[f32]) -> Vec<f32> {
data.iter()
.scan(0.0_f32, |acc, &x| {
*acc += x;
Some(*acc)
})
.collect()
}
/// Gather operation (random read): indices may jump anywhere in `src`.
///
/// Classic cache-unfriendly pattern β each `indices[i]` may point to
/// a different cache line in `src`.
pub fn gather(src: &[f32], indices: &[usize]) -> Vec<f32> {
indices.iter().map(|&i| src[i]).collect()
}
/// Scatter operation (random write): inverse cache problem on the write side.
pub fn scatter(dst: &mut [f32], indices: &[usize], values: &[f32]) {
indices
.iter()
.zip(values.iter())
.for_each(|(&i, &v)| dst[i] = v);
}
// ─────────────────────────────────────────────────────────────────────────────
#[cfg(test)]
mod tests {
use super::*;
// ── Matrix correctness ──────────────────────────────────────────────────
#[test]
fn row_and_col_sums_are_equal() {
// A 4×4 matrix filled with 1.0 → both access orders must give 16.0.
let m = Matrix::new(4, 4, 1.0);
assert_eq!(m.sum_row_major(), 16.0);
assert_eq!(m.sum_col_major(), 16.0);
}
#[test]
fn from_fn_and_get_round_trip() {
let m = Matrix::from_fn(3, 3, |r, c| (r * 3 + c) as f32);
assert_eq!(m.get(0, 0), 0.0);
assert_eq!(m.get(1, 1), 4.0);
assert_eq!(m.get(2, 2), 8.0);
}
#[test]
fn transpose_tiled_matches_naive() {
let m = Matrix::from_fn(8, 12, |r, c| (r * 12 + c) as f32);
let t_naive = m.transpose_naive();
let t_tiled = m.transpose_tiled(4);
assert_eq!(t_tiled.rows, 12);
assert_eq!(t_tiled.cols, 8);
for r in 0..t_naive.rows {
for c in 0..t_naive.cols {
assert_eq!(
t_naive.get(r, c),
t_tiled.get(r, c),
"mismatch at ({r},{c})"
);
}
}
}
#[test]
fn rows_iter_yields_correct_slices() {
let m = Matrix::from_fn(3, 4, |r, c| (r * 4 + c) as f32);
let rows: Vec<&[f32]> = m.rows_iter().collect();
assert_eq!(rows.len(), 3);
assert_eq!(rows[0], &[0.0, 1.0, 2.0, 3.0]);
assert_eq!(rows[1], &[4.0, 5.0, 6.0, 7.0]);
assert_eq!(rows[2], &[8.0, 9.0, 10.0, 11.0]);
}
// ── SoA correctness ─────────────────────────────────────────────────────
#[test]
fn soa_sum_x_is_correct() {
let mut soa = Vec3SoA::new(4);
soa.xs = vec![1.0, 2.0, 3.0, 4.0];
assert_eq!(soa.sum_x(), 10.0);
}
#[test]
fn soa_magnitudes_sq_unit_vectors() {
let mut soa = Vec3SoA::new(2);
soa.xs = vec![1.0, 0.0];
soa.ys = vec![0.0, 1.0];
soa.zs = vec![0.0, 0.0];
let mags = soa.magnitudes_sq();
assert_eq!(mags, vec![1.0, 1.0]);
}
// ── Sequential helpers ──────────────────────────────────────────────────
#[test]
fn prefix_sum_basic() {
let data = vec![1.0, 2.0, 3.0, 4.0];
let ps = prefix_sum(&data);
assert_eq!(ps, vec![1.0, 3.0, 6.0, 10.0]);
}
#[test]
fn prefix_sum_empty() {
assert_eq!(prefix_sum(&[]), Vec::<f32>::new());
}
#[test]
fn gather_and_scatter_round_trip() {
let src = vec![10.0_f32, 20.0, 30.0, 40.0];
let indices = vec![3, 1, 0, 2];
let gathered = gather(&src, &indices);
assert_eq!(gathered, vec![40.0, 20.0, 10.0, 30.0]);
let mut dst = vec![0.0_f32; 4];
scatter(&mut dst, &indices, &gathered);
// dst[3]=40, dst[1]=20, dst[0]=10, dst[2]=30 → restores the original order
assert_eq!(dst, vec![10.0, 20.0, 30.0, 40.0]);
}
// ── Large-matrix sum consistency (catches index formula bugs) ───────────
#[test]
fn large_matrix_row_col_sum_agree() {
let m = Matrix::from_fn(32, 64, |r, c| (r + c) as f32);
let row = m.sum_row_major();
let col = m.sum_col_major();
assert!(
(row - col).abs() < 1e-3,
"row={row} col={col} differ by more than epsilon"
);
}
}
Deep Comparison
OCaml vs Rust: Cache-Friendly Iteration and Data Access Patterns
Side-by-Side Code
OCaml – flat matrix, row-major sum
type matrix = { data: float array; rows: int; cols: int }
let make_matrix rows cols init =
{ data = Array.init (rows * cols) init; rows; cols }
let get m r c = m.data.(r * m.cols + c)
(* Row-major: sequential → cache-friendly *)
let sum_row_major m = Array.fold_left (+.) 0.0 m.data
(* Column-major: stride = cols → cache-unfriendly *)
let sum_col_major m =
let acc = ref 0.0 in
for c = 0 to m.cols - 1 do
for r = 0 to m.rows - 1 do
acc := !acc +. get m r c
done
done;
!acc
Rust – flat matrix, row-major sum (idiomatic)
pub struct Matrix {
data: Vec<f32>,
pub rows: usize,
pub cols: usize,
}
impl Matrix {
// Row-major sum: iterates the backing Vec sequentially → optimal
pub fn sum_row_major(&self) -> f32 {
self.data.iter().sum()
}
// Column-major: stride = cols * sizeof(f32) bytes → cache-unfriendly
pub fn sum_col_major(&self) -> f32 {
(0..self.cols)
.flat_map(|c| (0..self.rows).map(move |r| (r, c)))
.map(|(r, c)| self.get(r, c))
.sum()
}
}
Rust – tiled transpose (cache-friendly write path)
pub fn transpose_tiled(&self, tile: usize) -> Self {
let mut out = Self::new(self.cols, self.rows, 0.0);
for r_base in (0..self.rows).step_by(tile) {
for c_base in (0..self.cols).step_by(tile) {
let r_end = (r_base + tile).min(self.rows);
let c_end = (c_base + tile).min(self.cols);
for r in r_base..r_end {
for c in c_base..c_end {
out.set(c, r, self.get(r, c));
}
}
}
}
out
}
Rust – Structure of Arrays (SoA)
pub struct Vec3SoA {
pub xs: Vec<f32>,
pub ys: Vec<f32>,
pub zs: Vec<f32>,
}
impl Vec3SoA {
// Only xs are loaded into cache → ys and zs never touched
pub fn sum_x(&self) -> f32 { self.xs.iter().sum() }
// Three sequential passes, each over one contiguous slice
pub fn magnitudes_sq(&self) -> Vec<f32> {
self.xs.iter().zip(self.ys.iter()).zip(self.zs.iter())
.map(|((x, y), z)| x * x + y * y + z * z)
.collect()
}
}
Type Signatures
| Concept | OCaml | Rust |
|---|---|---|
| Flat matrix | { data: float array; rows: int; cols: int } | struct Matrix { data: Vec<f32>, rows: usize, cols: usize } |
| Element access | m.data.(r * m.cols + c) | self.data[r * self.cols + c] |
| Row-major sum | Array.fold_left (+.) 0.0 m.data | self.data.iter().sum() |
| Prefix scan | Array.fold_left with accumulator | Iterator::scan |
| Stride type | int (unboxed int) | usize (pointer-sized) |
Key Insights
- **Contiguous layout**: OCaml float array and Rust Vec<f32> both store elements contiguously in a single heap allocation. The row-major index formula r * cols + c is the same in both languages.
- **Cache lines are 64 bytes = 16 f32s regardless of language**: row-major iteration is fast in both OCaml and Rust because it is sequential – the hardware prefetcher loads the next cache line speculatively. Column-major access jumps cols * 4 bytes between elements, causing a cache miss on each access once the matrix exceeds L1 capacity (~32 KB).
- **float array array vs flat Vec**: OCaml's nested float array array is an array of pointers to individual row allocations – two pointer dereferences per access and potentially non-contiguous rows. Rust's Vec<Vec<f32>> has the same problem. The flat Vec<f32> with manual index arithmetic eliminates the extra indirection entirely.
- **Tiling keeps both sides of the transpose hot**: processing TILE × TILE sub-blocks ensures both the read tile and the write tile fit in L1 cache simultaneously, maintaining ~L1 bandwidth for both sides. A reasonable starting tile size is sqrt(L1_size / sizeof(element)).
- **SoA loads only the needed Vec into cache**: AoS forces the CPU to load all fields even if only one is used, wasting cache bandwidth by a factor of struct_size / field_size.
- **Iterator::scan is the idiomatic prefix-sum**: OCaml uses Array.fold_left with a ref accumulator; Rust's scan threads the state through the iterator chain without mutation at the call site. Both produce sequential memory access – the access pattern is what matters for cache performance, not the syntactic form.
- **Auto-vectorisation needs stride-1**: LLVM emits SIMD (e.g. 8 f32s per cycle) only when the data is contiguous and the access stride is 1. data.iter().sum() meets this condition; sum_col_major does not, so it cannot be vectorised.
When to Use Each Style
Use row-major iteration when processing all elements or all elements of a row β this is the fast path and should be the default for any flat matrix.
Use tiled transposition when you must transpose a large matrix and performance matters; the tile size should be tuned to the target machine's L1 cache size (typically 4–16 elements per dimension for f32).
Use SoA layout when hot loops operate on a subset of fields from a collection of records (e.g., physics simulation updating only positions, or a renderer reading only normals).
Use gather/scatter sparingly β random-access patterns defeat the hardware prefetcher. Where possible, sort work by memory address or use sequential layouts instead.
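The last point – sorting work by memory address before a gather – can be sketched as follows. The function `gather_sorted` and its name are my own illustration, not an API from the listing above:

```rust
/// Gather `src[i]` for each requested index, but visit the indices in
/// ascending address order so the reads sweep `src` front to back.
/// Each value is tagged with its position in the request so the output
/// still comes back in the order the caller asked for.
fn gather_sorted(src: &[f32], indices: &[usize]) -> Vec<f32> {
    // Pair each source index with its position in the request.
    let mut order: Vec<(usize, usize)> = indices
        .iter()
        .copied()
        .enumerate()
        .map(|(pos, idx)| (idx, pos))
        .collect();
    // Sort by source address: the actual reads become near-sequential.
    order.sort_unstable_by_key(|&(idx, _)| idx);

    let mut out = vec![0.0; indices.len()];
    for (idx, pos) in order {
        out[pos] = src[idx];
    }
    out
}

fn main() {
    let src: Vec<f32> = (0..8).map(|i| i as f32 * 10.0).collect();
    let indices = [5, 0, 7, 2];
    // Same result as a plain gather, but src is read in address order.
    assert_eq!(gather_sorted(&src, &indices), vec![50.0, 0.0, 70.0, 20.0]);
}
```

The trade-off: the writes into `out` become scattered instead, so this only pays off when `src` is much larger than the index set (the usual case for gathers over big tables).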
Exercises
Benchmark sum_row_major against sum_col_major with criterion. Plot cache-miss counts using perf stat.
Tune the transpose tile size to your machine's L1 cache (check getconf LEVEL1_DCACHE_SIZE).
Write a TiledIter<T> that yields (row_start, col_start, tile_data) for any 2D slice, abstracting the tiling logic.
Block a matrix-vector product's row-range loop to keep the x vector in L2 cache.
Compare sum on a matrix stored as Vec<Vec<f64>> vs flat Vec<f64> to quantify the cost of pointer indirection in addition to stride.
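The last exercise – comparing a nested Vec<Vec<f64>> with a flat Vec<f64> – can be started from this std-only sketch. Instant timings are noisy; criterion gives sturdier numbers, but the ordering is usually visible even in a single release-mode run:

```rust
use std::time::Instant;

fn main() {
    let (rows, cols) = (1024, 1024);

    // Nested layout: one heap allocation per row; rows need not be contiguous,
    // and each row access costs an extra pointer dereference.
    let nested: Vec<Vec<f64>> = (0..rows).map(|r| vec![r as f64; cols]).collect();
    // Flat layout: a single contiguous allocation.
    let flat: Vec<f64> = (0..rows).flat_map(|r| vec![r as f64; cols]).collect();

    let t = Instant::now();
    let nested_sum: f64 = nested.iter().map(|row| row.iter().sum::<f64>()).sum();
    let nested_time = t.elapsed();

    let t = Instant::now();
    let flat_sum: f64 = flat.iter().sum();
    let flat_time = t.elapsed();

    // Same elements, same order → identical f64 sums.
    assert_eq!(nested_sum, flat_sum);
    println!("nested: {nested_time:?}, flat: {flat_time:?}");
}
```

To measure indirection plus stride, add a column-major pass over both layouts and compare all four timings.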