Cache-Friendly Iteration
Tutorial
The Problem
A CPU cache operates on 64-byte cache lines. When a program accesses memory sequentially (stride-1 access), the hardware prefetcher loads ahead and nearly every access hits cache. When a program accesses memory with large strides (skipping N bytes between each element), the prefetcher cannot keep up and every access is a cache miss costing 100–300 cycles on modern hardware.
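The effect is easy to observe with nothing but the standard library. The sketch below (names and constants are my own; absolute timings are machine-dependent) sums the same buffer once sequentially and once with a 4 KiB stride; on most hardware the strided pass is several times slower even though it performs the same arithmetic:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: usize = 1 << 22; // 4M f32s = 16 MiB, larger than many L2 caches
    const STRIDE: usize = 1024; // 1024 f32s = 4 KiB between successive accesses

    let data = vec![1.0_f32; N];

    // Stride-1: sequential, prefetcher-friendly.
    let t = Instant::now();
    let seq: f32 = data.iter().sum();
    let seq_time = t.elapsed();

    // Strided: touches every element exactly once, but 4 KiB apart each step.
    let t = Instant::now();
    let mut strided = 0.0_f32;
    for start in 0..STRIDE {
        let mut i = start;
        while i < N {
            strided += data[i];
            i += STRIDE;
        }
    }
    let strided_time = t.elapsed();

    // Every partial sum is an integer below 2^24, so both f32 sums are exact.
    assert_eq!(black_box(seq), black_box(strided));
    println!("sequential: {seq_time:?}, strided: {strided_time:?}");
}
```

`black_box` prevents the optimizer from folding either loop away; build with `--release` to see the gap clearly.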
Matrix transposition is the canonical example: iterating row-major over a row-major
matrix is fast; iterating column-major is slow because each column element is separated
by cols * sizeof(element) bytes. For a 1024×1024 float matrix (4 MiB), column-major access
misses on nearly every element (~1M misses), while row-major misses once per 16-float
cache line (~65K misses), roughly a 16× difference. Tiled (blocked) transposition avoids
the bad miss pattern by fitting tiles into L1/L2 cache, enabling sequential reads and
writes within each tile.
🎯 Learning Outcomes
Use cargo bench / criterion to attribute performance differences to memory access.
Code Example
pub struct Matrix {
data: Vec<f32>,
pub rows: usize,
pub cols: usize,
}
impl Matrix {
// Row-major sum: iterates the backing Vec sequentially → optimal
pub fn sum_row_major(&self) -> f32 {
self.data.iter().sum()
}
// Column-major: stride = cols * sizeof(f32) bytes → cache-unfriendly
pub fn sum_col_major(&self) -> f32 {
(0..self.cols)
.flat_map(|c| (0..self.rows).map(move |r| (r, c)))
.map(|(r, c)| self.get(r, c))
.sum()
}
}
Key Differences
| Aspect | Rust | OCaml |
|---|---|---|
| Float array layout | Unboxed Vec<f64>, row-major | Flat float array or Bigarray |
| Auto-vectorization | LLVM vectorizes stride-1 loops | Limited; needs Bigarray + hints |
| Tiling strategy | Nested loops with step_by | Same approach, no stdlib support |
| Cache profiling | perf stat -e cache-misses | Same external tools |
| Abstraction cost | Zero-cost iterators | Closure calls add minor overhead |
Both languages expose identical cache-miss pathologies. Rust's LLVM backend is more likely to auto-vectorize the tiled inner loop.
OCaml Approach
OCaml arrays are flat and unboxed for float array, so the access pattern issues are
identical. The standard library offers Bigarray for C-layout (row-major) or
Fortran-layout (column-major) control:
open Bigarray
let make_matrix rows cols =
Array2.create float64 c_layout rows cols
(* Row-major: inner loop over columns → cache-friendly *)
let sum_row_major mat =
let rows = Array2.dim1 mat and cols = Array2.dim2 mat in
let acc = ref 0.0 in
for r = 0 to rows - 1 do
for c = 0 to cols - 1 do
acc := !acc +. mat.{r, c}
done
done;
!acc
OCaml's optimizer does not auto-vectorize as aggressively as LLVM/Clang; SIMD
benefits require pairing Bigarray with libraries such as Lacaml or Owl.
Full Source
#![allow(clippy::all)]
//! 720: Cache-Friendly Iteration and Data Access Patterns
//!
//! Demonstrates row-major vs column-major access, tiled transposition,
//! and how sequential iterator chaining produces cache-friendly, auto-vectorisable code.
//!
//! Key insight: a CPU cache line is 64 bytes = 16 f32s. Sequential access lets
//! the hardware prefetcher load the next cache line before you need it.
//! Column-major access on a row-major matrix jumps `cols * 4` bytes between
//! elements → each access is a potential cache miss.
// ── Flat row-major matrix ───────────────────────────────────────────────────
/// A flat, row-major 2-D matrix backed by a single `Vec<f32>`.
///
/// Element `(r, c)` lives at `data[r * cols + c]`, so iterating over a
/// row is sequential → identical to iterating over a slice.
pub struct Matrix {
data: Vec<f32>,
pub rows: usize,
pub cols: usize,
}
impl Matrix {
/// Allocate a matrix filled with `init`.
pub fn new(rows: usize, cols: usize, init: f32) -> Self {
Self {
data: vec![init; rows * cols],
rows,
cols,
}
}
/// Allocate a matrix where element `(r, c)` is computed by `f(r, c)`.
pub fn from_fn(rows: usize, cols: usize, f: impl Fn(usize, usize) -> f32) -> Self {
// Sequential nested push: identical memory access order to the flat_map
// version, but avoids closure ownership issues when `f` is not `Copy`.
let mut data = Vec::with_capacity(rows * cols);
for r in 0..rows {
for c in 0..cols {
data.push(f(r, c));
}
}
Self { data, rows, cols }
}
#[inline]
pub fn get(&self, r: usize, c: usize) -> f32 {
self.data[r * self.cols + c]
}
#[inline]
pub fn set(&mut self, r: usize, c: usize, v: f32) {
self.data[r * self.cols + c] = v;
}
/// Sum all elements: iterates the backing `Vec` sequentially → optimal.
///
/// The compiler can auto-vectorise this into SIMD because the memory
/// layout is contiguous and the access pattern is fully predictable.
pub fn sum_row_major(&self) -> f32 {
self.data.iter().sum()
}
/// Sum all elements column-by-column (cache-unfriendly reference path).
///
/// Between two successive accesses the stride is `cols * sizeof(f32)` bytes.
/// On a 1024-column matrix that is 4096 bytes → 64 cache lines skipped.
/// Every access is likely a cache miss once the matrix exceeds L1 capacity.
pub fn sum_col_major(&self) -> f32 {
(0..self.cols)
.flat_map(|c| (0..self.rows).map(move |r| (r, c)))
.map(|(r, c)| self.get(r, c))
.sum()
}
/// Row iterator: yields `&[f32]` slices → zero overhead, sequential.
pub fn rows_iter(&self) -> impl Iterator<Item = &[f32]> {
self.data.chunks(self.cols)
}
/// Transpose using tiled (blocked) access.
///
/// A naïve transpose writes column-by-column into the output (cache-unfriendly
/// on writes). Tiling processes `TILE × TILE` sub-blocks that fit in L1 cache,
/// so both reads and writes stay warm.
pub fn transpose_tiled(&self, tile: usize) -> Self {
let mut out = Self::new(self.cols, self.rows, 0.0);
for r_base in (0..self.rows).step_by(tile) {
for c_base in (0..self.cols).step_by(tile) {
let r_end = (r_base + tile).min(self.rows);
let c_end = (c_base + tile).min(self.cols);
for r in r_base..r_end {
for c in c_base..c_end {
out.set(c, r, self.get(r, c));
}
}
}
}
out
}
/// Naïve (non-tiled) transpose for comparison.
pub fn transpose_naive(&self) -> Self {
Self::from_fn(self.cols, self.rows, |r, c| self.get(c, r))
}
}
// ── Structure of Arrays (SoA) vs Array of Structures (AoS) ──────────────────
/// Array of Structures – common OOP layout, cache-unfriendly for field-only ops.
///
/// If you only need `x` values, you still load `y` and `z` into cache.
pub struct Vec3Aos {
pub x: f32,
pub y: f32,
pub z: f32,
}
/// Structure of Arrays – all `x` values are contiguous.
///
/// A loop over `xs` touches only the `x` cache lines; `ys` and `zs` are never
/// loaded. SIMD auto-vectorisation is straightforward because the data is already
/// in the right layout for `_mm256_add_ps` etc.
pub struct Vec3SoA {
pub xs: Vec<f32>,
pub ys: Vec<f32>,
pub zs: Vec<f32>,
}
impl Vec3SoA {
pub fn new(n: usize) -> Self {
Self {
xs: vec![0.0; n],
ys: vec![0.0; n],
zs: vec![0.0; n],
}
}
/// Sum only the x-components: touches one contiguous Vec → minimal cache pressure.
pub fn sum_x(&self) -> f32 {
self.xs.iter().sum()
}
/// Compute squared magnitudes for all vectors.
///
/// Three sequential passes, each over a single contiguous slice.
/// The compiler can vectorise each `.zip` independently.
pub fn magnitudes_sq(&self) -> Vec<f32> {
self.xs
.iter()
.zip(self.ys.iter())
.zip(self.zs.iter())
.map(|((x, y), z)| x * x + y * y + z * z)
.collect()
}
}
// ── Prefetch-friendly sequential scan helpers ───────────────────────────────
/// Running sum over a slice β the canonical sequential scan.
///
/// Produces a prefix-sum array. Sequential read + sequential write:
/// the prefetcher handles both streams perfectly.
pub fn prefix_sum(data: &[f32]) -> Vec<f32> {
data.iter()
.scan(0.0_f32, |acc, &x| {
*acc += x;
Some(*acc)
})
.collect()
}
/// Gather operation (random read): indices may jump anywhere in `src`.
///
/// Classic cache-unfriendly pattern β each `indices[i]` may point to
/// a different cache line in `src`.
pub fn gather(src: &[f32], indices: &[usize]) -> Vec<f32> {
indices.iter().map(|&i| src[i]).collect()
}
/// Scatter operation (random write): inverse cache problem on the write side.
pub fn scatter(dst: &mut [f32], indices: &[usize], values: &[f32]) {
indices
.iter()
.zip(values.iter())
.for_each(|(&i, &v)| dst[i] = v);
}
// ─────────────────────────────────────────────────────────────────────────────
#[cfg(test)]
mod tests {
use super::*;
// ── Matrix correctness ──────────────────────────────────────────────────
#[test]
fn row_and_col_sums_are_equal() {
// A 4×4 matrix filled with 1.0 → both access orders must give 16.0.
let m = Matrix::new(4, 4, 1.0);
assert_eq!(m.sum_row_major(), 16.0);
assert_eq!(m.sum_col_major(), 16.0);
}
#[test]
fn from_fn_and_get_round_trip() {
let m = Matrix::from_fn(3, 3, |r, c| (r * 3 + c) as f32);
assert_eq!(m.get(0, 0), 0.0);
assert_eq!(m.get(1, 1), 4.0);
assert_eq!(m.get(2, 2), 8.0);
}
#[test]
fn transpose_tiled_matches_naive() {
let m = Matrix::from_fn(8, 12, |r, c| (r * 12 + c) as f32);
let t_naive = m.transpose_naive();
let t_tiled = m.transpose_tiled(4);
assert_eq!(t_tiled.rows, 12);
assert_eq!(t_tiled.cols, 8);
for r in 0..t_naive.rows {
for c in 0..t_naive.cols {
assert_eq!(
t_naive.get(r, c),
t_tiled.get(r, c),
"mismatch at ({r},{c})"
);
}
}
}
#[test]
fn rows_iter_yields_correct_slices() {
let m = Matrix::from_fn(3, 4, |r, c| (r * 4 + c) as f32);
let rows: Vec<&[f32]> = m.rows_iter().collect();
assert_eq!(rows.len(), 3);
assert_eq!(rows[0], &[0.0, 1.0, 2.0, 3.0]);
assert_eq!(rows[1], &[4.0, 5.0, 6.0, 7.0]);
assert_eq!(rows[2], &[8.0, 9.0, 10.0, 11.0]);
}
// ── SoA correctness ─────────────────────────────────────────────────────
#[test]
fn soa_sum_x_is_correct() {
let mut soa = Vec3SoA::new(4);
soa.xs = vec![1.0, 2.0, 3.0, 4.0];
assert_eq!(soa.sum_x(), 10.0);
}
#[test]
fn soa_magnitudes_sq_unit_vectors() {
let mut soa = Vec3SoA::new(2);
soa.xs = vec![1.0, 0.0];
soa.ys = vec![0.0, 1.0];
soa.zs = vec![0.0, 0.0];
let mags = soa.magnitudes_sq();
assert_eq!(mags, vec![1.0, 1.0]);
}
// ── Sequential helpers ──────────────────────────────────────────────────
#[test]
fn prefix_sum_basic() {
let data = vec![1.0, 2.0, 3.0, 4.0];
let ps = prefix_sum(&data);
assert_eq!(ps, vec![1.0, 3.0, 6.0, 10.0]);
}
#[test]
fn prefix_sum_empty() {
assert_eq!(prefix_sum(&[]), Vec::<f32>::new());
}
#[test]
fn gather_and_scatter_round_trip() {
let src = vec![10.0_f32, 20.0, 30.0, 40.0];
let indices = vec![3, 1, 0, 2];
let gathered = gather(&src, &indices);
assert_eq!(gathered, vec![40.0, 20.0, 10.0, 30.0]);
let mut dst = vec![0.0_f32; 4];
scatter(&mut dst, &indices, &gathered);
// dst[3]=40, dst[1]=20, dst[0]=10, dst[2]=30 → restores the original order
assert_eq!(dst, vec![10.0, 20.0, 30.0, 40.0]);
}
// ── Large-matrix sum consistency (catches index formula bugs) ───────────
#[test]
fn large_matrix_row_col_sum_agree() {
let m = Matrix::from_fn(32, 64, |r, c| (r + c) as f32);
let row = m.sum_row_major();
let col = m.sum_col_major();
assert!(
(row - col).abs() < 1e-3,
"row={row} col={col} differ by more than epsilon"
);
}
}
Deep Comparison
OCaml vs Rust: Cache-Friendly Iteration and Data Access Patterns
Side-by-Side Code
OCaml – flat matrix, row-major sum
type matrix = { data: float array; rows: int; cols: int }
let make_matrix rows cols init =
{ data = Array.init (rows * cols) init; rows; cols }
let get m r c = m.data.(r * m.cols + c)
(* Row-major: sequential → cache-friendly *)
let sum_row_major m = Array.fold_left (+.) 0.0 m.data
(* Column-major: stride = cols → cache-unfriendly *)
let sum_col_major m =
let acc = ref 0.0 in
for c = 0 to m.cols - 1 do
for r = 0 to m.rows - 1 do
acc := !acc +. get m r c
done
done;
!acc
Rust – flat matrix, row-major sum (idiomatic)
pub struct Matrix {
data: Vec<f32>,
pub rows: usize,
pub cols: usize,
}
impl Matrix {
// Row-major sum: iterates the backing Vec sequentially → optimal
pub fn sum_row_major(&self) -> f32 {
self.data.iter().sum()
}
// Column-major: stride = cols * sizeof(f32) bytes → cache-unfriendly
pub fn sum_col_major(&self) -> f32 {
(0..self.cols)
.flat_map(|c| (0..self.rows).map(move |r| (r, c)))
.map(|(r, c)| self.get(r, c))
.sum()
}
}
Rust – tiled transpose (cache-friendly write path)
pub fn transpose_tiled(&self, tile: usize) -> Self {
let mut out = Self::new(self.cols, self.rows, 0.0);
for r_base in (0..self.rows).step_by(tile) {
for c_base in (0..self.cols).step_by(tile) {
let r_end = (r_base + tile).min(self.rows);
let c_end = (c_base + tile).min(self.cols);
for r in r_base..r_end {
for c in c_base..c_end {
out.set(c, r, self.get(r, c));
}
}
}
}
out
}
Rust – Structure of Arrays (SoA)
pub struct Vec3SoA {
pub xs: Vec<f32>,
pub ys: Vec<f32>,
pub zs: Vec<f32>,
}
impl Vec3SoA {
// Only xs are loaded into cache → ys and zs never touched
pub fn sum_x(&self) -> f32 { self.xs.iter().sum() }
// Three sequential passes, each over one contiguous slice
pub fn magnitudes_sq(&self) -> Vec<f32> {
self.xs.iter().zip(self.ys.iter()).zip(self.zs.iter())
.map(|((x, y), z)| x * x + y * y + z * z)
.collect()
}
}
Type Signatures
| Concept | OCaml | Rust |
|---|---|---|
| Flat matrix | { data: float array; rows: int; cols: int } | struct Matrix { data: Vec<f32>, rows: usize, cols: usize } |
| Element access | m.data.(r * m.cols + c) | self.data[r * self.cols + c] |
| Row-major sum | Array.fold_left (+.) 0.0 m.data | self.data.iter().sum() |
| Prefix scan | Array.fold_left with accumulator | Iterator::scan |
| Stride type | int (unboxed int) | usize (pointer-sized) |
Key Insights
- **Contiguous layout**: OCaml float array and Rust Vec<f32> both store elements contiguously in a single heap allocation. The row-major index formula r * cols + c is the same in both languages.
- **Cache lines are 64 bytes = 16 f32s regardless of language**: row-major iteration is fast in both OCaml and Rust because it is sequential – the hardware prefetcher loads the next cache line speculatively. Column-major access jumps cols * 4 bytes between elements, causing a cache miss on each access once the matrix exceeds L1 capacity (~32 KB).
- **float array array vs flat Vec**: OCaml's nested float array array is an array of pointers to individual row allocations – two pointer dereferences per access and potentially non-contiguous rows. Rust's Vec<Vec<f32>> has the same problem. The flat Vec<f32> with manual index arithmetic eliminates the extra indirection entirely.
- **Tiling keeps both sides of the transpose hot**: processing TILE × TILE sub-blocks ensures both the read tile and the write tile fit in L1 cache simultaneously, maintaining ~L1 bandwidth for both sides. A reasonable starting tile size is sqrt(L1_size / sizeof(element)).
- **SoA loads only the needed Vec into cache**: AoS forces the CPU to load all fields even if only one is used, wasting cache bandwidth by a factor of struct_size / field_size.
- **Iterator::scan is the idiomatic prefix-sum**: OCaml uses Array.fold_left with a ref accumulator; Rust's scan threads the state through the iterator chain without mutation at the call site. Both produce sequential memory access – the access pattern is what matters for cache performance, not the syntactic form.
- **Auto-vectorisation needs stride-1**: LLVM emits SIMD (e.g. 8 f32s per cycle) only when the data is contiguous and the access stride is 1. data.iter().sum() meets this condition; sum_col_major does not, so it cannot be vectorised.
When to Use Each Style
Use row-major iteration when processing all elements or all elements of a row β this is the fast path and should be the default for any flat matrix.
Use tiled transposition when you must transpose a large matrix and performance matters; the tile size should be tuned to the target machine's L1 cache size (typically 4–16 elements per dimension for f32).
Use SoA layout when hot loops operate on a subset of fields from a collection of records (e.g., physics simulation updating only positions, or a renderer reading only normals).
Use gather/scatter sparingly β random-access patterns defeat the hardware prefetcher. Where possible, sort work by memory address or use sequential layouts instead.
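The last point – sorting work by memory address before a gather – can be sketched as follows. The function `gather_sorted` and its name are my own illustration, not an API from the listing above:

```rust
/// Gather `src[i]` for each requested index, but visit the indices in
/// ascending address order so the reads sweep `src` front to back.
/// Each value is tagged with its position in the request so the output
/// still comes back in the order the caller asked for.
fn gather_sorted(src: &[f32], indices: &[usize]) -> Vec<f32> {
    // Pair each source index with its position in the request.
    let mut order: Vec<(usize, usize)> = indices
        .iter()
        .copied()
        .enumerate()
        .map(|(pos, idx)| (idx, pos))
        .collect();
    // Sort by source address: the actual reads become near-sequential.
    order.sort_unstable_by_key(|&(idx, _)| idx);

    let mut out = vec![0.0; indices.len()];
    for (idx, pos) in order {
        out[pos] = src[idx];
    }
    out
}

fn main() {
    let src: Vec<f32> = (0..8).map(|i| i as f32 * 10.0).collect();
    let indices = [5, 0, 7, 2];
    // Same result as a plain gather, but src is read in address order.
    assert_eq!(gather_sorted(&src, &indices), vec![50.0, 0.0, 70.0, 20.0]);
}
```

The trade-off: the writes into `out` become scattered instead, so this only pays off when `src` is much larger than the index set (the usual case for gathers over big tables).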
Exercises
Benchmark sum_row_major against sum_col_major with criterion. Plot cache-miss counts using perf stat.
Tune the transpose tile size to your machine's L1 cache (check getconf LEVEL1_DCACHE_SIZE).
Write a TiledIter<T> that yields (row_start, col_start, tile_data) for any 2D slice, abstracting the tiling logic.
Block a matrix-vector product's row-range loop to keep the x vector in L2 cache.
Compare sum on a matrix stored as Vec<Vec<f64>> vs flat Vec<f64> to quantify the cost of pointer indirection in addition to stride.
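The last exercise – comparing a nested Vec<Vec<f64>> with a flat Vec<f64> – can be started from this std-only sketch. Instant timings are noisy; criterion gives sturdier numbers, but the ordering is usually visible even in a single release-mode run:

```rust
use std::time::Instant;

fn main() {
    let (rows, cols) = (1024, 1024);

    // Nested layout: one heap allocation per row; rows need not be contiguous,
    // and each row access costs an extra pointer dereference.
    let nested: Vec<Vec<f64>> = (0..rows).map(|r| vec![r as f64; cols]).collect();
    // Flat layout: a single contiguous allocation.
    let flat: Vec<f64> = (0..rows).flat_map(|r| vec![r as f64; cols]).collect();

    let t = Instant::now();
    let nested_sum: f64 = nested.iter().map(|row| row.iter().sum::<f64>()).sum();
    let nested_time = t.elapsed();

    let t = Instant::now();
    let flat_sum: f64 = flat.iter().sum();
    let flat_time = t.elapsed();

    // Same elements, same order → identical f64 sums.
    assert_eq!(nested_sum, flat_sum);
    println!("nested: {nested_time:?}, flat: {flat_time:?}");
}
```

To measure indirection plus stride, add a column-major pass over both layouts and compare all four timings.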