720 Fundamental

Cache-Friendly Iteration

Functional Programming

Tutorial

The Problem

A CPU cache operates on 64-byte cache lines. When a program accesses memory sequentially (stride-1 access), the hardware prefetcher loads ahead and nearly every access hits cache. When a program accesses memory with large strides—skipping N bytes between each element—the prefetcher cannot keep up and every access is a cache miss costing 100–300 cycles on modern hardware.

Matrix transposition is the canonical example: iterating row-major over a row-major matrix is fast; iterating column-major is slow because consecutive column elements are separated by cols * sizeof(element) bytes. For a 1024×1024 float matrix, column-major access produces ~1M cache misses vs ~65K for row-major (one miss per 16-float cache line)—roughly a 16× difference. Tiled (blocked) transposition avoids the worst-case miss pattern by fitting tiles into L1/L2 cache, enabling sequential reads and writes within each tile.
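The gap is easy to reproduce with nothing but the standard library. A minimal sketch (the dimensions and the 1.0 fill are arbitrary choices; compile with --release, since debug-build bounds-check overhead can mask the effect):

```rust
use std::time::Instant;

fn main() {
    // 1024 x 1024 f32s = 4 MiB, comfortably larger than L1/L2.
    let (rows, cols) = (1024usize, 1024usize);
    let data = vec![1.0_f32; rows * cols];

    // Row-major: stride-1 over the backing Vec — prefetcher-friendly.
    let t = Instant::now();
    let row_sum: f32 = data.iter().sum();
    let row_time = t.elapsed();

    // Column-major: successive accesses are `cols` elements (4096 bytes) apart.
    let t = Instant::now();
    let mut col_sum = 0.0_f32;
    for c in 0..cols {
        for r in 0..rows {
            col_sum += data[r * cols + c];
        }
    }
    let col_time = t.elapsed();

    // Same result either way; only the access order (and the timing) differs.
    assert_eq!(row_sum, col_sum);
    println!("row-major: {row_time:?}, column-major: {col_time:?}");
}
```

With an all-ones fill both sums are exactly 1048576.0, so the assertion holds regardless of summation order.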

🎯 Learning Outcomes

  • Explain the relationship between access stride, cache line size, and miss rate
  • Implement row-major vs column-major matrix iteration and measure the gap
  • Apply loop tiling/blocking to fit working sets into cache
  • Use cargo bench / criterion to attribute performance differences to memory access
  • Reason about the interaction between auto-vectorization and access patterns

    Code Example

    pub struct Matrix {
        data: Vec<f32>,
        pub rows: usize,
        pub cols: usize,
    }
    
    impl Matrix {
        // Row-major sum: iterates the backing Vec sequentially — optimal
        pub fn sum_row_major(&self) -> f32 {
            self.data.iter().sum()
        }
    
        // Column-major: stride = cols * sizeof(f32) bytes — cache-unfriendly
        pub fn sum_col_major(&self) -> f32 {
            (0..self.cols)
                .flat_map(|c| (0..self.rows).map(move |r| (r, c)))
                .map(|(r, c)| self.get(r, c))
                .sum()
        }
    }

    Key Differences

    | Aspect | Rust | OCaml |
    | --- | --- | --- |
    | Float array layout | Unboxed Vec<f64>, row-major | Flat float array or Bigarray |
    | Auto-vectorization | LLVM vectorizes stride-1 loops | Limited; needs Bigarray + hints |
    | Tiling strategy | Nested loops with step_by | Same approach, no stdlib support |
    | Cache profiling | perf stat -e cache-misses | Same external tools |
    | Abstraction cost | Zero-cost iterators | Closure calls add minor overhead |

    Both languages expose identical cache-miss pathologies. Rust's LLVM backend is more likely to auto-vectorize the tiled inner loop.

    OCaml Approach

    OCaml arrays are flat and unboxed for float array, so the access pattern issues are identical. The standard library offers Bigarray for C-layout (row-major) or Fortran-layout (column-major) control:

    open Bigarray
    
    let make_matrix rows cols =
      Array2.create float64 c_layout rows cols
    
    (* Row-major: inner loop over columns — cache-friendly *)
    let sum_row_major mat =
      let rows = Array2.dim1 mat and cols = Array2.dim2 mat in
      let acc = ref 0.0 in
      for r = 0 to rows - 1 do
        for c = 0 to cols - 1 do
          acc := !acc +. mat.{r, c}
        done
      done;
      !acc
    

    OCaml's optimizer does not auto-vectorize as aggressively as LLVM/Clang; SIMD benefits require manual Bigarray with Lacaml or Owl.

    Full Source

    #![allow(clippy::all)]
    //! 720: Cache-Friendly Iteration and Data Access Patterns
    //!
    //! Demonstrates row-major vs column-major access, tiled transposition,
    //! and how sequential iterator chaining produces cache-friendly, auto-vectorisable code.
    //!
    //! Key insight: a CPU cache line is 64 bytes = 16 f32s. Sequential access lets
    //! the hardware prefetcher load the next cache line before you need it.
    //! Column-major access on a row-major matrix jumps `cols * 4` bytes between
    //! elements — each access is a potential cache miss.
    
    // ── Flat row-major matrix ────────────────────────────────────────────────────
    
    /// A flat, row-major 2-D matrix backed by a single `Vec<f32>`.
    ///
    /// Element `(r, c)` lives at `data[r * cols + c]`, so iterating over a
    /// row is sequential — identical to iterating over a slice.
    pub struct Matrix {
        data: Vec<f32>,
        pub rows: usize,
        pub cols: usize,
    }
    
    impl Matrix {
        /// Allocate a matrix filled with `init`.
        pub fn new(rows: usize, cols: usize, init: f32) -> Self {
            Self {
                data: vec![init; rows * cols],
                rows,
                cols,
            }
        }
    
        /// Allocate a matrix where element `(r, c)` is computed by `f(r, c)`.
        pub fn from_fn(rows: usize, cols: usize, f: impl Fn(usize, usize) -> f32) -> Self {
            // Sequential nested push: identical memory access order to the flat_map
            // version, but avoids closure ownership issues when `f` is not `Copy`.
            let mut data = Vec::with_capacity(rows * cols);
            for r in 0..rows {
                for c in 0..cols {
                    data.push(f(r, c));
                }
            }
            Self { data, rows, cols }
        }
    
        #[inline]
        pub fn get(&self, r: usize, c: usize) -> f32 {
            self.data[r * self.cols + c]
        }
    
        #[inline]
        pub fn set(&mut self, r: usize, c: usize, v: f32) {
            self.data[r * self.cols + c] = v;
        }
    
        /// Sum all elements: iterates the backing `Vec` sequentially — optimal.
        ///
        /// The compiler can auto-vectorise this into SIMD because the memory
        /// layout is contiguous and the access pattern is fully predictable.
        pub fn sum_row_major(&self) -> f32 {
            self.data.iter().sum()
        }
    
        /// Sum all elements column-by-column (cache-unfriendly reference path).
        ///
        /// Between two successive accesses the stride is `cols * sizeof(f32)` bytes.
        /// On a 1024-column matrix that is 4096 bytes — 64 cache lines skipped.
        /// Every access is likely a cache miss once the matrix exceeds L1 capacity.
        pub fn sum_col_major(&self) -> f32 {
            (0..self.cols)
                .flat_map(|c| (0..self.rows).map(move |r| (r, c)))
                .map(|(r, c)| self.get(r, c))
                .sum()
        }
    
        /// Row iterator: yields `&[f32]` slices — zero overhead, sequential.
        pub fn rows_iter(&self) -> impl Iterator<Item = &[f32]> {
            self.data.chunks(self.cols)
        }
    
        /// Transpose using tiled (blocked) access.
        ///
        /// A naïve transpose writes column-by-column into the output (cache-unfriendly
        /// on writes). Tiling processes `TILE × TILE` sub-blocks that fit in L1 cache,
        /// so both reads and writes stay warm.
        pub fn transpose_tiled(&self, tile: usize) -> Self {
            let mut out = Self::new(self.cols, self.rows, 0.0);
            for r_base in (0..self.rows).step_by(tile) {
                for c_base in (0..self.cols).step_by(tile) {
                    let r_end = (r_base + tile).min(self.rows);
                    let c_end = (c_base + tile).min(self.cols);
                    for r in r_base..r_end {
                        for c in c_base..c_end {
                            out.set(c, r, self.get(r, c));
                        }
                    }
                }
            }
            out
        }
    
        /// Naïve (non-tiled) transpose for comparison.
        pub fn transpose_naive(&self) -> Self {
            Self::from_fn(self.cols, self.rows, |r, c| self.get(c, r))
        }
    }
    
    // ── Structure of Arrays (SoA) vs Array of Structures (AoS) ──────────────────
    
    /// Array of Structures — common OOP layout, cache-unfriendly for field-only ops.
    ///
    /// If you only need `x` values, you still load `y` and `z` into cache.
    pub struct Vec3Aos {
        pub x: f32,
        pub y: f32,
        pub z: f32,
    }
    
    /// Structure of Arrays — all `x` values are contiguous.
    ///
    /// A loop over `xs` touches only the `x` cache lines; `ys` and `zs` are never
    /// loaded. SIMD auto-vectorisation is straightforward because the data is already
    /// in the right layout for `_mm256_add_ps` etc.
    pub struct Vec3SoA {
        pub xs: Vec<f32>,
        pub ys: Vec<f32>,
        pub zs: Vec<f32>,
    }
    
    impl Vec3SoA {
        pub fn new(n: usize) -> Self {
            Self {
                xs: vec![0.0; n],
                ys: vec![0.0; n],
                zs: vec![0.0; n],
            }
        }
    
        /// Sum only the x-components: touches one contiguous Vec — minimal cache pressure.
        pub fn sum_x(&self) -> f32 {
            self.xs.iter().sum()
        }
    
        /// Compute squared magnitudes for all vectors.
        ///
        /// Three sequential passes, each over a single contiguous slice.
        /// The compiler can vectorise each `.zip` independently.
        pub fn magnitudes_sq(&self) -> Vec<f32> {
            self.xs
                .iter()
                .zip(self.ys.iter())
                .zip(self.zs.iter())
                .map(|((x, y), z)| x * x + y * y + z * z)
                .collect()
        }
    }
    
    // ── Prefetch-friendly sequential scan helpers ────────────────────────────────
    
    /// Running sum over a slice — the canonical sequential scan.
    ///
    /// Produces a prefix-sum array. Sequential read + sequential write:
    /// the prefetcher handles both streams perfectly.
    pub fn prefix_sum(data: &[f32]) -> Vec<f32> {
        data.iter()
            .scan(0.0_f32, |acc, &x| {
                *acc += x;
                Some(*acc)
            })
            .collect()
    }
    
    /// Gather operation (random read): indices may jump anywhere in `src`.
    ///
    /// Classic cache-unfriendly pattern — each `indices[i]` may point to
    /// a different cache line in `src`.
    pub fn gather(src: &[f32], indices: &[usize]) -> Vec<f32> {
        indices.iter().map(|&i| src[i]).collect()
    }
    
    /// Scatter operation (random write): inverse cache problem on the write side.
    pub fn scatter(dst: &mut [f32], indices: &[usize], values: &[f32]) {
        indices
            .iter()
            .zip(values.iter())
            .for_each(|(&i, &v)| dst[i] = v);
    }
    
    // ─────────────────────────────────────────────────────────────────────────────
    #[cfg(test)]
    mod tests {
        use super::*;
    
        // ── Matrix correctness ──────────────────────────────────────────────────
    
        #[test]
        fn row_and_col_sums_are_equal() {
            // A 4×4 matrix filled with 1.0 — both access orders must give 16.0.
            let m = Matrix::new(4, 4, 1.0);
            assert_eq!(m.sum_row_major(), 16.0);
            assert_eq!(m.sum_col_major(), 16.0);
        }
    
        #[test]
        fn from_fn_and_get_round_trip() {
            let m = Matrix::from_fn(3, 3, |r, c| (r * 3 + c) as f32);
            assert_eq!(m.get(0, 0), 0.0);
            assert_eq!(m.get(1, 1), 4.0);
            assert_eq!(m.get(2, 2), 8.0);
        }
    
        #[test]
        fn transpose_tiled_matches_naive() {
            let m = Matrix::from_fn(8, 12, |r, c| (r * 12 + c) as f32);
            let t_naive = m.transpose_naive();
            let t_tiled = m.transpose_tiled(4);
    
            assert_eq!(t_tiled.rows, 12);
            assert_eq!(t_tiled.cols, 8);
    
            for r in 0..t_naive.rows {
                for c in 0..t_naive.cols {
                    assert_eq!(
                        t_naive.get(r, c),
                        t_tiled.get(r, c),
                        "mismatch at ({r},{c})"
                    );
                }
            }
        }
    
        #[test]
        fn rows_iter_yields_correct_slices() {
            let m = Matrix::from_fn(3, 4, |r, c| (r * 4 + c) as f32);
            let rows: Vec<&[f32]> = m.rows_iter().collect();
            assert_eq!(rows.len(), 3);
            assert_eq!(rows[0], &[0.0, 1.0, 2.0, 3.0]);
            assert_eq!(rows[1], &[4.0, 5.0, 6.0, 7.0]);
            assert_eq!(rows[2], &[8.0, 9.0, 10.0, 11.0]);
        }
    
        // ── SoA correctness ─────────────────────────────────────────────────────
    
        #[test]
        fn soa_sum_x_is_correct() {
            let mut soa = Vec3SoA::new(4);
            soa.xs = vec![1.0, 2.0, 3.0, 4.0];
            assert_eq!(soa.sum_x(), 10.0);
        }
    
        #[test]
        fn soa_magnitudes_sq_unit_vectors() {
            let mut soa = Vec3SoA::new(2);
            soa.xs = vec![1.0, 0.0];
            soa.ys = vec![0.0, 1.0];
            soa.zs = vec![0.0, 0.0];
            let mags = soa.magnitudes_sq();
            assert_eq!(mags, vec![1.0, 1.0]);
        }
    
        // ── Sequential helpers ──────────────────────────────────────────────────
    
        #[test]
        fn prefix_sum_basic() {
            let data = vec![1.0, 2.0, 3.0, 4.0];
            let ps = prefix_sum(&data);
            assert_eq!(ps, vec![1.0, 3.0, 6.0, 10.0]);
        }
    
        #[test]
        fn prefix_sum_empty() {
            assert_eq!(prefix_sum(&[]), Vec::<f32>::new());
        }
    
        #[test]
        fn gather_and_scatter_round_trip() {
            let src = vec![10.0_f32, 20.0, 30.0, 40.0];
            let indices = vec![3, 1, 0, 2];
            let gathered = gather(&src, &indices);
            assert_eq!(gathered, vec![40.0, 20.0, 10.0, 30.0]);
    
            let mut dst = vec![0.0_f32; 4];
            scatter(&mut dst, &indices, &gathered);
            // dst[3]=40, dst[1]=20, dst[0]=10, dst[2]=30
            assert_eq!(dst, vec![10.0, 20.0, 30.0, 40.0]);
        }
    
        // ── Large-matrix sum consistency (catches index formula bugs) ───────────
    
        #[test]
        fn large_matrix_row_col_sum_agree() {
            let m = Matrix::from_fn(32, 64, |r, c| (r + c) as f32);
            let row = m.sum_row_major();
            let col = m.sum_col_major();
            assert!(
                (row - col).abs() < 1e-3,
                "row={row} col={col} differ by more than epsilon"
            );
        }
    }

    Deep Comparison

    OCaml vs Rust: Cache-Friendly Iteration and Data Access Patterns

    Side-by-Side Code

    OCaml — flat matrix, row-major sum

    type matrix = { data: float array; rows: int; cols: int }
    
    let make_matrix rows cols init =
      { data = Array.init (rows * cols) init; rows; cols }
    
    let get m r c = m.data.(r * m.cols + c)
    
    (* Row-major: sequential — cache-friendly *)
    let sum_row_major m = Array.fold_left (+.) 0.0 m.data
    
    (* Column-major: stride = cols — cache-unfriendly *)
    let sum_col_major m =
      let acc = ref 0.0 in
      for c = 0 to m.cols - 1 do
        for r = 0 to m.rows - 1 do
          acc := !acc +. get m r c
        done
      done;
      !acc
    

    Rust — flat matrix, row-major sum (idiomatic)

    pub struct Matrix {
        data: Vec<f32>,
        pub rows: usize,
        pub cols: usize,
    }
    
    impl Matrix {
        // Row-major sum: iterates the backing Vec sequentially — optimal
        pub fn sum_row_major(&self) -> f32 {
            self.data.iter().sum()
        }
    
        // Column-major: stride = cols * sizeof(f32) bytes — cache-unfriendly
        pub fn sum_col_major(&self) -> f32 {
            (0..self.cols)
                .flat_map(|c| (0..self.rows).map(move |r| (r, c)))
                .map(|(r, c)| self.get(r, c))
                .sum()
        }
    }
    

    Rust — tiled transpose (cache-friendly write path)

    pub fn transpose_tiled(&self, tile: usize) -> Self {
        let mut out = Self::new(self.cols, self.rows, 0.0);
        for r_base in (0..self.rows).step_by(tile) {
            for c_base in (0..self.cols).step_by(tile) {
                let r_end = (r_base + tile).min(self.rows);
                let c_end = (c_base + tile).min(self.cols);
                for r in r_base..r_end {
                    for c in c_base..c_end {
                        out.set(c, r, self.get(r, c));
                    }
                }
            }
        }
        out
    }
    

    Rust β€” Structure of Arrays (SoA)

    pub struct Vec3SoA {
        pub xs: Vec<f32>,
        pub ys: Vec<f32>,
        pub zs: Vec<f32>,
    }
    
    impl Vec3SoA {
        // Only xs are loaded into cache — ys and zs never touched
        pub fn sum_x(&self) -> f32 { self.xs.iter().sum() }
    
        // Three sequential passes, each over one contiguous slice
        pub fn magnitudes_sq(&self) -> Vec<f32> {
            self.xs.iter().zip(self.ys.iter()).zip(self.zs.iter())
                .map(|((x, y), z)| x * x + y * y + z * z)
                .collect()
        }
    }
    

    Type Signatures

    | Concept | OCaml | Rust |
    | --- | --- | --- |
    | Flat matrix | { data: float array; rows: int; cols: int } | struct Matrix { data: Vec<f32>, rows: usize, cols: usize } |
    | Element access | m.data.(r * m.cols + c) | self.data[r * self.cols + c] |
    | Row-major sum | Array.fold_left (+.) 0.0 m.data | self.data.iter().sum() |
    | Prefix scan | Array.fold_left with accumulator | Iterator::scan |
    | Stride type | int (unboxed) | usize (pointer-sized) |

    Key Insights

  • Memory layout is identical: OCaml float array and Rust Vec<f32> both store elements contiguously in a single heap allocation. The row-major index formula r * cols + c is the same in both languages.
  • Cache line arithmetic is language-agnostic: A cache line is 64 bytes = 16 f32s regardless of language. Row-major iteration is fast in both OCaml and Rust because it is sequential — the hardware prefetcher loads the next cache line speculatively. Column-major access jumps cols * 4 bytes between elements, causing a cache miss on each access once the matrix exceeds L1 capacity (~32 KB).
  • float array array vs flat Vec: OCaml's nested float array array is an array of pointers to individual row allocations — two pointer dereferences per access and potentially non-contiguous rows. Rust's Vec<Vec<f32>> has the same problem. A flat Vec<f32> with manual index arithmetic eliminates the extra indirection entirely.
  • Tiling (blocking) beats algorithmic cleverness: A naïve transpose iterates input rows sequentially but writes output columns — cache-unfriendly on writes. Tiling TILE × TILE sub-blocks ensures both the read tile and the write tile fit in L1 cache simultaneously, maintaining ~L1 bandwidth on both sides. A good starting tile size is on the order of sqrt(L1_size / sizeof(element)).
  • Structure of Arrays (SoA) vs Array of Structures (AoS): When only one field of a record is needed in a hot loop, an SoA layout loads only the relevant Vec into cache. AoS forces the CPU to load all fields even if only one is used, wasting cache bandwidth by a factor of struct_size / field_size.
  • Iterator::scan is the idiomatic prefix sum: OCaml uses Array.fold_left with a ref accumulator; Rust's scan threads the state through the iterator chain without mutation at the call site. Both produce sequential memory access — the access pattern is what matters for cache performance, not the syntactic form.
  • Auto-vectorisation requires sequential layout: The LLVM backend can emit SIMD instructions (AVX2 processes 8 f32s per instruction) only when the data is contiguous and the access stride is 1. data.iter().sum() meets this condition; sum_col_major does not, so it cannot be vectorised.

    When to Use Each Style

    Use row-major iteration when processing all elements or all elements of a row β€” this is the fast path and should be the default for any flat matrix.

    Use tiled transposition when you must transpose a large matrix and performance matters; the tile size should be tuned to the target machine's L1 cache size (often 16–64 elements per dimension for f32, so that both tiles fit in L1 and each tile row spans at least one cache line).
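That tuning advice can be turned into a tiny starting-point calculation. A sketch assuming a 32 KiB L1 data cache (the function name and the factor of two for keeping both the read and write tiles resident are this sketch's choices, not from the text):

```rust
/// Rough starting tile size: two tile*tile blocks (the read tile and the
/// write tile) should fit in the L1 data cache at once.
fn suggested_tile(l1_bytes: usize, elem_size: usize) -> usize {
    let elems_per_tile_pair = l1_bytes / (2 * elem_size);
    (elems_per_tile_pair as f64).sqrt() as usize // elements per tile dimension
}

fn main() {
    // 32 KiB is an assumed L1 size; check `getconf LEVEL1_DCACHE_SIZE`.
    let tile = suggested_tile(32 * 1024, std::mem::size_of::<f32>());
    println!("benchmark tile sizes around {tile}"); // 64 under these assumptions
}
```

Treat the result as the centre of a benchmark sweep (e.g. 16, 32, 64, 128), not a final answer: TLB behaviour and associativity shift the optimum per machine.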

    Use SoA layout when hot loops operate on a subset of fields from a collection of records (e.g., physics simulation updating only positions, or a renderer reading only normals).

    Use gather/scatter sparingly — random-access patterns defeat the hardware prefetcher. Where possible, sort work by memory address or use sequential layouts instead.
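When the output order is flexible or can be restored afterwards, "sort work by memory address" can mean visiting the indices in ascending order and writing each result back to its original slot. A sketch of that idea (gather_sorted is an illustrative name, not part of the code above):

```rust
fn gather(src: &[f32], indices: &[usize]) -> Vec<f32> {
    indices.iter().map(|&i| src[i]).collect()
}

/// Same result as `gather`, but reads `src` in ascending address order so the
/// prefetcher sees a (mostly) sequential stream; only the writes are scattered.
fn gather_sorted(src: &[f32], indices: &[usize]) -> Vec<f32> {
    let mut order: Vec<usize> = (0..indices.len()).collect();
    order.sort_unstable_by_key(|&k| indices[k]); // visit src in address order
    let mut out = vec![0.0_f32; indices.len()];
    for k in order {
        out[k] = src[indices[k]];
    }
    out
}

fn main() {
    let src: Vec<f32> = (0..8).map(|i| i as f32 * 10.0).collect();
    let idx = [5, 0, 3, 1];
    assert_eq!(gather(&src, &idx), gather_sorted(&src, &idx));
}
```

The sort is only worthwhile when `src` is large and the index list is reused or long; for small working sets the plain gather wins.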

    Exercises

  • Benchmark naive vs tiled transposition for 256×256, 512×512, and 1024×1024 matrices with criterion. Plot cache-miss counts using perf stat.

  • Experiment with tile sizes (8, 16, 32, 64) and find the optimal size for your CPU's L1 cache (check getconf LEVEL1_DCACHE_SIZE).

  • Implement a generic TiledIter<T> that yields (row_start, col_start, tile_data) for any 2D slice, abstracting the tiling logic.

  • Apply tiling to sparse matrix-vector multiplication: given a CSR matrix, tile the row-range loop to keep the x vector in L2 cache.

  • Compare row-major vs column-major sum on a matrix stored as Vec<Vec<f64>> vs flat Vec<f64> to quantify the cost of pointer indirection in addition to stride.

    Open Source Repos