727. Portable SIMD Concepts

Fundamental · Functional Programming · Tutorial

The Problem

Single Instruction, Multiple Data (SIMD) instructions process N data elements with one CPU operation. SSE processes 4 floats simultaneously; AVX2 processes 8; AVX-512 processes 16. A scalar loop that multiplies 1 million floats executes ~1M multiply instructions; an AVX2 loop executes ~125K, an 8× reduction in instruction count alone. SIMD is essential in signal processing (FFT, FIR filters), image processing (convolution, color conversion), machine learning (matrix multiply, softmax), physics (N-body, fluid simulation), and cryptography (AES-NI, polynomial hashing).
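The instruction-count arithmetic can be checked directly. This sketch (the `multiply_iterations` helper is illustrative, not from any library) counts loop iterations as a proxy for multiply instructions, assuming an 8-lane AVX2-class target:

```rust
const LANES: usize = 8;

/// Count loop iterations (a proxy for multiply-instruction count) for a
/// scalar loop vs. an 8-wide lane-parallel loop over `n` elements.
fn multiply_iterations(n: usize) -> (usize, usize) {
    let a = vec![1.5f32; n];
    let b = vec![2.0f32; n];
    let mut out = vec![0.0f32; n];

    // Scalar: one multiply per element.
    let mut scalar_iters = 0;
    for i in 0..n {
        out[i] = a[i] * b[i];
        scalar_iters += 1;
    }

    // Lane-parallel: one 8-wide multiply (VMULPS on AVX2) per chunk.
    let mut simd_iters = 0;
    for (oc, (ac, bc)) in out
        .chunks_exact_mut(LANES)
        .zip(a.chunks_exact(LANES).zip(b.chunks_exact(LANES)))
    {
        for j in 0..LANES {
            oc[j] = ac[j] * bc[j];
        }
        simd_iters += 1;
    }
    (scalar_iters, simd_iters)
}

fn main() {
    let (scalar, simd) = multiply_iterations(1_000_000);
    println!("scalar: {scalar} iterations, 8-wide: {simd} iterations");
    assert_eq!(scalar / simd, 8); // 1M vs 125K
}
```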

The challenge: x86 SSE/AVX, ARM NEON, and the RISC-V V extension have different intrinsics, different vector widths, and different semantics. Code written against _mm256_add_ps is not portable. Rust's std::simd (portable SIMD, currently nightly-only) and the wide crate provide lane-generic vector types that compile to the best available intrinsics on each target.
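The lane-generic idea can be sketched on stable Rust with const generics: one function body instantiated at any width, which is the model that std::simd's Simd&lt;f32, N&gt; makes first-class. The `add_lanes` name below is illustrative, not a std::simd API:

```rust
/// One lane-wise add over any width N, written with plain arrays on
/// stable Rust; `add_lanes` is an illustrative name, not a real API.
/// LLVM typically lowers the fixed-length inner loop to vector adds.
fn add_lanes<const N: usize>(a: [f32; N], b: [f32; N]) -> [f32; N] {
    let mut r = [0.0f32; N];
    for i in 0..N {
        r[i] = a[i] + b[i];
    }
    r
}

fn main() {
    // The same body serves SSE width (4 lanes) and AVX2 width (8 lanes).
    assert_eq!(add_lanes([1.0; 4], [2.0; 4]), [3.0; 4]);
    assert_eq!(add_lanes([1.0; 8], [2.0; 8]), [3.0; 8]);
}
```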

🎯 Learning Outcomes

  • Understand SIMD lanes, vector width, and the lane-parallel execution model
  • Implement scalar algorithms in a lane-parallel style that mirrors SIMD semantics
  • Use std::simd::f32x4 / f32x8 portable vector types (nightly)
  • Apply chunks_exact to process arrays in SIMD-width batches with a scalar remainder
  • Recognize SIMD alignment requirements and when to use #[repr(align(32))]

    Code Example

    #![allow(clippy::all)]
    // 727. SIMD concepts with std::simd (portable_simd)
    //
    // This file demonstrates the portable_simd API using stable-compatible
    // scalar implementations that mirror the SIMD mental model exactly.
    // On nightly Rust, replace the scalar impls with std::simd types.
    //
    // To enable on nightly: add `#![feature(portable_simd)]` and use
    // `use std::simd::*;`
    //
    // The scalar versions below are structurally identical to the SIMD versions:
    // just swap `[f32; LANES]` for `f32xN` and loops for SIMD ops.
    
    // ── LANES constant: matches f32x8 ────────────────────────────────────────────
    
    const LANES: usize = 8;
    
    // ── Portable scalar "SIMD" types (mirrors std::simd API) ─────────────────────
    
    /// Simulates `std::simd::f32x8`: 8 f32 lanes.
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct F32x8([f32; LANES]);
    
    impl F32x8 {
        pub fn splat(v: f32) -> Self {
            Self([v; LANES])
        }
        pub fn from_array(a: [f32; LANES]) -> Self {
            Self(a)
        }
        pub fn to_array(self) -> [f32; LANES] {
            self.0
        }
    
        /// Lane-wise addition; compiles to VADDPS ymm on AVX2.
        pub fn add(self, rhs: Self) -> Self {
            let mut r = [0.0f32; LANES];
            for i in 0..LANES {
                r[i] = self.0[i] + rhs.0[i];
            }
            Self(r)
        }
    
        /// Lane-wise multiplication; compiles to VMULPS ymm on AVX2.
        pub fn mul(self, rhs: Self) -> Self {
            let mut r = [0.0f32; LANES];
            for i in 0..LANES {
                r[i] = self.0[i] * rhs.0[i];
            }
            Self(r)
        }
    
        /// Fused multiply-add: self * a + b; compiles to VFMADD213PS on AVX2+FMA.
        pub fn mul_add(self, a: Self, b: Self) -> Self {
            let mut r = [0.0f32; LANES];
            for i in 0..LANES {
                r[i] = self.0[i].mul_add(a.0[i], b.0[i]);
            }
            Self(r)
        }
    
        /// Horizontal sum reduction; compiles to `vhaddps` or a tree reduction.
        pub fn reduce_sum(self) -> f32 {
            self.0.iter().copied().sum()
        }
    
        /// Lane-wise max; compiles to VMAXPS ymm.
        pub fn max(self, rhs: Self) -> Self {
            let mut r = [0.0f32; LANES];
            for i in 0..LANES {
                r[i] = self.0[i].max(rhs.0[i]);
            }
            Self(r)
        }
    
        /// Lane-wise min; compiles to VMINPS ymm.
        pub fn min(self, rhs: Self) -> Self {
            let mut r = [0.0f32; LANES];
            for i in 0..LANES {
                r[i] = self.0[i].min(rhs.0[i]);
            }
            Self(r)
        }
    
        /// Mask select: choose `on_true[i]` where `mask[i]` is true, else `on_false[i]`.
        /// Maps to `VBLENDVPS` (AVX) or `VPBLENDMD` (AVX-512).
        pub fn select(mask: &MaskF32x8, on_true: Self, on_false: Self) -> Self {
            let mut r = [0.0f32; LANES];
            for i in 0..LANES {
                r[i] = if mask.0[i] {
                    on_true.0[i]
                } else {
                    on_false.0[i]
                };
            }
            Self(r)
        }
    
        /// Compare greater-than, producing a mask.
        pub fn gt(self, rhs: Self) -> MaskF32x8 {
            let mut m = [false; LANES];
            for i in 0..LANES {
                m[i] = self.0[i] > rhs.0[i];
            }
            MaskF32x8(m)
        }
    }
    
    /// A boolean mask over 8 f32 lanes. Maps to `std::simd::Mask<i32, 8>`.
    #[derive(Clone, Copy, Debug)]
    pub struct MaskF32x8([bool; LANES]);
    
    impl MaskF32x8 {
        pub fn any(self) -> bool {
            self.0.iter().any(|&b| b)
        }
        pub fn all(self) -> bool {
            self.0.iter().all(|&b| b)
        }
    }
    
    // ── Vectorised algorithms ─────────────────────────────────────────────────────
    
    /// Dot product using 8-wide SIMD accumulation.
    /// Processes 8 elements per loop iteration.
    pub fn dot_product_simd(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        let n = a.len();
        let full_chunks = n / LANES;
    
        let mut acc = F32x8::splat(0.0);
        for i in 0..full_chunks {
            let off = i * LANES;
            let va = F32x8::from_array(a[off..off + LANES].try_into().unwrap());
            let vb = F32x8::from_array(b[off..off + LANES].try_into().unwrap());
            // acc = va * vb + acc  (FMA)
            acc = va.mul_add(vb, acc);
        }
    
        // Handle remaining elements (scalar tail)
        let mut result = acc.reduce_sum();
        for i in (full_chunks * LANES)..n {
            result += a[i] * b[i];
        }
        result
    }
    
    /// Element-wise clamp using SIMD min/max.
    pub fn clamp_simd(data: &mut [f32], lo: f32, hi: f32) {
        let vlo = F32x8::splat(lo);
        let vhi = F32x8::splat(hi);
        let n = data.len();
        let full = n / LANES;
    
        for i in 0..full {
            let off = i * LANES;
            let v = F32x8::from_array(data[off..off + LANES].try_into().unwrap());
            let clamped = v.max(vlo).min(vhi);
            data[off..off + LANES].copy_from_slice(&clamped.to_array());
        }
        // Scalar tail
        for v in &mut data[full * LANES..] {
            *v = v.clamp(lo, hi);
        }
    }
    
    /// ReLU activation: max(x, 0); common in neural networks.
    pub fn relu_simd(data: &mut [f32]) {
        clamp_simd(data, 0.0, f32::INFINITY);
    }
    
    /// Horizontal sum of a large slice using 8-wide vectors.
    pub fn sum_simd(data: &[f32]) -> f32 {
        let full = data.len() / LANES;
        let mut acc = F32x8::splat(0.0);
        for i in 0..full {
            let off = i * LANES;
            let v = F32x8::from_array(data[off..off + LANES].try_into().unwrap());
            acc = acc.add(v);
        }
        let mut result = acc.reduce_sum();
        for &v in &data[full * LANES..] {
            result += v;
        }
        result
    }
    
    // ── Scalar reference implementations ─────────────────────────────────────────
    
    pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(&x, &y)| x * y).sum()
    }
    
    // ── Tests ─────────────────────────────────────────────────────────────────────
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn lane_add() {
            let a = F32x8::splat(2.0);
            let b = F32x8::splat(3.0);
            assert_eq!(a.add(b).to_array(), [5.0; 8]);
        }
    
        #[test]
        fn reduce_sum() {
            let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
            assert_eq!(a.reduce_sum(), 36.0);
        }
    
        #[test]
        fn dot_product_matches_scalar() {
            let a: Vec<f32> = (0..64).map(|i| i as f32).collect();
            let b: Vec<f32> = (0..64).map(|i| (64 - i) as f32).collect();
            let d_simd = dot_product_simd(&a, &b);
            let d_scalar = dot_scalar(&a, &b);
            assert!((d_simd - d_scalar).abs() < 0.01);
        }
    
        #[test]
        fn relu_zeroes_negatives() {
            let mut v = vec![-2.0f32, -1.0, 0.0, 1.0, 2.0, -3.0, 4.0, -0.5];
            relu_simd(&mut v);
            assert_eq!(v, [0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 4.0, 0.0]);
        }
    
        #[test]
        fn clamp_bounds() {
            let mut v = vec![-5.0f32, 0.0, 5.0, 10.0, 15.0, 3.0, -1.0, 8.0];
            clamp_simd(&mut v, 0.0, 10.0);
            for &x in &v {
                assert!(x >= 0.0 && x <= 10.0);
            }
        }
    
        #[test]
        fn sum_simd_correct() {
            let data: Vec<f32> = (1..=16).map(|i| i as f32).collect();
            let s = sum_simd(&data);
            assert_eq!(s, 136.0); // 1+2+…+16
        }
    
        #[test]
        fn mask_select() {
            let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
            let threshold = F32x8::splat(4.5);
            let mask = a.gt(threshold);
            let selected = F32x8::select(&mask, F32x8::splat(1.0), F32x8::splat(0.0));
            assert_eq!(&selected.to_array()[..4], &[0.0, 0.0, 0.0, 0.0]);
            assert_eq!(&selected.to_array()[4..], &[1.0, 1.0, 1.0, 1.0]);
        }
    }
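The indexed `off = i * LANES` loops above can also be written with `chunks_exact`, as the learning outcomes suggest: it hands out exactly-LANES-sized slices and exposes the scalar tail via `remainder()`. A standalone sketch of `sum_simd` in that style, with a plain array standing in for `F32x8`:

```rust
const LANES: usize = 8;

/// `sum_simd` rewritten with `chunks_exact`: exact LANES-sized slices,
/// with the scalar tail exposed by `remainder()` instead of manual
/// offset arithmetic. A plain array stands in for `F32x8` so this
/// sketch is self-contained.
pub fn sum_chunks(data: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let chunks = data.chunks_exact(LANES);
    let tail = chunks.remainder(); // elements that don't fill a chunk
    for c in chunks {
        for i in 0..LANES {
            acc[i] += c[i]; // lane-wise accumulate: one vector add per chunk
        }
    }
    // Horizontal reduction of the accumulator, then the scalar tail.
    acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
}

fn main() {
    let data: Vec<f32> = (1..=20).map(|i| i as f32).collect();
    assert_eq!(sum_chunks(&data), 210.0); // two full chunks + tail of 4
}
```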

    Key Differences

    | Aspect             | Rust                               | OCaml                             |
    |--------------------|------------------------------------|-----------------------------------|
    | Portable SIMD API  | std::simd (nightly), wide crate    | None (requires C FFI)             |
    | Auto-vectorization | LLVM vectorizes chunks_exact loops | Limited; scalar loops stay scalar |
    | Alignment control  | #[repr(align(32))], aligned crate  | Bigarray with alignment hints     |
    | SIMD intrinsics    | std::arch::x86_64::_mm256_*        | C stubs only                      |
    | Lane width         | Const generic LANES, f32x4 etc.    | Not applicable                    |
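The alignment control mentioned above can be verified on stable Rust: #[repr(align(32))] raises a wrapper's alignment to 32 bytes, matching AVX2's aligned ymm loads. A minimal sketch (the `Aligned32` name is illustrative, not a crate type):

```rust
/// 32-byte-aligned storage for one AVX2-width row of 8 f32s.
/// `Aligned32` is an illustrative name, not a crate type.
#[repr(align(32))]
struct Aligned32([f32; 8]);

fn main() {
    let row = Aligned32([0.0; 8]);
    // A bare [f32; 8] only guarantees 4-byte alignment; the attribute
    // raises the wrapper to 32 bytes, suitable for aligned vmovaps loads.
    assert_eq!(std::mem::align_of::<[f32; 8]>(), 4);
    assert_eq!(std::mem::align_of::<Aligned32>(), 32);
    assert_eq!(&row as *const Aligned32 as usize % 32, 0);
    let _ = row.0[0]; // read the field to keep the sketch warning-free
}
```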

    OCaml Approach

    OCaml has no portable SIMD abstraction. SIMD in OCaml requires C stubs or a library such as Owl (scientific computing), whose numerical kernels delegate to SIMD-optimized BLAS/LAPACK in C:

    (* Pure OCaml: scalar, no SIMD *)
    let dot_product a b =
      Array.fold_left2 (fun acc x y -> acc +. x *. y) 0.0 a b
    
    (* With Owl: delegates to SIMD-optimized CBLAS *)
    (* let result = Owl.Dense.Ndarray.D.dot a b *)
    

    OCaml 5 has an experimental Simd module for internal compiler use, but no stable portable SIMD API is available to users.


    Exercises

  • Enable nightly and rewrite dot_product_simd using std::simd::f32x4. Compare
    the generated assembly with and without target-cpu=native using cargo asm.

  • Implement a SIMD-width-agnostic map_inplace<const L: usize>(data: &mut [f32], f: impl Fn(f32) -> f32)
    that processes chunks of L with a scalar inner loop, and benchmark L=4 vs L=8.

  • Implement a SIMD-friendly prefix sum (scan) over &[f32] using the lane-parallel
    model. Verify correctness and benchmark vs iter().scan().

  • Use the wide crate's f32x8 to implement an element-wise sigmoid approximation
    1.0 / (1.0 + (-x).exp()) and benchmark vs scalar for 1M elements.

  • Implement an aligned buffer type AlignedBuf<T, const ALIGN: usize> using
    #[repr(align)] and verify with std::mem::align_of_val that SIMD loads are safe.

    Open Source Repos