727 Fundamental

Portable SIMD Concepts

Functional Programming

Tutorial

The Problem

Single Instruction, Multiple Data (SIMD) instructions process N data elements with one CPU operation. SSE4.2 processes 4 floats simultaneously; AVX2 processes 8; AVX-512 processes 16. A scalar loop that multiplies 1 million floats takes ~1M multiply instructions; an AVX2 loop takes ~125K—an 8× speedup from instruction count alone. SIMD is essential in signal processing (FFT, FIR filters), image processing (convolution, color conversion), machine learning (matrix multiply, softmax), physics (N-body, fluid simulation), and cryptography (AES-NI, polynomial hashing).

The challenge: x86 SSE/AVX, ARM NEON, and RISC-V V extension have different intrinsics, different vector widths, and different semantics. Code written against _mm256_add_ps is not portable. Rust's std::simd (portable SIMD, stabilizing) and the wide crate provide lane-generic vector types that compile to optimal intrinsics per target.

🎯 Learning Outcomes

• Understand SIMD lanes, vector width, and the lane-parallel execution model

• Implement scalar algorithms in a lane-parallel style that mirrors SIMD semantics

• Use std::simd::f32x4 / f32x8 portable vector types (nightly)

• Apply chunks_exact to process arrays in SIMD-width batches with scalar remainder

• Recognize SIMD alignment requirements and when to use #[repr(align(32))]

Code Example

#![allow(clippy::all)]
// 727. SIMD concepts with std::simd (portable_simd)
//
// This file demonstrates the portable_simd API using stable-compatible
// scalar implementations that mirror the SIMD mental model exactly.
// On nightly Rust, replace the scalar impls with std::simd types.
//
// To enable on nightly: add `#![feature(portable_simd)]` and use
// `use std::simd::*;`
//
// The scalar versions below are structurally identical to the SIMD versions —
// just swap `[f32; LANES]` for `f32xN` and loops for SIMD ops.

// ── LANES constant — matches f32x8 ───────────────────────────────────────────

const LANES: usize = 8;

// ── Portable scalar "SIMD" types (mirrors std::simd API) ─────────────────────

/// Simulates `std::simd::f32x8` — 8 f32 lanes.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x8([f32; LANES]);

impl F32x8 {
    pub fn splat(v: f32) -> Self {
        Self([v; LANES])
    }
    pub fn from_array(a: [f32; LANES]) -> Self {
        Self(a)
    }
    pub fn to_array(self) -> [f32; LANES] {
        self.0
    }

    /// Lane-wise addition — compiles to VADDPS ymm on AVX2.
    pub fn add(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i] + rhs.0[i];
        }
        Self(r)
    }

    /// Lane-wise multiplication — compiles to VMULPS ymm on AVX2.
    pub fn mul(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i] * rhs.0[i];
        }
        Self(r)
    }

    /// Fused multiply-add: self * a + b — compiles to VFMADD213PS on AVX2+FMA.
    pub fn mul_add(self, a: Self, b: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i].mul_add(a.0[i], b.0[i]);
        }
        Self(r)
    }

    /// Horizontal sum reduction — compiles to `vhaddps` or tree reduction.
    pub fn reduce_sum(self) -> f32 {
        self.0.iter().copied().sum()
    }

    /// Lane-wise max — compiles to VMAXPS ymm.
    pub fn max(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i].max(rhs.0[i]);
        }
        Self(r)
    }

    /// Lane-wise min — compiles to VMINPS ymm.
    pub fn min(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i].min(rhs.0[i]);
        }
        Self(r)
    }

    /// Mask select: choose `on_true[i]` where `mask[i] > 0`, else `on_false[i]`.
    /// Maps to `VBLENDVPS` or `VPBLENDMD` on AVX512.
    pub fn select(mask: &MaskF32x8, on_true: Self, on_false: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = if mask.0[i] {
                on_true.0[i]
            } else {
                on_false.0[i]
            };
        }
        Self(r)
    }

    /// Compare greater-than, producing a mask.
    pub fn gt(self, rhs: Self) -> MaskF32x8 {
        let mut m = [false; LANES];
        for i in 0..LANES {
            m[i] = self.0[i] > rhs.0[i];
        }
        MaskF32x8(m)
    }
}

/// A boolean mask over 8 f32 lanes. Maps to `std::simd::Mask<i32, 8>`.
#[derive(Clone, Copy, Debug)]
pub struct MaskF32x8([bool; LANES]);

impl MaskF32x8 {
    pub fn any(self) -> bool {
        self.0.iter().any(|&b| b)
    }
    pub fn all(self) -> bool {
        self.0.iter().all(|&b| b)
    }
}

// ── Vectorised algorithms ─────────────────────────────────────────────────────

/// Dot product using 8-wide SIMD accumulation.
/// Processes 8 elements per loop iteration.
pub fn dot_product_simd(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let n = a.len();
    let full_chunks = n / LANES;

    let mut acc = F32x8::splat(0.0);
    for i in 0..full_chunks {
        let off = i * LANES;
        let va = F32x8::from_array(a[off..off + LANES].try_into().unwrap());
        let vb = F32x8::from_array(b[off..off + LANES].try_into().unwrap());
        // acc = va * vb + acc  (FMA)
        acc = va.mul_add(vb, acc);
    }

    // Handle remaining elements (scalar tail)
    let mut result = acc.reduce_sum();
    for i in (full_chunks * LANES)..n {
        result += a[i] * b[i];
    }
    result
}

/// Element-wise clamp using SIMD min/max.
pub fn clamp_simd(data: &mut [f32], lo: f32, hi: f32) {
    let vlo = F32x8::splat(lo);
    let vhi = F32x8::splat(hi);
    let n = data.len();
    let full = n / LANES;

    for i in 0..full {
        let off = i * LANES;
        let v = F32x8::from_array(data[off..off + LANES].try_into().unwrap());
        let clamped = v.max(vlo).min(vhi);
        data[off..off + LANES].copy_from_slice(&clamped.to_array());
    }
    // Scalar tail
    for v in &mut data[full * LANES..] {
        *v = v.clamp(lo, hi);
    }
}

/// ReLU activation: max(x, 0) — common in neural networks.
pub fn relu_simd(data: &mut [f32]) {
    clamp_simd(data, 0.0, f32::INFINITY);
}

/// Horizontal sum of a large slice using 8-wide vectors.
pub fn sum_simd(data: &[f32]) -> f32 {
    let full = data.len() / LANES;
    let mut acc = F32x8::splat(0.0);
    for i in 0..full {
        let off = i * LANES;
        let v = F32x8::from_array(data[off..off + LANES].try_into().unwrap());
        acc = acc.add(v);
    }
    let mut result = acc.reduce_sum();
    for &v in &data[full * LANES..] {
        result += v;
    }
    result
}

// ── Scalar reference implementations ─────────────────────────────────────────

pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| x * y).sum()
}

// ── main ──────────────────────────────────────────────────────────────────────

// ── Tests ─────────────────────────────────────────────────────────────────────

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lane_add() {
        let a = F32x8::splat(2.0);
        let b = F32x8::splat(3.0);
        assert_eq!(a.add(b).to_array(), [5.0; 8]);
    }

    #[test]
    fn reduce_sum() {
        let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        assert_eq!(a.reduce_sum(), 36.0);
    }

    #[test]
    fn dot_product_matches_scalar() {
        let a: Vec<f32> = (0..64).map(|i| i as f32).collect();
        let b: Vec<f32> = (0..64).map(|i| (64 - i) as f32).collect();
        let d_simd = dot_product_simd(&a, &b);
        let d_scalar = dot_scalar(&a, &b);
        assert!((d_simd - d_scalar).abs() < 0.01);
    }

    #[test]
    fn relu_zeroes_negatives() {
        let mut v = vec![-2.0f32, -1.0, 0.0, 1.0, 2.0, -3.0, 4.0, -0.5];
        relu_simd(&mut v);
        assert_eq!(v, [0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 4.0, 0.0]);
    }

    #[test]
    fn clamp_bounds() {
        let mut v = vec![-5.0f32, 0.0, 5.0, 10.0, 15.0, 3.0, -1.0, 8.0];
        clamp_simd(&mut v, 0.0, 10.0);
        for &x in &v {
            assert!(x >= 0.0 && x <= 10.0);
        }
    }

    #[test]
    fn sum_simd_correct() {
        let data: Vec<f32> = (1..=16).map(|i| i as f32).collect();
        let s = sum_simd(&data);
        assert_eq!(s, 136.0); // 1+2+…+16
    }

    #[test]
    fn mask_select() {
        let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        let threshold = F32x8::splat(4.5);
        let mask = a.gt(threshold);
        let selected = F32x8::select(&mask, F32x8::splat(1.0), F32x8::splat(0.0));
        assert_eq!(&selected.to_array()[..4], &[0.0, 0.0, 0.0, 0.0]);
        assert_eq!(&selected.to_array()[4..], &[1.0, 1.0, 1.0, 1.0]);
    }
}

(* OCaml: SIMD concepts via scalar simulation and Bigarray
   OCaml doesn't have SIMD intrinsics in the standard library.
   We demonstrate the vectorised mental model using Bigarray and
   show what the equivalent SIMD code achieves. *)

open Bigarray

type f32vec = (float, float32_elt, c_layout) Array1.t

let make_vec n = Array1.create float32 c_layout n

(* --- Simulated SIMD operations (scalar fallback) --- *)

(* Vectorised addition: element-wise a[i] + b[i] *)
let vec_add (a : f32vec) (b : f32vec) : f32vec =
  let n = Array1.dim a in
  let c = make_vec n in
  for i = 0 to n - 1 do
    c.{i} <- a.{i} +. b.{i}
  done;
  c

(* Vectorised multiply-accumulate (FMA pattern) *)
let vec_fma (a : f32vec) (b : f32vec) (c : f32vec) : f32vec =
  let n = Array1.dim a in
  let result = make_vec n in
  for i = 0 to n - 1 do
    result.{i} <- a.{i} *. b.{i} +. c.{i}
  done;
  result

(* Horizontal sum (reduction) *)
let vec_sum (a : f32vec) : float =
  let acc = ref 0.0 in
  for i = 0 to Array1.dim a - 1 do
    acc := !acc +. a.{i}
  done;
  !acc

(* Dot product via vec_fma pattern *)
let dot_product (a : f32vec) (b : f32vec) : float =
  let n = Array1.dim a in
  let acc = ref 0.0 in
  for i = 0 to n - 1 do
    acc := !acc +. a.{i} *. b.{i}
  done;
  !acc

(* Conditional select: select a[i] if mask[i] > 0, else b[i] *)
let vec_select (mask : f32vec) (a : f32vec) (b : f32vec) : f32vec =
  let n = Array1.dim a in
  let r = make_vec n in
  for i = 0 to n - 1 do
    r.{i} <- if mask.{i} > 0.0 then a.{i} else b.{i}
  done;
  r

(* --- Demo --- *)

let fill_vec n f =
  let v = make_vec n in
  for i = 0 to n - 1 do v.{i} <- f i done;
  v

let () =
  let n = 8 in
  let a = fill_vec n (fun i -> float_of_int (i + 1)) in
  let b = fill_vec n (fun i -> float_of_int (n - i)) in

  Printf.printf "a = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" a.{i})));
  Printf.printf "b = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" b.{i})));

  let s = vec_add a b in
  Printf.printf "a+b = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" s.{i})));

  Printf.printf "dot(a,b) = %.1f\n" (dot_product a b);
  Printf.printf "sum(a)   = %.1f\n" (vec_sum a);

  let mask = fill_vec n (fun i -> if i < 4 then 1.0 else -1.0) in
  let ones  = fill_vec n (fun _ -> 1.0) in
  let zeros = fill_vec n (fun _ -> 0.0) in
  let sel = vec_select mask ones zeros in
  Printf.printf "select = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" sel.{i})));

  (* Note: in real high-perf OCaml, you'd call C SIMD kernels via FFI.
     Owl/owl-base wraps BLAS/LAPACK which uses SIMD internally. *)
  Printf.printf "OCaml SIMD note: use owl-base or C FFI for native SIMD.\n"

Key Differences

Aspect	Rust	OCaml
Portable SIMD API	`std::simd` (nightly), `wide` crate	None (requires C FFI)
Auto-vectorization	LLVM vectorizes `chunks_exact` loops	Limited; scalar loops stay scalar
Alignment control	`#[repr(align(32))]`, `aligned` crate	`Bigarray` with alignment hints
SIMD intrinsics	`std::arch::x86_64::_mm256_*`	C stubs only
Lane width	Const generic `LANES`, `f32x4` etc.	Not applicable

OCaml Approach

OCaml has no portable SIMD abstraction. SIMD in OCaml requires C stubs or the owl-base scientific computing library, which calls BLAS/LAPACK (SIMD-optimized C):

(* Pure OCaml: scalar, no SIMD *)
let dot_product a b =
  Array.fold_left2 (fun acc x y -> acc +. x *. y) 0.0 a b

(* With Owl: delegates to SIMD-optimized CBLAS *)
(* let result = Owl.Dense.Ndarray.D.dot a b *)

OCaml 5 has an experimental Simd module for internal compiler use but no stable portable SIMD API is available to users.

Full Source

#![allow(clippy::all)]
// 727. SIMD concepts with std::simd (portable_simd)
//
// This file demonstrates the portable_simd API using stable-compatible
// scalar implementations that mirror the SIMD mental model exactly.
// On nightly Rust, replace the scalar impls with std::simd types.
//
// To enable on nightly: add `#![feature(portable_simd)]` and use
// `use std::simd::*;`
//
// The scalar versions below are structurally identical to the SIMD versions —
// just swap `[f32; LANES]` for `f32xN` and loops for SIMD ops.

// ── LANES constant — matches f32x8 ───────────────────────────────────────────

const LANES: usize = 8;

// ── Portable scalar "SIMD" types (mirrors std::simd API) ─────────────────────

/// Simulates `std::simd::f32x8` — 8 f32 lanes.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x8([f32; LANES]);

impl F32x8 {
    pub fn splat(v: f32) -> Self {
        Self([v; LANES])
    }
    pub fn from_array(a: [f32; LANES]) -> Self {
        Self(a)
    }
    pub fn to_array(self) -> [f32; LANES] {
        self.0
    }

    /// Lane-wise addition — compiles to VADDPS ymm on AVX2.
    pub fn add(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i] + rhs.0[i];
        }
        Self(r)
    }

    /// Lane-wise multiplication — compiles to VMULPS ymm on AVX2.
    pub fn mul(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i] * rhs.0[i];
        }
        Self(r)
    }

    /// Fused multiply-add: self * a + b — compiles to VFMADD213PS on AVX2+FMA.
    pub fn mul_add(self, a: Self, b: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i].mul_add(a.0[i], b.0[i]);
        }
        Self(r)
    }

    /// Horizontal sum reduction — compiles to `vhaddps` or tree reduction.
    pub fn reduce_sum(self) -> f32 {
        self.0.iter().copied().sum()
    }

    /// Lane-wise max — compiles to VMAXPS ymm.
    pub fn max(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i].max(rhs.0[i]);
        }
        Self(r)
    }

    /// Lane-wise min — compiles to VMINPS ymm.
    pub fn min(self, rhs: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = self.0[i].min(rhs.0[i]);
        }
        Self(r)
    }

    /// Mask select: choose `on_true[i]` where `mask[i] > 0`, else `on_false[i]`.
    /// Maps to `VBLENDVPS` or `VPBLENDMD` on AVX512.
    pub fn select(mask: &MaskF32x8, on_true: Self, on_false: Self) -> Self {
        let mut r = [0.0f32; LANES];
        for i in 0..LANES {
            r[i] = if mask.0[i] {
                on_true.0[i]
            } else {
                on_false.0[i]
            };
        }
        Self(r)
    }

    /// Compare greater-than, producing a mask.
    pub fn gt(self, rhs: Self) -> MaskF32x8 {
        let mut m = [false; LANES];
        for i in 0..LANES {
            m[i] = self.0[i] > rhs.0[i];
        }
        MaskF32x8(m)
    }
}

/// A boolean mask over 8 f32 lanes. Maps to `std::simd::Mask<i32, 8>`.
#[derive(Clone, Copy, Debug)]
pub struct MaskF32x8([bool; LANES]);

impl MaskF32x8 {
    pub fn any(self) -> bool {
        self.0.iter().any(|&b| b)
    }
    pub fn all(self) -> bool {
        self.0.iter().all(|&b| b)
    }
}

// ── Vectorised algorithms ─────────────────────────────────────────────────────

/// Dot product using 8-wide SIMD accumulation.
/// Processes 8 elements per loop iteration.
pub fn dot_product_simd(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let n = a.len();
    let full_chunks = n / LANES;

    let mut acc = F32x8::splat(0.0);
    for i in 0..full_chunks {
        let off = i * LANES;
        let va = F32x8::from_array(a[off..off + LANES].try_into().unwrap());
        let vb = F32x8::from_array(b[off..off + LANES].try_into().unwrap());
        // acc = va * vb + acc  (FMA)
        acc = va.mul_add(vb, acc);
    }

    // Handle remaining elements (scalar tail)
    let mut result = acc.reduce_sum();
    for i in (full_chunks * LANES)..n {
        result += a[i] * b[i];
    }
    result
}

/// Element-wise clamp using SIMD min/max.
pub fn clamp_simd(data: &mut [f32], lo: f32, hi: f32) {
    let vlo = F32x8::splat(lo);
    let vhi = F32x8::splat(hi);
    let n = data.len();
    let full = n / LANES;

    for i in 0..full {
        let off = i * LANES;
        let v = F32x8::from_array(data[off..off + LANES].try_into().unwrap());
        let clamped = v.max(vlo).min(vhi);
        data[off..off + LANES].copy_from_slice(&clamped.to_array());
    }
    // Scalar tail
    for v in &mut data[full * LANES..] {
        *v = v.clamp(lo, hi);
    }
}

/// ReLU activation: max(x, 0) — common in neural networks.
pub fn relu_simd(data: &mut [f32]) {
    clamp_simd(data, 0.0, f32::INFINITY);
}

/// Horizontal sum of a large slice using 8-wide vectors.
pub fn sum_simd(data: &[f32]) -> f32 {
    let full = data.len() / LANES;
    let mut acc = F32x8::splat(0.0);
    for i in 0..full {
        let off = i * LANES;
        let v = F32x8::from_array(data[off..off + LANES].try_into().unwrap());
        acc = acc.add(v);
    }
    let mut result = acc.reduce_sum();
    for &v in &data[full * LANES..] {
        result += v;
    }
    result
}

// ── Scalar reference implementations ─────────────────────────────────────────

pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| x * y).sum()
}

// ── main ──────────────────────────────────────────────────────────────────────

// ── Tests ─────────────────────────────────────────────────────────────────────

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lane_add() {
        let a = F32x8::splat(2.0);
        let b = F32x8::splat(3.0);
        assert_eq!(a.add(b).to_array(), [5.0; 8]);
    }

    #[test]
    fn reduce_sum() {
        let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        assert_eq!(a.reduce_sum(), 36.0);
    }

    #[test]
    fn dot_product_matches_scalar() {
        let a: Vec<f32> = (0..64).map(|i| i as f32).collect();
        let b: Vec<f32> = (0..64).map(|i| (64 - i) as f32).collect();
        let d_simd = dot_product_simd(&a, &b);
        let d_scalar = dot_scalar(&a, &b);
        assert!((d_simd - d_scalar).abs() < 0.01);
    }

    #[test]
    fn relu_zeroes_negatives() {
        let mut v = vec![-2.0f32, -1.0, 0.0, 1.0, 2.0, -3.0, 4.0, -0.5];
        relu_simd(&mut v);
        assert_eq!(v, [0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 4.0, 0.0]);
    }

    #[test]
    fn clamp_bounds() {
        let mut v = vec![-5.0f32, 0.0, 5.0, 10.0, 15.0, 3.0, -1.0, 8.0];
        clamp_simd(&mut v, 0.0, 10.0);
        for &x in &v {
            assert!(x >= 0.0 && x <= 10.0);
        }
    }

    #[test]
    fn sum_simd_correct() {
        let data: Vec<f32> = (1..=16).map(|i| i as f32).collect();
        let s = sum_simd(&data);
        assert_eq!(s, 136.0); // 1+2+…+16
    }

    #[test]
    fn mask_select() {
        let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        let threshold = F32x8::splat(4.5);
        let mask = a.gt(threshold);
        let selected = F32x8::select(&mask, F32x8::splat(1.0), F32x8::splat(0.0));
        assert_eq!(&selected.to_array()[..4], &[0.0, 0.0, 0.0, 0.0]);
        assert_eq!(&selected.to_array()[4..], &[1.0, 1.0, 1.0, 1.0]);
    }
}

(* OCaml: SIMD concepts via scalar simulation and Bigarray
   OCaml doesn't have SIMD intrinsics in the standard library.
   We demonstrate the vectorised mental model using Bigarray and
   show what the equivalent SIMD code achieves. *)

open Bigarray

type f32vec = (float, float32_elt, c_layout) Array1.t

let make_vec n = Array1.create float32 c_layout n

(* --- Simulated SIMD operations (scalar fallback) --- *)

(* Vectorised addition: element-wise a[i] + b[i] *)
let vec_add (a : f32vec) (b : f32vec) : f32vec =
  let n = Array1.dim a in
  let c = make_vec n in
  for i = 0 to n - 1 do
    c.{i} <- a.{i} +. b.{i}
  done;
  c

(* Vectorised multiply-accumulate (FMA pattern) *)
let vec_fma (a : f32vec) (b : f32vec) (c : f32vec) : f32vec =
  let n = Array1.dim a in
  let result = make_vec n in
  for i = 0 to n - 1 do
    result.{i} <- a.{i} *. b.{i} +. c.{i}
  done;
  result

(* Horizontal sum (reduction) *)
let vec_sum (a : f32vec) : float =
  let acc = ref 0.0 in
  for i = 0 to Array1.dim a - 1 do
    acc := !acc +. a.{i}
  done;
  !acc

(* Dot product via vec_fma pattern *)
let dot_product (a : f32vec) (b : f32vec) : float =
  let n = Array1.dim a in
  let acc = ref 0.0 in
  for i = 0 to n - 1 do
    acc := !acc +. a.{i} *. b.{i}
  done;
  !acc

(* Conditional select: select a[i] if mask[i] > 0, else b[i] *)
let vec_select (mask : f32vec) (a : f32vec) (b : f32vec) : f32vec =
  let n = Array1.dim a in
  let r = make_vec n in
  for i = 0 to n - 1 do
    r.{i} <- if mask.{i} > 0.0 then a.{i} else b.{i}
  done;
  r

(* --- Demo --- *)

let fill_vec n f =
  let v = make_vec n in
  for i = 0 to n - 1 do v.{i} <- f i done;
  v

let () =
  let n = 8 in
  let a = fill_vec n (fun i -> float_of_int (i + 1)) in
  let b = fill_vec n (fun i -> float_of_int (n - i)) in

  Printf.printf "a = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" a.{i})));
  Printf.printf "b = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" b.{i})));

  let s = vec_add a b in
  Printf.printf "a+b = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" s.{i})));

  Printf.printf "dot(a,b) = %.1f\n" (dot_product a b);
  Printf.printf "sum(a)   = %.1f\n" (vec_sum a);

  let mask = fill_vec n (fun i -> if i < 4 then 1.0 else -1.0) in
  let ones  = fill_vec n (fun _ -> 1.0) in
  let zeros = fill_vec n (fun _ -> 0.0) in
  let sel = vec_select mask ones zeros in
  Printf.printf "select = [%s]\n"
    (String.concat "; " (List.init n (fun i -> Printf.sprintf "%.0f" sel.{i})));

  (* Note: in real high-perf OCaml, you'd call C SIMD kernels via FFI.
     Owl/owl-base wraps BLAS/LAPACK which uses SIMD internally. *)
  Printf.printf "OCaml SIMD note: use owl-base or C FFI for native SIMD.\n"

✓ Tests Rust test suite

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lane_add() {
        let a = F32x8::splat(2.0);
        let b = F32x8::splat(3.0);
        assert_eq!(a.add(b).to_array(), [5.0; 8]);
    }

    #[test]
    fn reduce_sum() {
        let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        assert_eq!(a.reduce_sum(), 36.0);
    }

    #[test]
    fn dot_product_matches_scalar() {
        let a: Vec<f32> = (0..64).map(|i| i as f32).collect();
        let b: Vec<f32> = (0..64).map(|i| (64 - i) as f32).collect();
        let d_simd = dot_product_simd(&a, &b);
        let d_scalar = dot_scalar(&a, &b);
        assert!((d_simd - d_scalar).abs() < 0.01);
    }

    #[test]
    fn relu_zeroes_negatives() {
        let mut v = vec![-2.0f32, -1.0, 0.0, 1.0, 2.0, -3.0, 4.0, -0.5];
        relu_simd(&mut v);
        assert_eq!(v, [0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 4.0, 0.0]);
    }

    #[test]
    fn clamp_bounds() {
        let mut v = vec![-5.0f32, 0.0, 5.0, 10.0, 15.0, 3.0, -1.0, 8.0];
        clamp_simd(&mut v, 0.0, 10.0);
        for &x in &v {
            assert!(x >= 0.0 && x <= 10.0);
        }
    }

    #[test]
    fn sum_simd_correct() {
        let data: Vec<f32> = (1..=16).map(|i| i as f32).collect();
        let s = sum_simd(&data);
        assert_eq!(s, 136.0); // 1+2+…+16
    }

    #[test]
    fn mask_select() {
        let a = F32x8::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        let threshold = F32x8::splat(4.5);
        let mask = a.gt(threshold);
        let selected = F32x8::select(&mask, F32x8::splat(1.0), F32x8::splat(0.0));
        assert_eq!(&selected.to_array()[..4], &[0.0, 0.0, 0.0, 0.0]);
        assert_eq!(&selected.to_array()[4..], &[1.0, 1.0, 1.0, 1.0]);
    }
}

Exercises

Enable nightly and rewrite dot_product_lanes using std::simd::f32x4. Compare

the generated assembly with and without target-cpu=native using cargo asm.

Implement a SIMD-width-agnostic map_inplace<const L: usize>(data: &mut [f32], f: impl Fn(f32) -> f32)

that processes chunks of L with a scalar inner loop and benchmark L=4 vs L=8.

Implement a SIMD-friendly prefix sum (scan) over &[f32] using the lane-parallel

model. Verify correctness and benchmark vs iter().scan().

Use the wide crate's f32x8 to implement an element-wise sigmoid approximation

1.0 / (1.0 + (-x).exp()) and benchmark vs scalar for 1M elements.

Implement an aligned buffer type AlignedBuf<T, const ALIGN: usize> using

#[repr(align)] and verify with std::mem::align_of_val that SIMD loads are safe.

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.

Rust