815 Advanced

Boyer-Moore String Search

Functional Programming

Tutorial Video

Text description (accessibility)

This video demonstrates the "Boyer-Moore String Search" functional Rust example. Difficulty level: Advanced. Key concepts covered: Functional Programming. Naive string search checks every position in the text, yielding O(n*m) worst-case time. Key difference from OCaml: | Aspect | Rust | OCaml |

Tutorial

The Problem

Naive string search checks every position in the text, yielding O(n*m) worst-case time. For long texts with long patterns this is prohibitively slow. Real-world applications — log scanning, virus signature detection, text editors, DNA sequence analysis — require sublinear average-case search. Boyer-Moore achieves this by using two heuristics: the bad-character rule skips large chunks of text when a mismatch occurs at the wrong character position, and the good-suffix rule leverages known pattern structure to skip even further. The result is often O(n/m) average-case performance, making Boyer-Moore the algorithm behind grep and many production text search tools.

🎯 Learning Outcomes

• Understand the bad-character heuristic and how it allows right-to-left scanning with large skips

• Implement a preprocessing phase that builds a shift table from the pattern

• Recognize why Boyer-Moore scans the pattern right-to-left even though it advances left-to-right through text

• Learn how the algorithm achieves sublinear average-case complexity

• Compare Boyer-Moore with KMP: Boyer-Moore is faster in practice for large alphabets, KMP has better worst-case

Code Example

#![allow(clippy::all)]
//! # Boyer-Moore String Search
pub fn boyer_moore(text: &str, pattern: &str) -> Vec<usize> {
    let t: Vec<char> = text.chars().collect();
    let p: Vec<char> = pattern.chars().collect();
    let (n, m) = (t.len(), p.len());
    if m == 0 || m > n {
        return vec![];
    }

    let mut bad_char = std::collections::HashMap::new();
    for (i, &c) in p.iter().enumerate() {
        bad_char.insert(c, i);
    }

    let mut results = vec![];
    let mut s = 0;
    while s <= n - m {
        let mut j = m - 1;
        while j < m && p[j] == t[s + j] {
            if j == 0 {
                break;
            }
            j -= 1;
        }
        if j == 0 && p[0] == t[s] {
            results.push(s);
            s += 1;
        } else {
            s += (j.saturating_sub(*bad_char.get(&t[s + j]).unwrap_or(&0))).max(1);
        }
    }
    results
}

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_bm() {
        assert!(!boyer_moore("ABAAABCD", "ABC").is_empty());
    }
}

(* Boyer-Moore-Horspool string search *)

let build_shift pattern =
  let m     = String.length pattern in
  let shift = Array.make 256 m in  (* default: shift by full pattern length *)
  for i = 0 to m - 2 do
    shift.(Char.code pattern.[i]) <- m - 1 - i
  done;
  shift

let bmh_search text pattern =
  let n = String.length text in
  let m = String.length pattern in
  if m = 0 then []
  else if m > n then []
  else begin
    let shift   = build_shift pattern in
    let matches = ref [] in
    let pos     = ref 0 in
    while !pos <= n - m do
      (* Compare right to left *)
      let j = ref (m - 1) in
      while !j >= 0 && pattern.[!j] = text.[!pos + !j] do
        decr j
      done;
      if !j < 0 then
        matches := !pos :: !matches;
      pos := !pos + shift.(Char.code text.[!pos + m - 1])
    done;
    List.rev !matches
  end

let () =
  let text = "ABAAABCDABABCABAB" in
  let pat  = "ABAB" in
  Printf.printf "Text:    %S\n" text;
  Printf.printf "Pattern: %S\n" pat;
  Printf.printf "Matches: [%s]\n"
    (String.concat "; " (List.map string_of_int (bmh_search text pat)));

  let text2 = "aaaaaaaaaa" in
  let pat2  = "aaa" in
  Printf.printf "\nText: %S  Pattern: %S\n" text2 pat2;
  Printf.printf "Matches: [%s]\n"
    (String.concat "; " (List.map string_of_int (bmh_search text2 pat2)))

Key Differences

Aspect	Rust	OCaml
Shift table	`HashMap<char, usize>` for Unicode	`int array` (256) for ASCII or `Map`
String indexing	`chars().collect()` to Vec	`String.get`/`Bytes.get` by index
Alphabet size	Unbounded (any Unicode)	Usually 256 for byte-level search
Result collection	`Vec<usize>` pushed in loop	Recursive list accumulation
Pattern scanning	Right-to-left inner loop	Same, via index arithmetic
Worst case	O(n*m) with bad-character only	Same without good-suffix rule

OCaml Approach

In OCaml, Boyer-Moore is implemented using Bytes or String with Char.code indexing for O(1) character access. The bad-character table is often an int array of size 256 for ASCII. OCaml's tail recursion enables a clean recursive formulation of the scan loop. The Buffer module avoids string concatenation in preprocessing. Functors can parameterize over alphabet type, and the Map module serves the role of Rust's HashMap. OCaml's immutable-first style encourages returning a list of match positions rather than mutating a results vector.

Full Source

#![allow(clippy::all)]
//! # Boyer-Moore String Search
pub fn boyer_moore(text: &str, pattern: &str) -> Vec<usize> {
    let t: Vec<char> = text.chars().collect();
    let p: Vec<char> = pattern.chars().collect();
    let (n, m) = (t.len(), p.len());
    if m == 0 || m > n {
        return vec![];
    }

    let mut bad_char = std::collections::HashMap::new();
    for (i, &c) in p.iter().enumerate() {
        bad_char.insert(c, i);
    }

    let mut results = vec![];
    let mut s = 0;
    while s <= n - m {
        let mut j = m - 1;
        while j < m && p[j] == t[s + j] {
            if j == 0 {
                break;
            }
            j -= 1;
        }
        if j == 0 && p[0] == t[s] {
            results.push(s);
            s += 1;
        } else {
            s += (j.saturating_sub(*bad_char.get(&t[s + j]).unwrap_or(&0))).max(1);
        }
    }
    results
}

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_bm() {
        assert!(!boyer_moore("ABAAABCD", "ABC").is_empty());
    }
}

(* Boyer-Moore-Horspool string search *)

let build_shift pattern =
  let m     = String.length pattern in
  let shift = Array.make 256 m in  (* default: shift by full pattern length *)
  for i = 0 to m - 2 do
    shift.(Char.code pattern.[i]) <- m - 1 - i
  done;
  shift

let bmh_search text pattern =
  let n = String.length text in
  let m = String.length pattern in
  if m = 0 then []
  else if m > n then []
  else begin
    let shift   = build_shift pattern in
    let matches = ref [] in
    let pos     = ref 0 in
    while !pos <= n - m do
      (* Compare right to left *)
      let j = ref (m - 1) in
      while !j >= 0 && pattern.[!j] = text.[!pos + !j] do
        decr j
      done;
      if !j < 0 then
        matches := !pos :: !matches;
      pos := !pos + shift.(Char.code text.[!pos + m - 1])
    done;
    List.rev !matches
  end

let () =
  let text = "ABAAABCDABABCABAB" in
  let pat  = "ABAB" in
  Printf.printf "Text:    %S\n" text;
  Printf.printf "Pattern: %S\n" pat;
  Printf.printf "Matches: [%s]\n"
    (String.concat "; " (List.map string_of_int (bmh_search text pat)));

  let text2 = "aaaaaaaaaa" in
  let pat2  = "aaa" in
  Printf.printf "\nText: %S  Pattern: %S\n" text2 pat2;
  Printf.printf "Matches: [%s]\n"
    (String.concat "; " (List.map string_of_int (bmh_search text2 pat2)))

✓ Tests Rust test suite

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_bm() {
        assert!(!boyer_moore("ABAAABCD", "ABC").is_empty());
    }
}

Deep Comparison

OCaml vs Rust: Boyer Moore Search

Overview

See the example.rs and example.ml files for detailed implementations.

Key Differences

Aspect	OCaml	Rust
Type system	Hindley-Milner	Ownership + traits
Memory	GC	Zero-cost abstractions
Mutability	Explicit ref	mut keyword
Error handling	Option/Result	Result<T, E>

See README.md for detailed comparison.

Exercises

Add the good-suffix heuristic and measure the improvement on repetitive text like "aaaa...a" searching for "aaab".

Implement a case-insensitive version by normalizing to lowercase during the bad-character table build.

Benchmark Boyer-Moore against KMP on: short patterns in long text, long patterns in long text, and DNA sequences.

Extend to return (usize, usize) pairs of (start, end) positions instead of just start indices.

Handle the edge case where the pattern is longer than the text gracefully without panicking.

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.

Rust