ExamplesBy LevelBy TopicLearning Paths
815 Advanced

Boyer-Moore String Search

Functional Programming

Tutorial Video

Text description (accessibility)

This video demonstrates the "Boyer-Moore String Search" functional Rust example. Difficulty level: Advanced. Key concepts covered: Functional Programming. Naive string search checks every position in the text, yielding O(n*m) worst-case time. Key difference from OCaml: | Aspect | Rust | OCaml |

Tutorial

The Problem

Naive string search checks every position in the text, yielding O(n*m) worst-case time. For long texts with long patterns this is prohibitively slow. Real-world applications — log scanning, virus signature detection, text editors, DNA sequence analysis — require sublinear average-case search. Boyer-Moore achieves this by using two heuristics: the bad-character rule skips large chunks of text when a mismatch occurs at the wrong character position, and the good-suffix rule leverages known pattern structure to skip even further. The result is often O(n/m) average-case performance, making Boyer-Moore the algorithm behind grep and many production text search tools.

🎯 Learning Outcomes

  • • Understand the bad-character heuristic and how it allows right-to-left scanning with large skips
  • • Implement a preprocessing phase that builds a shift table from the pattern
  • • Recognize why Boyer-Moore scans the pattern right-to-left even though it advances left-to-right through text
  • • Learn how the algorithm achieves sublinear average-case complexity
  • • Compare Boyer-Moore with KMP: Boyer-Moore is faster in practice for large alphabets, KMP has better worst-case
  • Code Example

    #![allow(clippy::all)]
    //! # Boyer-Moore String Search
    pub fn boyer_moore(text: &str, pattern: &str) -> Vec<usize> {
        let t: Vec<char> = text.chars().collect();
        let p: Vec<char> = pattern.chars().collect();
        let (n, m) = (t.len(), p.len());
        if m == 0 || m > n {
            return vec![];
        }
    
        let mut bad_char = std::collections::HashMap::new();
        for (i, &c) in p.iter().enumerate() {
            bad_char.insert(c, i);
        }
    
        let mut results = vec![];
        let mut s = 0;
        while s <= n - m {
            let mut j = m - 1;
            while j < m && p[j] == t[s + j] {
                if j == 0 {
                    break;
                }
                j -= 1;
            }
            if j == 0 && p[0] == t[s] {
                results.push(s);
                s += 1;
            } else {
                s += (j.saturating_sub(*bad_char.get(&t[s + j]).unwrap_or(&0))).max(1);
            }
        }
        results
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
        #[test]
        fn test_bm() {
            assert!(!boyer_moore("ABAAABCD", "ABC").is_empty());
        }
    }

    Key Differences

    AspectRustOCaml
    Shift tableHashMap<char, usize> for Unicodeint array (256) for ASCII or Map
    String indexingchars().collect() to VecString.get/Bytes.get by index
    Alphabet sizeUnbounded (any Unicode)Usually 256 for byte-level search
    Result collectionVec<usize> pushed in loopRecursive list accumulation
    Pattern scanningRight-to-left inner loopSame, via index arithmetic
    Worst caseO(n*m) with bad-character onlySame without good-suffix rule

    OCaml Approach

    In OCaml, Boyer-Moore is implemented using Bytes or String with Char.code indexing for O(1) character access. The bad-character table is often an int array of size 256 for ASCII. OCaml's tail recursion enables a clean recursive formulation of the scan loop. The Buffer module avoids string concatenation in preprocessing. Functors can parameterize over alphabet type, and the Map module serves the role of Rust's HashMap. OCaml's immutable-first style encourages returning a list of match positions rather than mutating a results vector.

    Full Source

    #![allow(clippy::all)]
    //! # Boyer-Moore String Search
    pub fn boyer_moore(text: &str, pattern: &str) -> Vec<usize> {
        let t: Vec<char> = text.chars().collect();
        let p: Vec<char> = pattern.chars().collect();
        let (n, m) = (t.len(), p.len());
        if m == 0 || m > n {
            return vec![];
        }
    
        let mut bad_char = std::collections::HashMap::new();
        for (i, &c) in p.iter().enumerate() {
            bad_char.insert(c, i);
        }
    
        let mut results = vec![];
        let mut s = 0;
        while s <= n - m {
            let mut j = m - 1;
            while j < m && p[j] == t[s + j] {
                if j == 0 {
                    break;
                }
                j -= 1;
            }
            if j == 0 && p[0] == t[s] {
                results.push(s);
                s += 1;
            } else {
                s += (j.saturating_sub(*bad_char.get(&t[s + j]).unwrap_or(&0))).max(1);
            }
        }
        results
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
        #[test]
        fn test_bm() {
            assert!(!boyer_moore("ABAAABCD", "ABC").is_empty());
        }
    }
    ✓ Tests Rust test suite
    #[cfg(test)]
    mod tests {
        use super::*;
        #[test]
        fn test_bm() {
            assert!(!boyer_moore("ABAAABCD", "ABC").is_empty());
        }
    }

    Deep Comparison

    OCaml vs Rust: Boyer Moore Search

    Overview

    See the example.rs and example.ml files for detailed implementations.

    Key Differences

    AspectOCamlRust
    Type systemHindley-MilnerOwnership + traits
    MemoryGCZero-cost abstractions
    MutabilityExplicit refmut keyword
    Error handlingOption/ResultResult<T, E>

    See README.md for detailed comparison.

    Exercises

  • Add the good-suffix heuristic and measure the improvement on repetitive text like "aaaa...a" searching for "aaab".
  • Implement a case-insensitive version by normalizing to lowercase during the bad-character table build.
  • Benchmark Boyer-Moore against KMP on: short patterns in long text, long patterns in long text, and DNA sequences.
  • Extend to return (usize, usize) pairs of (start, end) positions instead of just start indices.
  • Handle the edge case where the pattern is longer than the text gracefully without panicking.
  • Open Source Repos