ExamplesBy LevelBy TopicLearning Paths
945 Fundamental

945 Word Count

Functional Programming

Tutorial

The Problem

Count word frequencies in a string, normalizing to lowercase and extracting only alphanumeric tokens. Implement three variants: imperative mutation with HashMap, a functional fold over owned tokens, and an ordered BTreeMap version. Compare the fold-based approach with OCaml's pattern of building frequency maps from lists.

🎯 Learning Outcomes

  • • Tokenize text: lowercase conversion, character-by-character scanning, buffer-based word extraction
  • • Build a word frequency map using HashMap with entry().or_insert(0) and *count += 1
  • • Express the same logic as a functional fold over an Iterator of owned String values
  • • Use BTreeMap for alphabetically ordered output — the Rust analog of OCaml's Map.Make(String)
  • • Understand the difference between owned-key insertion and reference-based lookups
  • Code Example

    #![allow(clippy::all)]
    use std::collections::BTreeMap;
    /// Word Count with Map
    ///
    /// Building a word-frequency map from text. Demonstrates string
    /// normalization, splitting, and folding into a map.
    use std::collections::HashMap;
    
    /// Tokenize: lowercase and extract alphanumeric words.
    pub fn tokenize(s: &str) -> Vec<String> {
        let s = s.to_lowercase();
        let mut words = Vec::new();
        let mut buf = String::new();
    
        for c in s.chars() {
            if c.is_alphanumeric() {
                buf.push(c);
            } else if !buf.is_empty() {
                words.push(buf.clone());
                buf.clear();
            }
        }
        if !buf.is_empty() {
            words.push(buf);
        }
        words
    }
    
    /// Word count using HashMap — O(1) average lookup.
    pub fn word_count(sentence: &str) -> HashMap<String, usize> {
        let mut map = HashMap::new();
        for word in tokenize(sentence) {
            *map.entry(word).or_insert(0) += 1;
        }
        map
    }
    
    /// Word count using iterator fold — more functional style.
    pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
        tokenize(sentence)
            .into_iter()
            .fold(HashMap::new(), |mut map, word| {
                *map.entry(word).or_insert(0) += 1;
                map
            })
    }
    
    /// Ordered word count using BTreeMap (like OCaml's Map).
    pub fn word_count_ordered(sentence: &str) -> BTreeMap<String, usize> {
        let mut map = BTreeMap::new();
        for word in tokenize(sentence) {
            *map.entry(word).or_insert(0) += 1;
        }
        map
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_basic() {
            let m = word_count("the cat sat on the mat");
            assert_eq!(m.get("the"), Some(&2));
            assert_eq!(m.get("cat"), Some(&1));
        }
    
        #[test]
        fn test_punctuation() {
            let m = word_count("the cat sat on the mat, the cat sat");
            assert_eq!(m.get("the"), Some(&3));
            assert_eq!(m.get("cat"), Some(&2));
            assert_eq!(m.get("sat"), Some(&2));
        }
    
        #[test]
        fn test_case_insensitive() {
            let m = word_count("The THE the");
            assert_eq!(m.get("the"), Some(&3));
        }
    
        #[test]
        fn test_empty() {
            let m = word_count("");
            assert!(m.is_empty());
        }
    
        #[test]
        fn test_single_word() {
            let m = word_count("hello");
            assert_eq!(m.get("hello"), Some(&1));
            assert_eq!(m.len(), 1);
        }
    
        #[test]
        fn test_fold_matches() {
            let s = "the cat sat on the mat";
            assert_eq!(word_count(s), word_count_fold(s));
        }
    
        #[test]
        fn test_ordered() {
            let m = word_count_ordered("banana apple cherry apple");
            let keys: Vec<_> = m.keys().collect();
            assert_eq!(keys, vec!["apple", "banana", "cherry"]);
        }
    }

    Key Differences

    AspectRustOCaml
    Default mapHashMap — O(1) average, unorderedMap.Make(M) — O(log n), always ordered
    Hash tableHashMapHashtbl (mutable)
    Entry APIentry().or_insert(0) — single lookuptry find ... with Not_found -> 0 + add
    Ordered mapBTreeMapMap.Make(M)
    TokenizationChar-level is_alphanumeric()String.iter with manual predicate

    HashMap is the default Rust choice for frequency counting due to O(1) average-case insertion and lookup. Use BTreeMap when sorted output is needed, accepting O(log n) overhead.

    OCaml Approach

    let tokenize s =
      let s = String.lowercase_ascii s in
      let buf = Buffer.create 16 in
      let words = ref [] in
      String.iter (fun c ->
        if Char.code c land 127 |> Char.chr |> (fun c -> Char.code c >= 48)
           (* simplified: use is_alnum predicate *)
        then Buffer.add_char buf c
        else if Buffer.length buf > 0 then begin
          words := Buffer.contents buf :: !words;
          Buffer.clear buf
        end
      ) s;
      if Buffer.length buf > 0 then words := Buffer.contents buf :: !words;
      List.rev !words
    
    module StringMap = Map.Make(String)
    
    let word_count sentence =
      List.fold_left (fun acc word ->
        let n = try StringMap.find word acc with Not_found -> 0 in
        StringMap.add word (n + 1) acc
      ) StringMap.empty (tokenize sentence)
    

    OCaml's Map.Make(String) creates a balanced BST map keyed by strings — always ordered, like Rust's BTreeMap. OCaml has no HashMap in the standard library (the Hashtbl module provides mutable hash tables).

    Full Source

    #![allow(clippy::all)]
    use std::collections::BTreeMap;
    /// Word Count with Map
    ///
    /// Building a word-frequency map from text. Demonstrates string
    /// normalization, splitting, and folding into a map.
    use std::collections::HashMap;
    
    /// Tokenize: lowercase and extract alphanumeric words.
    pub fn tokenize(s: &str) -> Vec<String> {
        let s = s.to_lowercase();
        let mut words = Vec::new();
        let mut buf = String::new();
    
        for c in s.chars() {
            if c.is_alphanumeric() {
                buf.push(c);
            } else if !buf.is_empty() {
                words.push(buf.clone());
                buf.clear();
            }
        }
        if !buf.is_empty() {
            words.push(buf);
        }
        words
    }
    
    /// Word count using HashMap — O(1) average lookup.
    pub fn word_count(sentence: &str) -> HashMap<String, usize> {
        let mut map = HashMap::new();
        for word in tokenize(sentence) {
            *map.entry(word).or_insert(0) += 1;
        }
        map
    }
    
    /// Word count using iterator fold — more functional style.
    pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
        tokenize(sentence)
            .into_iter()
            .fold(HashMap::new(), |mut map, word| {
                *map.entry(word).or_insert(0) += 1;
                map
            })
    }
    
    /// Ordered word count using BTreeMap (like OCaml's Map).
    pub fn word_count_ordered(sentence: &str) -> BTreeMap<String, usize> {
        let mut map = BTreeMap::new();
        for word in tokenize(sentence) {
            *map.entry(word).or_insert(0) += 1;
        }
        map
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_basic() {
            let m = word_count("the cat sat on the mat");
            assert_eq!(m.get("the"), Some(&2));
            assert_eq!(m.get("cat"), Some(&1));
        }
    
        #[test]
        fn test_punctuation() {
            let m = word_count("the cat sat on the mat, the cat sat");
            assert_eq!(m.get("the"), Some(&3));
            assert_eq!(m.get("cat"), Some(&2));
            assert_eq!(m.get("sat"), Some(&2));
        }
    
        #[test]
        fn test_case_insensitive() {
            let m = word_count("The THE the");
            assert_eq!(m.get("the"), Some(&3));
        }
    
        #[test]
        fn test_empty() {
            let m = word_count("");
            assert!(m.is_empty());
        }
    
        #[test]
        fn test_single_word() {
            let m = word_count("hello");
            assert_eq!(m.get("hello"), Some(&1));
            assert_eq!(m.len(), 1);
        }
    
        #[test]
        fn test_fold_matches() {
            let s = "the cat sat on the mat";
            assert_eq!(word_count(s), word_count_fold(s));
        }
    
        #[test]
        fn test_ordered() {
            let m = word_count_ordered("banana apple cherry apple");
            let keys: Vec<_> = m.keys().collect();
            assert_eq!(keys, vec!["apple", "banana", "cherry"]);
        }
    }
    ✓ Tests Rust test suite
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_basic() {
            let m = word_count("the cat sat on the mat");
            assert_eq!(m.get("the"), Some(&2));
            assert_eq!(m.get("cat"), Some(&1));
        }
    
        #[test]
        fn test_punctuation() {
            let m = word_count("the cat sat on the mat, the cat sat");
            assert_eq!(m.get("the"), Some(&3));
            assert_eq!(m.get("cat"), Some(&2));
            assert_eq!(m.get("sat"), Some(&2));
        }
    
        #[test]
        fn test_case_insensitive() {
            let m = word_count("The THE the");
            assert_eq!(m.get("the"), Some(&3));
        }
    
        #[test]
        fn test_empty() {
            let m = word_count("");
            assert!(m.is_empty());
        }
    
        #[test]
        fn test_single_word() {
            let m = word_count("hello");
            assert_eq!(m.get("hello"), Some(&1));
            assert_eq!(m.len(), 1);
        }
    
        #[test]
        fn test_fold_matches() {
            let s = "the cat sat on the mat";
            assert_eq!(word_count(s), word_count_fold(s));
        }
    
        #[test]
        fn test_ordered() {
            let m = word_count_ordered("banana apple cherry apple");
            let keys: Vec<_> = m.keys().collect();
            assert_eq!(keys, vec!["apple", "banana", "cherry"]);
        }
    }

    Deep Comparison

    Word Count with Map: OCaml vs Rust

    The Core Insight

    Word counting combines string processing with map operations — a practical pattern that reveals how each language handles mutable vs immutable data structures. OCaml builds a new map on every insert; Rust mutates a HashMap in place. Both are correct; the performance characteristics differ significantly.

    OCaml Approach

    OCaml uses StringMap (from Map.Make(String)) — an immutable balanced BST. Each StringMap.add creates a new map sharing most of its structure with the old one. The fold accumulates by threading the map through each word. String tokenization uses a mutable Buffer and ref — one of the few places where mutation is pragmatic in OCaml. Option.value ~default:0 handles the missing-key case.

    Rust Approach

    Rust's HashMap is a mutable hash table with O(1) amortized insert/lookup. The entry API (map.entry(word).or_insert(0)) is a Rust-specific ergonomic pattern that handles both insert and update in one call. The fold variant passes mut map through the closure. BTreeMap is available when ordered output is needed (like OCaml's Map). Tokenization uses iterators and String::push for building words.

    Side-by-Side

    ConceptOCamlRust
    Map typeStringMap (immutable BST)HashMap (mutable hash table)
    InsertStringMap.add k v m → new mapmap.entry(k).or_insert(0) += 1
    LookupStringMap.find_opt k mmap.get(&k)
    OrderedYes (BST)BTreeMap for ordered, HashMap for fast
    TokenizationBuffer + ref (mutable)String::push + Vec
    ComplexityO(n log n) totalO(n) amortized total

    What Rust Learners Should Notice

  • • The entry API is uniquely powerful: it returns a mutable reference to the value (inserting a default if missing), avoiding double-lookup
  • *map.entry(word).or_insert(0) += 1 is the idiomatic one-liner for frequency counting in Rust
  • HashMap is unordered; use BTreeMap if you need sorted keys (like OCaml's Map)
  • • OCaml's immutable map naturally supports persistent snapshots; Rust's mutable map is faster but requires explicit cloning for snapshots
  • • String normalization (to_lowercase()) returns a new String in Rust — strings are always UTF-8, never mutated in place
  • Further Reading

  • • [The Rust Book — Hash Maps](https://doc.rust-lang.org/book/ch08-03-hash-maps.html)
  • • [Exercism Word Count](https://exercism.org/tracks/ocaml/exercises/word-count)
  • Exercises

  • Add a top_n(map, n) function that returns the n most frequent words using sort_by.
  • Implement word_count_parallel that splits the text into chunks, counts each chunk in a thread, then merges the maps.
  • Add stop-word filtering: skip common words like "the", "a", "is" before counting.
  • Write bigram_count that counts frequency of consecutive word pairs.
  • Compare HashMap vs BTreeMap performance on a 10,000-word corpus with a benchmark.
  • Open Source Repos