945 Word Count
Tutorial
The Problem
Count word frequencies in a string, normalizing to lowercase and extracting only alphanumeric tokens. Implement three variants: imperative mutation with HashMap, a functional fold over owned tokens, and an ordered BTreeMap version. Compare the fold-based approach with OCaml's pattern of building frequency maps from lists.
🎯 Learning Outcomes
HashMap with entry().or_insert(0) and *count += 1fold over an Iterator of owned String valuesBTreeMap for alphabetically ordered output — the Rust analog of OCaml's Map.Make(String)Code Example
#![allow(clippy::all)]
use std::collections::BTreeMap;
/// Word Count with Map
///
/// Building a word-frequency map from text. Demonstrates string
/// normalization, splitting, and folding into a map.
use std::collections::HashMap;
/// Tokenize: lowercase and extract alphanumeric words.
pub fn tokenize(s: &str) -> Vec<String> {
let s = s.to_lowercase();
let mut words = Vec::new();
let mut buf = String::new();
for c in s.chars() {
if c.is_alphanumeric() {
buf.push(c);
} else if !buf.is_empty() {
words.push(buf.clone());
buf.clear();
}
}
if !buf.is_empty() {
words.push(buf);
}
words
}
/// Word count using HashMap — O(1) average lookup.
pub fn word_count(sentence: &str) -> HashMap<String, usize> {
let mut map = HashMap::new();
for word in tokenize(sentence) {
*map.entry(word).or_insert(0) += 1;
}
map
}
/// Word count using iterator fold — more functional style.
pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
tokenize(sentence)
.into_iter()
.fold(HashMap::new(), |mut map, word| {
*map.entry(word).or_insert(0) += 1;
map
})
}
/// Ordered word count using BTreeMap (like OCaml's Map).
pub fn word_count_ordered(sentence: &str) -> BTreeMap<String, usize> {
let mut map = BTreeMap::new();
for word in tokenize(sentence) {
*map.entry(word).or_insert(0) += 1;
}
map
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_basic() {
let m = word_count("the cat sat on the mat");
assert_eq!(m.get("the"), Some(&2));
assert_eq!(m.get("cat"), Some(&1));
}
#[test]
fn test_punctuation() {
let m = word_count("the cat sat on the mat, the cat sat");
assert_eq!(m.get("the"), Some(&3));
assert_eq!(m.get("cat"), Some(&2));
assert_eq!(m.get("sat"), Some(&2));
}
#[test]
fn test_case_insensitive() {
let m = word_count("The THE the");
assert_eq!(m.get("the"), Some(&3));
}
#[test]
fn test_empty() {
let m = word_count("");
assert!(m.is_empty());
}
#[test]
fn test_single_word() {
let m = word_count("hello");
assert_eq!(m.get("hello"), Some(&1));
assert_eq!(m.len(), 1);
}
#[test]
fn test_fold_matches() {
let s = "the cat sat on the mat";
assert_eq!(word_count(s), word_count_fold(s));
}
#[test]
fn test_ordered() {
let m = word_count_ordered("banana apple cherry apple");
let keys: Vec<_> = m.keys().collect();
assert_eq!(keys, vec!["apple", "banana", "cherry"]);
}
}Key Differences
| Aspect | Rust | OCaml |
|---|---|---|
| Default map | HashMap — O(1) average, unordered | Map.Make(M) — O(log n), always ordered |
| Hash table | HashMap | Hashtbl (mutable) |
| Entry API | entry().or_insert(0) — single lookup | try find ... with Not_found -> 0 + add |
| Ordered map | BTreeMap | Map.Make(M) |
| Tokenization | Char-level is_alphanumeric() | String.iter with manual predicate |
HashMap is the default Rust choice for frequency counting due to O(1) average-case insertion and lookup. Use BTreeMap when sorted output is needed, accepting O(log n) overhead.
OCaml Approach
let tokenize s =
let s = String.lowercase_ascii s in
let buf = Buffer.create 16 in
let words = ref [] in
String.iter (fun c ->
if Char.code c land 127 |> Char.chr |> (fun c -> Char.code c >= 48)
(* simplified: use is_alnum predicate *)
then Buffer.add_char buf c
else if Buffer.length buf > 0 then begin
words := Buffer.contents buf :: !words;
Buffer.clear buf
end
) s;
if Buffer.length buf > 0 then words := Buffer.contents buf :: !words;
List.rev !words
module StringMap = Map.Make(String)
let word_count sentence =
List.fold_left (fun acc word ->
let n = try StringMap.find word acc with Not_found -> 0 in
StringMap.add word (n + 1) acc
) StringMap.empty (tokenize sentence)
OCaml's Map.Make(String) creates a balanced BST map keyed by strings — always ordered, like Rust's BTreeMap. OCaml has no HashMap in the standard library (the Hashtbl module provides mutable hash tables).
Full Source
#![allow(clippy::all)]
use std::collections::BTreeMap;
/// Word Count with Map
///
/// Building a word-frequency map from text. Demonstrates string
/// normalization, splitting, and folding into a map.
use std::collections::HashMap;
/// Tokenize: lowercase and extract alphanumeric words.
pub fn tokenize(s: &str) -> Vec<String> {
let s = s.to_lowercase();
let mut words = Vec::new();
let mut buf = String::new();
for c in s.chars() {
if c.is_alphanumeric() {
buf.push(c);
} else if !buf.is_empty() {
words.push(buf.clone());
buf.clear();
}
}
if !buf.is_empty() {
words.push(buf);
}
words
}
/// Word count using HashMap — O(1) average lookup.
pub fn word_count(sentence: &str) -> HashMap<String, usize> {
let mut map = HashMap::new();
for word in tokenize(sentence) {
*map.entry(word).or_insert(0) += 1;
}
map
}
/// Word count using iterator fold — more functional style.
pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
tokenize(sentence)
.into_iter()
.fold(HashMap::new(), |mut map, word| {
*map.entry(word).or_insert(0) += 1;
map
})
}
/// Ordered word count using BTreeMap (like OCaml's Map).
pub fn word_count_ordered(sentence: &str) -> BTreeMap<String, usize> {
let mut map = BTreeMap::new();
for word in tokenize(sentence) {
*map.entry(word).or_insert(0) += 1;
}
map
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_basic() {
let m = word_count("the cat sat on the mat");
assert_eq!(m.get("the"), Some(&2));
assert_eq!(m.get("cat"), Some(&1));
}
#[test]
fn test_punctuation() {
let m = word_count("the cat sat on the mat, the cat sat");
assert_eq!(m.get("the"), Some(&3));
assert_eq!(m.get("cat"), Some(&2));
assert_eq!(m.get("sat"), Some(&2));
}
#[test]
fn test_case_insensitive() {
let m = word_count("The THE the");
assert_eq!(m.get("the"), Some(&3));
}
#[test]
fn test_empty() {
let m = word_count("");
assert!(m.is_empty());
}
#[test]
fn test_single_word() {
let m = word_count("hello");
assert_eq!(m.get("hello"), Some(&1));
assert_eq!(m.len(), 1);
}
#[test]
fn test_fold_matches() {
let s = "the cat sat on the mat";
assert_eq!(word_count(s), word_count_fold(s));
}
#[test]
fn test_ordered() {
let m = word_count_ordered("banana apple cherry apple");
let keys: Vec<_> = m.keys().collect();
assert_eq!(keys, vec!["apple", "banana", "cherry"]);
}
}#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_basic() {
let m = word_count("the cat sat on the mat");
assert_eq!(m.get("the"), Some(&2));
assert_eq!(m.get("cat"), Some(&1));
}
#[test]
fn test_punctuation() {
let m = word_count("the cat sat on the mat, the cat sat");
assert_eq!(m.get("the"), Some(&3));
assert_eq!(m.get("cat"), Some(&2));
assert_eq!(m.get("sat"), Some(&2));
}
#[test]
fn test_case_insensitive() {
let m = word_count("The THE the");
assert_eq!(m.get("the"), Some(&3));
}
#[test]
fn test_empty() {
let m = word_count("");
assert!(m.is_empty());
}
#[test]
fn test_single_word() {
let m = word_count("hello");
assert_eq!(m.get("hello"), Some(&1));
assert_eq!(m.len(), 1);
}
#[test]
fn test_fold_matches() {
let s = "the cat sat on the mat";
assert_eq!(word_count(s), word_count_fold(s));
}
#[test]
fn test_ordered() {
let m = word_count_ordered("banana apple cherry apple");
let keys: Vec<_> = m.keys().collect();
assert_eq!(keys, vec!["apple", "banana", "cherry"]);
}
}
Deep Comparison
Word Count with Map: OCaml vs Rust
The Core Insight
Word counting combines string processing with map operations — a practical pattern that reveals how each language handles mutable vs immutable data structures. OCaml builds a new map on every insert; Rust mutates a HashMap in place. Both are correct; the performance characteristics differ significantly.
OCaml Approach
OCaml uses StringMap (from Map.Make(String)) — an immutable balanced BST. Each StringMap.add creates a new map sharing most of its structure with the old one. The fold accumulates by threading the map through each word. String tokenization uses a mutable Buffer and ref — one of the few places where mutation is pragmatic in OCaml. Option.value ~default:0 handles the missing-key case.
Rust Approach
Rust's HashMap is a mutable hash table with O(1) amortized insert/lookup. The entry API (map.entry(word).or_insert(0)) is a Rust-specific ergonomic pattern that handles both insert and update in one call. The fold variant passes mut map through the closure. BTreeMap is available when ordered output is needed (like OCaml's Map). Tokenization uses iterators and String::push for building words.
Side-by-Side
| Concept | OCaml | Rust |
|---|---|---|
| Map type | StringMap (immutable BST) | HashMap (mutable hash table) |
| Insert | StringMap.add k v m → new map | map.entry(k).or_insert(0) += 1 |
| Lookup | StringMap.find_opt k m | map.get(&k) |
| Ordered | Yes (BST) | BTreeMap for ordered, HashMap for fast |
| Tokenization | Buffer + ref (mutable) | String::push + Vec |
| Complexity | O(n log n) total | O(n) amortized total |
What Rust Learners Should Notice
entry API is uniquely powerful: it returns a mutable reference to the value (inserting a default if missing), avoiding double-lookup*map.entry(word).or_insert(0) += 1 is the idiomatic one-liner for frequency counting in RustHashMap is unordered; use BTreeMap if you need sorted keys (like OCaml's Map)to_lowercase()) returns a new String in Rust — strings are always UTF-8, never mutated in placeFurther Reading
Exercises
top_n(map, n) function that returns the n most frequent words using sort_by.word_count_parallel that splits the text into chunks, counts each chunk in a thread, then merges the maps.bigram_count that counts frequency of consecutive word pairs.HashMap vs BTreeMap performance on a 10,000-word corpus with a benchmark.