Frequency Analysis — Letter Distribution
Tutorial Video
Text description (accessibility)
This video demonstrates the "Frequency Analysis — Letter Distribution" functional Rust example. Difficulty level: Intermediate. Key concepts covered: String Processing. Count how many times each letter (a–z) appears in a string, ignoring case and non-alphabetic characters, then return results sorted by frequency descending. Key difference from OCaml: 1. **Map type:** OCaml uses `Map.Make(Char)` (balanced BST, sorted by key); Rust offers `HashMap` (hash table, unordered) or `BTreeMap` (B
Tutorial
The Problem
Count how many times each letter (a–z) appears in a string, ignoring case and non-alphabetic characters, then return results sorted by frequency descending.
🎯 Learning Outcomes
HashMap::entry with or_insert replaces OCaml's Map.update for in-place countingBTreeMap mirrors OCaml's Map.Make(Char) — both keep keys sorted, both are O(log n) per operation.filter().map().fold()) replace OCaml's String.fold_left pipelineHashMap) and tree-based (BTreeMap) maps and when to choose each🦀 The Rust Way
Rust offers two natural equivalents: HashMap<char, usize> for O(1) average access and BTreeMap<char, usize> for O(log n) with keys in sorted order — the latter directly mirrors OCaml's Map.Make. The entry().or_insert(0) pattern compresses OCaml's update ... function None -> Some 1 | Some n -> Some (n+1) into a single idiomatic expression.
Code Example
use std::collections::HashMap;
pub fn frequency(s: &str) -> HashMap<char, usize> {
s.chars()
.filter(|c| c.is_ascii_alphabetic())
.map(|c| c.to_ascii_lowercase())
.fold(HashMap::new(), |mut map, c| {
*map.entry(c).or_insert(0) += 1;
map
})
}
pub fn sorted_freq(s: &str) -> Vec<(char, usize)> {
let mut pairs: Vec<(char, usize)> = frequency(s).into_iter().collect();
pairs.sort_by(|(c1, n1), (c2, n2)| n2.cmp(n1).then(c1.cmp(c2)));
pairs
}Key Differences
Map.Make(Char) (balanced BST, sorted by key); Rust offers HashMap (hash table, unordered) or BTreeMap (B-tree, sorted by key)CMap.update c f m takes a function over option; Rust's map.entry(c).or_insert(0) returns a mutable reference for in-place mutationString.fold_left threads state; Rust's .chars().filter().map().fold() composes the same pipeline with explicit typesList.sort (fun (_, a) (_, b) -> compare b a) sorts by value descending; Rust's .sort_by(|(c1,n1),(c2,n2)| n2.cmp(n1).then(c1.cmp(c2))) adds a stable tiebreakerOCaml Approach
OCaml uses a functor Map.Make(Char) to create a balanced BST map keyed on char. The String.fold_left threads a map accumulator through each character, calling CMap.update to increment or initialize counts. The map is then converted to a sorted binding list for display.
Full Source
#![allow(clippy::all)]
use std::collections::{BTreeMap, HashMap};
/// Solution 1: Idiomatic Rust — HashMap with fold
/// Uses entry API for efficient in-place counting
pub fn frequency(s: &str) -> HashMap<char, usize> {
s.chars()
.filter(|c| c.is_ascii_alphabetic())
.map(|c| c.to_ascii_lowercase())
.fold(HashMap::new(), |mut map, c| {
*map.entry(c).or_insert(0) += 1;
map
})
}
/// Solution 2: BTreeMap — mirrors OCaml's Map.Make(Char), keys stay sorted
/// Equivalent to OCaml's `CMap.update c (function None -> Some 1 | Some n -> Some (n+1)) m`
pub fn frequency_btree(s: &str) -> BTreeMap<char, usize> {
s.chars()
.filter(|c| c.is_ascii_alphabetic())
.map(|c| c.to_ascii_lowercase())
.fold(BTreeMap::new(), |mut map, c| {
*map.entry(c).or_insert(0) += 1;
map
})
}
/// Solution 3: Sorted by frequency descending, ties broken alphabetically
/// Mirrors OCaml's `List.sort (fun (_, a) (_, b) -> compare b a)`
pub fn sorted_freq(s: &str) -> Vec<(char, usize)> {
let mut pairs: Vec<(char, usize)> = frequency(s).into_iter().collect();
pairs.sort_by(|(c1, n1), (c2, n2)| n2.cmp(n1).then(c1.cmp(c2)));
pairs
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_empty_string() {
assert_eq!(frequency(""), HashMap::new());
assert!(sorted_freq("").is_empty());
}
#[test]
fn test_single_character() {
let freq = frequency("a");
assert_eq!(freq[&'a'], 1);
assert_eq!(freq.len(), 1);
}
#[test]
fn test_case_insensitive() {
let freq = frequency("AaAa");
assert_eq!(freq[&'a'], 4);
assert_eq!(freq.len(), 1);
}
#[test]
fn test_non_alpha_ignored() {
let freq = frequency("a1b2c! a");
assert_eq!(freq[&'a'], 2);
assert_eq!(freq[&'b'], 1);
assert_eq!(freq[&'c'], 1);
assert_eq!(freq.len(), 3);
}
#[test]
fn test_pangram_all_letters_present() {
let text = "The quick brown fox jumps over the lazy dog";
let freq = frequency(text);
// All 26 letters appear at least once in a pangram
assert_eq!(freq.len(), 26);
// 'e' and 'o' appear most in this pangram
assert_eq!(freq[&'e'], 3);
assert_eq!(freq[&'o'], 4);
}
#[test]
fn test_btree_sorted_by_key() {
let freq = frequency_btree("bac");
let keys: Vec<char> = freq.keys().copied().collect();
assert_eq!(keys, vec!['a', 'b', 'c']);
}
#[test]
fn test_sorted_freq_descending() {
let result = sorted_freq("aaabbc");
assert_eq!(result[0], ('a', 3));
assert_eq!(result[1], ('b', 2));
assert_eq!(result[2], ('c', 1));
}
#[test]
fn test_sorted_freq_ties_broken_alphabetically() {
// 'a' and 'b' both appear twice — 'a' should come first
let result = sorted_freq("aabb");
assert_eq!(result[0], ('a', 2));
assert_eq!(result[1], ('b', 2));
}
}#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_empty_string() {
assert_eq!(frequency(""), HashMap::new());
assert!(sorted_freq("").is_empty());
}
#[test]
fn test_single_character() {
let freq = frequency("a");
assert_eq!(freq[&'a'], 1);
assert_eq!(freq.len(), 1);
}
#[test]
fn test_case_insensitive() {
let freq = frequency("AaAa");
assert_eq!(freq[&'a'], 4);
assert_eq!(freq.len(), 1);
}
#[test]
fn test_non_alpha_ignored() {
let freq = frequency("a1b2c! a");
assert_eq!(freq[&'a'], 2);
assert_eq!(freq[&'b'], 1);
assert_eq!(freq[&'c'], 1);
assert_eq!(freq.len(), 3);
}
#[test]
fn test_pangram_all_letters_present() {
let text = "The quick brown fox jumps over the lazy dog";
let freq = frequency(text);
// All 26 letters appear at least once in a pangram
assert_eq!(freq.len(), 26);
// 'e' and 'o' appear most in this pangram
assert_eq!(freq[&'e'], 3);
assert_eq!(freq[&'o'], 4);
}
#[test]
fn test_btree_sorted_by_key() {
let freq = frequency_btree("bac");
let keys: Vec<char> = freq.keys().copied().collect();
assert_eq!(keys, vec!['a', 'b', 'c']);
}
#[test]
fn test_sorted_freq_descending() {
let result = sorted_freq("aaabbc");
assert_eq!(result[0], ('a', 3));
assert_eq!(result[1], ('b', 2));
assert_eq!(result[2], ('c', 1));
}
#[test]
fn test_sorted_freq_ties_broken_alphabetically() {
// 'a' and 'b' both appear twice — 'a' should come first
let result = sorted_freq("aabb");
assert_eq!(result[0], ('a', 2));
assert_eq!(result[1], ('b', 2));
}
}
Deep Comparison
OCaml vs Rust: Frequency Analysis — Letter Distribution
Side-by-Side Code
OCaml
module CMap = Map.Make(Char)
let frequency s =
String.fold_left (fun m c ->
let c = Char.lowercase_ascii c in
if c >= 'a' && c <= 'z' then
CMap.update c (function None -> Some 1 | Some n -> Some (n+1)) m
else m
) CMap.empty s
let sorted_freq s =
frequency s |> CMap.bindings
|> List.sort (fun (_, a) (_, b) -> compare b a)
Rust (idiomatic — HashMap)
use std::collections::HashMap;
pub fn frequency(s: &str) -> HashMap<char, usize> {
s.chars()
.filter(|c| c.is_ascii_alphabetic())
.map(|c| c.to_ascii_lowercase())
.fold(HashMap::new(), |mut map, c| {
*map.entry(c).or_insert(0) += 1;
map
})
}
pub fn sorted_freq(s: &str) -> Vec<(char, usize)> {
let mut pairs: Vec<(char, usize)> = frequency(s).into_iter().collect();
pairs.sort_by(|(c1, n1), (c2, n2)| n2.cmp(n1).then(c1.cmp(c2)));
pairs
}
Rust (BTreeMap — mirrors OCaml's Map.Make, keys sorted)
use std::collections::BTreeMap;
pub fn frequency_btree(s: &str) -> BTreeMap<char, usize> {
s.chars()
.filter(|c| c.is_ascii_alphabetic())
.map(|c| c.to_ascii_lowercase())
.fold(BTreeMap::new(), |mut map, c| {
*map.entry(c).or_insert(0) += 1;
map
})
}
Type Signatures
| Concept | OCaml | Rust |
|---|---|---|
| Frequency map type | CMap.t (≈ Char -> int) | HashMap<char, usize> or BTreeMap<char, usize> |
| Function signature | val frequency : string -> CMap.t | fn frequency(s: &str) -> HashMap<char, usize> |
| Sorted result | (char * int) list | Vec<(char, usize)> |
| Map entry update | CMap.update c f m | *map.entry(c).or_insert(0) += 1 |
| String fold | String.fold_left f init s | s.chars().fold(init, f) |
| Bindings extraction | CMap.bindings m | map.into_iter().collect() |
Key Insights
BTreeMap is the faithful OCaml equivalent.** OCaml's Map.Make(Char) is a balanced BST with O(log n) operations and keys always in sorted order. Rust's BTreeMap<char, usize> is structurally identical. HashMap is faster on average but unordered — a deliberate trade-off.entry().or_insert() beats OCaml's update.** OCaml requires a function option -> option to express "insert 1 or increment". Rust's entry API returns a &mut usize, enabling += 1 directly — no pattern match needed, zero allocation, and no intermediate closure.String.fold_left (fun m c -> ...) CMap.empty s threads state explicitly. Rust's .chars().filter().map().fold() is the same pipeline with typed stages. Both are lazy in spirit; Rust's iterators are actually zero-cost lazy.Char.lowercase_ascii; Rust uses .to_ascii_lowercase(). Both operate on ASCII only — correct for letter frequency on ASCII text and cheaper than Unicode case folding.List.sort (fun (_, a) (_, b) -> compare b a) only compares counts, leaving ties non-deterministic (sort is not stable in all implementations). Rust's .sort_by(|(c1,n1),(c2,n2)| n2.cmp(n1).then(c1.cmp(c2))) adds alphabetical tiebreaking, producing a fully deterministic result — important for testing and reproducibility.When to Use Each Style
**Use HashMap (idiomatic Rust) when:** you need maximum throughput and don't care about key order — e.g., building frequency tables for large corpora where you'll sort the results anyway.
**Use BTreeMap (OCaml-equivalent) when:** you need keys in sorted order without a separate sort step, or when you want structural parity with OCaml code for comparison/porting purposes.
Exercises
HashMap<char, f64>) where values sum to 1.0, then compute the cosine similarity between two texts.index_of_coincidence — the probability that two randomly selected letters from a text are the same — and use it to estimate whether a text is in English.