978 Count Min Sketch
Tutorial
The Problem
Implement a Count-Min Sketch — a probabilistic frequency estimation structure that uses O(depth × width) space independent of the number of distinct keys. Insert items and query frequency in O(depth) time. The frequency estimate equals the minimum counter across all rows, providing an upper bound with tunable error probability.
🎯 Learning Outcomes
depth × width counter table: Vec<Vec<u64>>depth independent hash functions by parameterizing with different seedsupdate(key, delta) that increments one column per rowquery(key) -> u64 that returns min over all row counters — always >= true frequencyw = ceil(e/ε) and depth d = ceil(ln(1/δ)), the estimate exceeds true count by more than ε * N with probability at most δCode Example
#![allow(clippy::all)]
// 978: Count-Min Sketch
// Frequency estimation: O(1) update/query, O(depth × width) space
// Uses d hash functions × w counters; estimate = min over d rows
// Approach 1: Count-Min Sketch with multiple hash seeds
fn hash(seed: u64, s: &str) -> u64 {
s.bytes().fold(seed, |h, b| {
h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64
})
}
pub struct CountMinSketch {
table: Vec<Vec<u64>>, // depth rows × width cols
seeds: Vec<u64>,
width: usize,
depth: usize,
}
impl CountMinSketch {
pub fn new(width: usize, depth: usize) -> Self {
let seeds = vec![31, 37, 41, 43, 47, 53, 59, 61, 67, 71];
let depth_seeds: Vec<u64> = (0..depth).map(|i| seeds[i % seeds.len()]).collect();
CountMinSketch {
table: vec![vec![0u64; width]; depth],
seeds: depth_seeds,
width,
depth,
}
}
fn column(&self, row: usize, key: &str) -> usize {
(hash(self.seeds[row], key) as usize) % self.width
}
/// Increment count for key by delta
pub fn update(&mut self, key: &str, delta: u64) {
for i in 0..self.depth {
let col = self.column(i, key);
self.table[i][col] += delta;
}
}
/// Estimate frequency: minimum over all rows
pub fn query(&self, key: &str) -> u64 {
(0..self.depth)
.map(|i| self.table[i][self.column(i, key)])
.min()
.unwrap_or(0)
}
/// Total events tracked (sum of row 0, approximate)
pub fn total(&self) -> u64 {
self.table[0].iter().sum()
}
}
// Approach 2: Heavy Hitter tracking with Count-Min
pub struct FrequencyTracker {
sketch: CountMinSketch,
total_events: u64,
}
impl FrequencyTracker {
pub fn new(width: usize, depth: usize) -> Self {
FrequencyTracker {
sketch: CountMinSketch::new(width, depth),
total_events: 0,
}
}
pub fn add(&mut self, key: &str) {
self.sketch.update(key, 1);
self.total_events += 1;
}
pub fn estimate_count(&self, key: &str) -> u64 {
self.sketch.query(key)
}
pub fn estimate_frequency(&self, key: &str) -> f64 {
if self.total_events == 0 {
return 0.0;
}
self.estimate_count(key) as f64 / self.total_events as f64
}
pub fn total(&self) -> u64 {
self.total_events
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_no_underestimate() {
let mut sk = CountMinSketch::new(100, 5);
for _ in 0..100 {
sk.update("apple", 1);
}
for _ in 0..50 {
sk.update("banana", 1);
}
for _ in 0..25 {
sk.update("cherry", 1);
}
// Count-Min never underestimates
assert!(sk.query("apple") >= 100);
assert!(sk.query("banana") >= 50);
assert!(sk.query("cherry") >= 25);
}
#[test]
fn test_unseen_items_low() {
let mut sk = CountMinSketch::new(1000, 5);
for i in 0..100 {
sk.update(&format!("item_{}", i), 1);
}
// Unseen items should have very low count
let unseen = sk.query("completely_unseen_item_xyz");
assert!(unseen < 5, "unseen item count too high: {}", unseen);
}
#[test]
fn test_batch_update() {
let mut sk = CountMinSketch::new(100, 4);
sk.update("key", 100);
assert!(sk.query("key") >= 100);
}
#[test]
fn test_frequency_tracker() {
let mut tracker = FrequencyTracker::new(200, 4);
for _ in 0..900 {
tracker.add("hot");
}
for _ in 0..100 {
tracker.add("cold");
}
assert_eq!(tracker.total(), 1000);
let hot_freq = tracker.estimate_frequency("hot");
assert!(
hot_freq >= 0.9,
"hot frequency {:.3} should be >= 0.9",
hot_freq
);
let cold_freq = tracker.estimate_frequency("cold");
assert!(
cold_freq >= 0.1,
"cold frequency {:.3} should be >= 0.1",
cold_freq
);
}
#[test]
fn test_multiple_deltas() {
let mut sk = CountMinSketch::new(100, 4);
sk.update("x", 50);
sk.update("x", 50);
assert!(sk.query("x") >= 100);
}
}Key Differences
| Aspect | Rust | OCaml |
|---|---|---|
| Hash overflow | wrapping_mul/add — defined behavior | Int64 wrapping or land max_int |
| Counter type | u64 — no overflow for most counts | int — 63-bit; risk for very large counts |
min over rows | .min().unwrap_or(0) | Array.fold_left min max_int ... |
| Table allocation | vec![vec![0u64; width]; depth] | Array.init depth (fun _ -> Array.make ...) |
Count-Min Sketches are used in network traffic analysis, database query optimization, and streaming analytics. They trade exactness for bounded space — ideal when the key space is too large for a full frequency map.
OCaml Approach
let hash seed s =
String.fold_left (fun h b ->
Int64.(to_int (logxor
(add (mul (of_int h) (of_int seed)) (of_int (Char.code b)))
(of_int (Char.code b))))
) (Int64.to_int seed) s
type t = {
table: int array array;
seeds: int array;
width: int;
depth: int;
}
let create width depth =
let seeds = [|31;37;41;43;47;53;59;61;67;71|] in
{ table = Array.init depth (fun _ -> Array.make width 0);
seeds = Array.init depth (fun i -> seeds.(i mod 10));
width; depth }
let update cms key delta =
Array.iteri (fun i seed ->
let col = (abs (hash seed key)) mod cms.width in
cms.table.(i).(col) <- cms.table.(i).(col) + delta
) cms.seeds
let query cms key =
Array.fold_left (fun acc (i, seed) ->
min acc cms.table.(i).((abs (hash seed key)) mod cms.width)
) max_int (Array.mapi (fun i s -> (i, s)) cms.seeds)
OCaml's Int64 arithmetic for hashing avoids the GC tag issue with overflow. The sketch structure is identical; only the hash implementation differs in verbosity.
Full Source
#![allow(clippy::all)]
// 978: Count-Min Sketch
// Frequency estimation: O(1) update/query, O(depth × width) space
// Uses d hash functions × w counters; estimate = min over d rows
// Approach 1: Count-Min Sketch with multiple hash seeds
fn hash(seed: u64, s: &str) -> u64 {
s.bytes().fold(seed, |h, b| {
h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64
})
}
pub struct CountMinSketch {
table: Vec<Vec<u64>>, // depth rows × width cols
seeds: Vec<u64>,
width: usize,
depth: usize,
}
impl CountMinSketch {
pub fn new(width: usize, depth: usize) -> Self {
let seeds = vec![31, 37, 41, 43, 47, 53, 59, 61, 67, 71];
let depth_seeds: Vec<u64> = (0..depth).map(|i| seeds[i % seeds.len()]).collect();
CountMinSketch {
table: vec![vec![0u64; width]; depth],
seeds: depth_seeds,
width,
depth,
}
}
fn column(&self, row: usize, key: &str) -> usize {
(hash(self.seeds[row], key) as usize) % self.width
}
/// Increment count for key by delta
pub fn update(&mut self, key: &str, delta: u64) {
for i in 0..self.depth {
let col = self.column(i, key);
self.table[i][col] += delta;
}
}
/// Estimate frequency: minimum over all rows
pub fn query(&self, key: &str) -> u64 {
(0..self.depth)
.map(|i| self.table[i][self.column(i, key)])
.min()
.unwrap_or(0)
}
/// Total events tracked (sum of row 0, approximate)
pub fn total(&self) -> u64 {
self.table[0].iter().sum()
}
}
// Approach 2: Heavy Hitter tracking with Count-Min
pub struct FrequencyTracker {
sketch: CountMinSketch,
total_events: u64,
}
impl FrequencyTracker {
pub fn new(width: usize, depth: usize) -> Self {
FrequencyTracker {
sketch: CountMinSketch::new(width, depth),
total_events: 0,
}
}
pub fn add(&mut self, key: &str) {
self.sketch.update(key, 1);
self.total_events += 1;
}
pub fn estimate_count(&self, key: &str) -> u64 {
self.sketch.query(key)
}
pub fn estimate_frequency(&self, key: &str) -> f64 {
if self.total_events == 0 {
return 0.0;
}
self.estimate_count(key) as f64 / self.total_events as f64
}
pub fn total(&self) -> u64 {
self.total_events
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_no_underestimate() {
let mut sk = CountMinSketch::new(100, 5);
for _ in 0..100 {
sk.update("apple", 1);
}
for _ in 0..50 {
sk.update("banana", 1);
}
for _ in 0..25 {
sk.update("cherry", 1);
}
// Count-Min never underestimates
assert!(sk.query("apple") >= 100);
assert!(sk.query("banana") >= 50);
assert!(sk.query("cherry") >= 25);
}
#[test]
fn test_unseen_items_low() {
let mut sk = CountMinSketch::new(1000, 5);
for i in 0..100 {
sk.update(&format!("item_{}", i), 1);
}
// Unseen items should have very low count
let unseen = sk.query("completely_unseen_item_xyz");
assert!(unseen < 5, "unseen item count too high: {}", unseen);
}
#[test]
fn test_batch_update() {
let mut sk = CountMinSketch::new(100, 4);
sk.update("key", 100);
assert!(sk.query("key") >= 100);
}
#[test]
fn test_frequency_tracker() {
let mut tracker = FrequencyTracker::new(200, 4);
for _ in 0..900 {
tracker.add("hot");
}
for _ in 0..100 {
tracker.add("cold");
}
assert_eq!(tracker.total(), 1000);
let hot_freq = tracker.estimate_frequency("hot");
assert!(
hot_freq >= 0.9,
"hot frequency {:.3} should be >= 0.9",
hot_freq
);
let cold_freq = tracker.estimate_frequency("cold");
assert!(
cold_freq >= 0.1,
"cold frequency {:.3} should be >= 0.1",
cold_freq
);
}
#[test]
fn test_multiple_deltas() {
let mut sk = CountMinSketch::new(100, 4);
sk.update("x", 50);
sk.update("x", 50);
assert!(sk.query("x") >= 100);
}
}#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_no_underestimate() {
let mut sk = CountMinSketch::new(100, 5);
for _ in 0..100 {
sk.update("apple", 1);
}
for _ in 0..50 {
sk.update("banana", 1);
}
for _ in 0..25 {
sk.update("cherry", 1);
}
// Count-Min never underestimates
assert!(sk.query("apple") >= 100);
assert!(sk.query("banana") >= 50);
assert!(sk.query("cherry") >= 25);
}
#[test]
fn test_unseen_items_low() {
let mut sk = CountMinSketch::new(1000, 5);
for i in 0..100 {
sk.update(&format!("item_{}", i), 1);
}
// Unseen items should have very low count
let unseen = sk.query("completely_unseen_item_xyz");
assert!(unseen < 5, "unseen item count too high: {}", unseen);
}
#[test]
fn test_batch_update() {
let mut sk = CountMinSketch::new(100, 4);
sk.update("key", 100);
assert!(sk.query("key") >= 100);
}
#[test]
fn test_frequency_tracker() {
let mut tracker = FrequencyTracker::new(200, 4);
for _ in 0..900 {
tracker.add("hot");
}
for _ in 0..100 {
tracker.add("cold");
}
assert_eq!(tracker.total(), 1000);
let hot_freq = tracker.estimate_frequency("hot");
assert!(
hot_freq >= 0.9,
"hot frequency {:.3} should be >= 0.9",
hot_freq
);
let cold_freq = tracker.estimate_frequency("cold");
assert!(
cold_freq >= 0.1,
"cold frequency {:.3} should be >= 0.1",
cold_freq
);
}
#[test]
fn test_multiple_deltas() {
let mut sk = CountMinSketch::new(100, 4);
sk.update("x", 50);
sk.update("x", 50);
assert!(sk.query("x") >= 100);
}
}
Deep Comparison
Count-Min Sketch — Comparison
Core Insight
A Count-Min Sketch maintains a depth × width array of counters. update(key, delta) increments one counter per row (using d different hash functions). query(key) returns the minimum counter across all rows. The minimum gives the best estimate — other counters may be inflated by collisions. Guaranteed to never underestimate (hence "count-min"). Uses O(d × w) space for any number of distinct keys.
OCaml Approach
Array.make_matrix depth width 0 — 2D counter arraymake_hash seed returns a closure capturing the seed (higher-order function)make_hashes d creates an array of d hash functionsString.fold_left (fun h c -> h * seed lxor Char.code c) seed s — polynomial hashfor i = 0 to sk.depth - 1 loop for update/querymin_count := min !min_count v — track minimum across rowsRust Approach
Vec<Vec<u64>> — 2D counter array (depth rows, width columns)seeds: Vec<u64> — seed values for each row's hashs.bytes().fold(seed, |h, b| h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64)wrapping_mul/wrapping_add — explicit overflow handling (OCaml wraps silently)(0..self.depth).map(...).min().unwrap_or(0) — functional min over rowsFrequencyTracker wraps sketch + total events for frequency estimationComparison Table
| Aspect | OCaml | Rust |
|---|---|---|
| Table | int array array | Vec<Vec<u64>> |
| Hash fn | Closure fun s -> ... | fn hash(seed, key) -> u64 |
| Hash accumulate | String.fold_left (fun h c -> h * seed lxor code c) seed s | s.bytes().fold(seed, \|h,b\| wrapping_mul.wrapping_add^b) |
| Overflow | Silent wrap | wrapping_mul, wrapping_add |
| Min query | for loop + ref min_count | .map().min().unwrap_or(0) |
| Multiple hashes | (string -> int) array | seeds: Vec<u64> |
| Counter type | int (63-bit) | u64 (64-bit) |
| Space | O(d × w × 8 bytes) | O(d × w × 8 bytes) |
| Error bound | ε = e/w, δ = e^(-d) | Same theoretical guarantee |
Exercises
merge(other: &CountMinSketch) that adds two sketches of equal dimensions (element-wise sum).decay(&mut self, factor: f64) that multiplies all counters by factor for time-decayed frequency estimation.N/k are "heavy hitters".CountMinSketch vs HashMap<String, u64> for 100,000 distinct keys.