978 Fundamental

978 Count Min Sketch

Functional Programming

Tutorial

The Problem

Implement a Count-Min Sketch — a probabilistic frequency estimation structure that uses O(depth × width) space independent of the number of distinct keys. Insert items and query frequency in O(depth) time. The frequency estimate equals the minimum counter across all rows, providing an upper bound with tunable error probability.

🎯 Learning Outcomes

• Implement a depth × width counter table: Vec<Vec<u64>>

• Define depth independent hash functions by parameterizing with different seeds

• Implement update(key, delta) that increments one column per row

• Implement query(key) -> u64 that returns min over all row counters — always >= true frequency

• Understand the error bound: with width w = ceil(e/ε) and depth d = ceil(ln(1/δ)), the estimate exceeds true count by more than ε * N with probability at most δ

Code Example

#![allow(clippy::all)]
// 978: Count-Min Sketch
// Frequency estimation: O(1) update/query, O(depth × width) space
// Uses d hash functions × w counters; estimate = min over d rows

// Approach 1: Count-Min Sketch with multiple hash seeds
fn hash(seed: u64, s: &str) -> u64 {
    s.bytes().fold(seed, |h, b| {
        h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64
    })
}

pub struct CountMinSketch {
    table: Vec<Vec<u64>>, // depth rows × width cols
    seeds: Vec<u64>,
    width: usize,
    depth: usize,
}

impl CountMinSketch {
    pub fn new(width: usize, depth: usize) -> Self {
        let seeds = vec![31, 37, 41, 43, 47, 53, 59, 61, 67, 71];
        let depth_seeds: Vec<u64> = (0..depth).map(|i| seeds[i % seeds.len()]).collect();
        CountMinSketch {
            table: vec![vec![0u64; width]; depth],
            seeds: depth_seeds,
            width,
            depth,
        }
    }

    fn column(&self, row: usize, key: &str) -> usize {
        (hash(self.seeds[row], key) as usize) % self.width
    }

    /// Increment count for key by delta
    pub fn update(&mut self, key: &str, delta: u64) {
        for i in 0..self.depth {
            let col = self.column(i, key);
            self.table[i][col] += delta;
        }
    }

    /// Estimate frequency: minimum over all rows
    pub fn query(&self, key: &str) -> u64 {
        (0..self.depth)
            .map(|i| self.table[i][self.column(i, key)])
            .min()
            .unwrap_or(0)
    }

    /// Total events tracked (sum of row 0, approximate)
    pub fn total(&self) -> u64 {
        self.table[0].iter().sum()
    }
}

// Approach 2: Heavy Hitter tracking with Count-Min
pub struct FrequencyTracker {
    sketch: CountMinSketch,
    total_events: u64,
}

impl FrequencyTracker {
    pub fn new(width: usize, depth: usize) -> Self {
        FrequencyTracker {
            sketch: CountMinSketch::new(width, depth),
            total_events: 0,
        }
    }

    pub fn add(&mut self, key: &str) {
        self.sketch.update(key, 1);
        self.total_events += 1;
    }

    pub fn estimate_count(&self, key: &str) -> u64 {
        self.sketch.query(key)
    }

    pub fn estimate_frequency(&self, key: &str) -> f64 {
        if self.total_events == 0 {
            return 0.0;
        }
        self.estimate_count(key) as f64 / self.total_events as f64
    }

    pub fn total(&self) -> u64 {
        self.total_events
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_no_underestimate() {
        let mut sk = CountMinSketch::new(100, 5);
        for _ in 0..100 {
            sk.update("apple", 1);
        }
        for _ in 0..50 {
            sk.update("banana", 1);
        }
        for _ in 0..25 {
            sk.update("cherry", 1);
        }

        // Count-Min never underestimates
        assert!(sk.query("apple") >= 100);
        assert!(sk.query("banana") >= 50);
        assert!(sk.query("cherry") >= 25);
    }

    #[test]
    fn test_unseen_items_low() {
        let mut sk = CountMinSketch::new(1000, 5);
        for i in 0..100 {
            sk.update(&format!("item_{}", i), 1);
        }

        // Unseen items should have very low count
        let unseen = sk.query("completely_unseen_item_xyz");
        assert!(unseen < 5, "unseen item count too high: {}", unseen);
    }

    #[test]
    fn test_batch_update() {
        let mut sk = CountMinSketch::new(100, 4);
        sk.update("key", 100);
        assert!(sk.query("key") >= 100);
    }

    #[test]
    fn test_frequency_tracker() {
        let mut tracker = FrequencyTracker::new(200, 4);
        for _ in 0..900 {
            tracker.add("hot");
        }
        for _ in 0..100 {
            tracker.add("cold");
        }

        assert_eq!(tracker.total(), 1000);
        let hot_freq = tracker.estimate_frequency("hot");
        assert!(
            hot_freq >= 0.9,
            "hot frequency {:.3} should be >= 0.9",
            hot_freq
        );
        let cold_freq = tracker.estimate_frequency("cold");
        assert!(
            cold_freq >= 0.1,
            "cold frequency {:.3} should be >= 0.1",
            cold_freq
        );
    }

    #[test]
    fn test_multiple_deltas() {
        let mut sk = CountMinSketch::new(100, 4);
        sk.update("x", 50);
        sk.update("x", 50);
        assert!(sk.query("x") >= 100);
    }
}

(* 978: Count-Min Sketch *)
(* Frequency estimation with O(1) update/query, O(d*w) space *)
(* Uses d hash functions × w counters; estimates freq with min over d rows *)

(* Simple hash functions using different seeds *)
let make_hash seed =
  fun s ->
    String.fold_left (fun h c ->
      h * seed lxor Char.code c
    ) seed s

(* Create d different hash functions *)
let make_hashes d =
  let seeds = [| 31; 37; 41; 43; 47; 53; 59; 61 |] in
  Array.init d (fun i -> make_hash seeds.(i mod Array.length seeds))

type sketch = {
  table: int array array;  (* d rows × w columns *)
  hashes: (string -> int) array;
  width: int;
  depth: int;
}

let create ~width ~depth =
  { table = Array.make_matrix depth width 0;
    hashes = make_hashes depth;
    width;
    depth }

let update sk key delta =
  for i = 0 to sk.depth - 1 do
    let col = abs (sk.hashes.(i) key) mod sk.width in
    sk.table.(i).(col) <- sk.table.(i).(col) + delta
  done

let query sk key =
  let min_count = ref max_int in
  for i = 0 to sk.depth - 1 do
    let col = abs (sk.hashes.(i) key) mod sk.width in
    let v = sk.table.(i).(col) in
    if v < !min_count then min_count := v
  done;
  !min_count

let () =
  let sk = create ~width:100 ~depth:5 in

  (* Update with counts *)
  for _ = 1 to 100 do update sk "apple" 1 done;
  for _ = 1 to 50  do update sk "banana" 1 done;
  for _ = 1 to 25  do update sk "cherry" 1 done;

  (* Estimates are >= actual (may overestimate due to hash collisions) *)
  let apple_est = query sk "apple" in
  let banana_est = query sk "banana" in
  let cherry_est = query sk "cherry" in

  assert (apple_est >= 100);
  assert (banana_est >= 50);
  assert (cherry_est >= 25);

  (* Never underestimate (count-min property) *)
  assert (apple_est >= 100);

  (* Unseen item estimate should be close to 0 *)
  let unseen = query sk "dragon" in
  assert (unseen < 10);  (* should be very low for 100 wide, 5 deep *)

  Printf.printf "apple:  %d (actual 100)\n" apple_est;
  Printf.printf "banana: %d (actual 50)\n" banana_est;
  Printf.printf "cherry: %d (actual 25)\n" cherry_est;
  Printf.printf "dragon: %d (actual 0)\n" unseen;
  Printf.printf "✓ All tests passed\n"

Key Differences

Aspect	Rust	OCaml
Hash overflow	`wrapping_mul/add` — defined behavior	`Int64` wrapping or `land max_int`
Counter type	`u64` — no overflow for most counts	`int` — 63-bit; risk for very large counts
`min` over rows	`.min().unwrap_or(0)`	`Array.fold_left min max_int ...`
Table allocation	`vec![vec![0u64; width]; depth]`	`Array.init depth (fun _ -> Array.make ...)`

Count-Min Sketches are used in network traffic analysis, database query optimization, and streaming analytics. They trade exactness for bounded space — ideal when the key space is too large for a full frequency map.

OCaml Approach

let hash seed s =
  String.fold_left (fun h b ->
    Int64.(to_int (logxor
      (add (mul (of_int h) (of_int seed)) (of_int (Char.code b)))
      (of_int (Char.code b))))
  ) (Int64.to_int seed) s

type t = {
  table: int array array;
  seeds: int array;
  width: int;
  depth: int;
}

let create width depth =
  let seeds = [|31;37;41;43;47;53;59;61;67;71|] in
  { table = Array.init depth (fun _ -> Array.make width 0);
    seeds = Array.init depth (fun i -> seeds.(i mod 10));
    width; depth }

let update cms key delta =
  Array.iteri (fun i seed ->
    let col = (abs (hash seed key)) mod cms.width in
    cms.table.(i).(col) <- cms.table.(i).(col) + delta
  ) cms.seeds

let query cms key =
  Array.fold_left (fun acc (i, seed) ->
    min acc cms.table.(i).((abs (hash seed key)) mod cms.width)
  ) max_int (Array.mapi (fun i s -> (i, s)) cms.seeds)

OCaml's Int64 arithmetic for hashing avoids the GC tag issue with overflow. The sketch structure is identical; only the hash implementation differs in verbosity.

Full Source

#![allow(clippy::all)]
// 978: Count-Min Sketch
// Frequency estimation: O(1) update/query, O(depth × width) space
// Uses d hash functions × w counters; estimate = min over d rows

// Approach 1: Count-Min Sketch with multiple hash seeds
fn hash(seed: u64, s: &str) -> u64 {
    s.bytes().fold(seed, |h, b| {
        h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64
    })
}

pub struct CountMinSketch {
    table: Vec<Vec<u64>>, // depth rows × width cols
    seeds: Vec<u64>,
    width: usize,
    depth: usize,
}

impl CountMinSketch {
    pub fn new(width: usize, depth: usize) -> Self {
        let seeds = vec![31, 37, 41, 43, 47, 53, 59, 61, 67, 71];
        let depth_seeds: Vec<u64> = (0..depth).map(|i| seeds[i % seeds.len()]).collect();
        CountMinSketch {
            table: vec![vec![0u64; width]; depth],
            seeds: depth_seeds,
            width,
            depth,
        }
    }

    fn column(&self, row: usize, key: &str) -> usize {
        (hash(self.seeds[row], key) as usize) % self.width
    }

    /// Increment count for key by delta
    pub fn update(&mut self, key: &str, delta: u64) {
        for i in 0..self.depth {
            let col = self.column(i, key);
            self.table[i][col] += delta;
        }
    }

    /// Estimate frequency: minimum over all rows
    pub fn query(&self, key: &str) -> u64 {
        (0..self.depth)
            .map(|i| self.table[i][self.column(i, key)])
            .min()
            .unwrap_or(0)
    }

    /// Total events tracked (sum of row 0, approximate)
    pub fn total(&self) -> u64 {
        self.table[0].iter().sum()
    }
}

// Approach 2: Heavy Hitter tracking with Count-Min
pub struct FrequencyTracker {
    sketch: CountMinSketch,
    total_events: u64,
}

impl FrequencyTracker {
    pub fn new(width: usize, depth: usize) -> Self {
        FrequencyTracker {
            sketch: CountMinSketch::new(width, depth),
            total_events: 0,
        }
    }

    pub fn add(&mut self, key: &str) {
        self.sketch.update(key, 1);
        self.total_events += 1;
    }

    pub fn estimate_count(&self, key: &str) -> u64 {
        self.sketch.query(key)
    }

    pub fn estimate_frequency(&self, key: &str) -> f64 {
        if self.total_events == 0 {
            return 0.0;
        }
        self.estimate_count(key) as f64 / self.total_events as f64
    }

    pub fn total(&self) -> u64 {
        self.total_events
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_no_underestimate() {
        let mut sk = CountMinSketch::new(100, 5);
        for _ in 0..100 {
            sk.update("apple", 1);
        }
        for _ in 0..50 {
            sk.update("banana", 1);
        }
        for _ in 0..25 {
            sk.update("cherry", 1);
        }

        // Count-Min never underestimates
        assert!(sk.query("apple") >= 100);
        assert!(sk.query("banana") >= 50);
        assert!(sk.query("cherry") >= 25);
    }

    #[test]
    fn test_unseen_items_low() {
        let mut sk = CountMinSketch::new(1000, 5);
        for i in 0..100 {
            sk.update(&format!("item_{}", i), 1);
        }

        // Unseen items should have very low count
        let unseen = sk.query("completely_unseen_item_xyz");
        assert!(unseen < 5, "unseen item count too high: {}", unseen);
    }

    #[test]
    fn test_batch_update() {
        let mut sk = CountMinSketch::new(100, 4);
        sk.update("key", 100);
        assert!(sk.query("key") >= 100);
    }

    #[test]
    fn test_frequency_tracker() {
        let mut tracker = FrequencyTracker::new(200, 4);
        for _ in 0..900 {
            tracker.add("hot");
        }
        for _ in 0..100 {
            tracker.add("cold");
        }

        assert_eq!(tracker.total(), 1000);
        let hot_freq = tracker.estimate_frequency("hot");
        assert!(
            hot_freq >= 0.9,
            "hot frequency {:.3} should be >= 0.9",
            hot_freq
        );
        let cold_freq = tracker.estimate_frequency("cold");
        assert!(
            cold_freq >= 0.1,
            "cold frequency {:.3} should be >= 0.1",
            cold_freq
        );
    }

    #[test]
    fn test_multiple_deltas() {
        let mut sk = CountMinSketch::new(100, 4);
        sk.update("x", 50);
        sk.update("x", 50);
        assert!(sk.query("x") >= 100);
    }
}

(* 978: Count-Min Sketch *)
(* Frequency estimation with O(1) update/query, O(d*w) space *)
(* Uses d hash functions × w counters; estimates freq with min over d rows *)

(* Simple hash functions using different seeds *)
let make_hash seed =
  fun s ->
    String.fold_left (fun h c ->
      h * seed lxor Char.code c
    ) seed s

(* Create d different hash functions *)
let make_hashes d =
  let seeds = [| 31; 37; 41; 43; 47; 53; 59; 61 |] in
  Array.init d (fun i -> make_hash seeds.(i mod Array.length seeds))

type sketch = {
  table: int array array;  (* d rows × w columns *)
  hashes: (string -> int) array;
  width: int;
  depth: int;
}

let create ~width ~depth =
  { table = Array.make_matrix depth width 0;
    hashes = make_hashes depth;
    width;
    depth }

let update sk key delta =
  for i = 0 to sk.depth - 1 do
    let col = abs (sk.hashes.(i) key) mod sk.width in
    sk.table.(i).(col) <- sk.table.(i).(col) + delta
  done

let query sk key =
  let min_count = ref max_int in
  for i = 0 to sk.depth - 1 do
    let col = abs (sk.hashes.(i) key) mod sk.width in
    let v = sk.table.(i).(col) in
    if v < !min_count then min_count := v
  done;
  !min_count

let () =
  let sk = create ~width:100 ~depth:5 in

  (* Update with counts *)
  for _ = 1 to 100 do update sk "apple" 1 done;
  for _ = 1 to 50  do update sk "banana" 1 done;
  for _ = 1 to 25  do update sk "cherry" 1 done;

  (* Estimates are >= actual (may overestimate due to hash collisions) *)
  let apple_est = query sk "apple" in
  let banana_est = query sk "banana" in
  let cherry_est = query sk "cherry" in

  assert (apple_est >= 100);
  assert (banana_est >= 50);
  assert (cherry_est >= 25);

  (* Never underestimate (count-min property) *)
  assert (apple_est >= 100);

  (* Unseen item estimate should be close to 0 *)
  let unseen = query sk "dragon" in
  assert (unseen < 10);  (* should be very low for 100 wide, 5 deep *)

  Printf.printf "apple:  %d (actual 100)\n" apple_est;
  Printf.printf "banana: %d (actual 50)\n" banana_est;
  Printf.printf "cherry: %d (actual 25)\n" cherry_est;
  Printf.printf "dragon: %d (actual 0)\n" unseen;
  Printf.printf "✓ All tests passed\n"

✓ Tests Rust test suite

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_no_underestimate() {
        let mut sk = CountMinSketch::new(100, 5);
        for _ in 0..100 {
            sk.update("apple", 1);
        }
        for _ in 0..50 {
            sk.update("banana", 1);
        }
        for _ in 0..25 {
            sk.update("cherry", 1);
        }

        // Count-Min never underestimates
        assert!(sk.query("apple") >= 100);
        assert!(sk.query("banana") >= 50);
        assert!(sk.query("cherry") >= 25);
    }

    #[test]
    fn test_unseen_items_low() {
        let mut sk = CountMinSketch::new(1000, 5);
        for i in 0..100 {
            sk.update(&format!("item_{}", i), 1);
        }

        // Unseen items should have very low count
        let unseen = sk.query("completely_unseen_item_xyz");
        assert!(unseen < 5, "unseen item count too high: {}", unseen);
    }

    #[test]
    fn test_batch_update() {
        let mut sk = CountMinSketch::new(100, 4);
        sk.update("key", 100);
        assert!(sk.query("key") >= 100);
    }

    #[test]
    fn test_frequency_tracker() {
        let mut tracker = FrequencyTracker::new(200, 4);
        for _ in 0..900 {
            tracker.add("hot");
        }
        for _ in 0..100 {
            tracker.add("cold");
        }

        assert_eq!(tracker.total(), 1000);
        let hot_freq = tracker.estimate_frequency("hot");
        assert!(
            hot_freq >= 0.9,
            "hot frequency {:.3} should be >= 0.9",
            hot_freq
        );
        let cold_freq = tracker.estimate_frequency("cold");
        assert!(
            cold_freq >= 0.1,
            "cold frequency {:.3} should be >= 0.1",
            cold_freq
        );
    }

    #[test]
    fn test_multiple_deltas() {
        let mut sk = CountMinSketch::new(100, 4);
        sk.update("x", 50);
        sk.update("x", 50);
        assert!(sk.query("x") >= 100);
    }
}

Deep Comparison

Count-Min Sketch — Comparison

Core Insight

A Count-Min Sketch maintains a depth × width array of counters. update(key, delta) increments one counter per row (using d different hash functions). query(key) returns the minimum counter across all rows. The minimum gives the best estimate — other counters may be inflated by collisions. Guaranteed to never underestimate (hence "count-min"). Uses O(d × w) space for any number of distinct keys.

OCaml Approach

• Array.make_matrix depth width 0 — 2D counter array

• make_hash seed returns a closure capturing the seed (higher-order function)

• make_hashes d creates an array of d hash functions

• String.fold_left (fun h c -> h * seed lxor Char.code c) seed s — polynomial hash

• for i = 0 to sk.depth - 1 loop for update/query

• min_count := min !min_count v — track minimum across rows

Rust Approach

• Vec<Vec<u64>> — 2D counter array (depth rows, width columns)

• seeds: Vec<u64> — seed values for each row's hash

• s.bytes().fold(seed, |h, b| h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64)

• wrapping_mul/wrapping_add — explicit overflow handling (OCaml wraps silently)

• (0..self.depth).map(...).min().unwrap_or(0) — functional min over rows

• FrequencyTracker wraps sketch + total events for frequency estimation

Comparison Table

Aspect	OCaml	Rust
Table	`int array array`	`Vec<Vec<u64>>`
Hash fn	Closure `fun s -> ...`	`fn hash(seed, key) -> u64`
Hash accumulate	`String.fold_left (fun h c -> h * seed lxor code c) seed s`	`s.bytes().fold(seed, \\|h,b\\| wrapping_mul.wrapping_add^b)`
Overflow	Silent wrap	`wrapping_mul`, `wrapping_add`
Min query	`for` loop + `ref min_count`	`.map().min().unwrap_or(0)`
Multiple hashes	`(string -> int) array`	`seeds: Vec<u64>`
Counter type	`int` (63-bit)	`u64` (64-bit)
Space	O(d × w × 8 bytes)	O(d × w × 8 bytes)
Error bound	`ε = e/w`, `δ = e^(-d)`	Same theoretical guarantee

Exercises

Implement merge(other: &CountMinSketch) that adds two sketches of equal dimensions (element-wise sum).

Add decay(&mut self, factor: f64) that multiplies all counters by factor for time-decayed frequency estimation.

Verify the error bound experimentally: insert 10,000 items, query all of them, measure maximum overcount.

Implement a heavy hitter detector: items whose estimated frequency exceeds N/k are "heavy hitters".

Compare memory usage of CountMinSketch vs HashMap<String, u64> for 100,000 distinct keys.

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.

Rust