ExamplesBy LevelBy TopicLearning Paths
978 Fundamental

978 Count Min Sketch

Functional Programming

Tutorial

The Problem

Implement a Count-Min Sketch — a probabilistic frequency estimation structure that uses O(depth × width) space independent of the number of distinct keys. Insert items and query frequency in O(depth) time. The frequency estimate equals the minimum counter across all rows, providing an upper bound with tunable error probability.

🎯 Learning Outcomes

  • • Implement a depth × width counter table: Vec<Vec<u64>>
  • • Define depth independent hash functions by parameterizing with different seeds
  • • Implement update(key, delta) that increments one column per row
  • • Implement query(key) -> u64 that returns min over all row counters — always >= true frequency
  • • Understand the error bound: with width w = ceil(e/ε) and depth d = ceil(ln(1/δ)), the estimate exceeds true count by more than ε * N with probability at most δ
  • Code Example

    #![allow(clippy::all)]
    // 978: Count-Min Sketch
    // Frequency estimation: O(1) update/query, O(depth × width) space
    // Uses d hash functions × w counters; estimate = min over d rows
    
    // Approach 1: Count-Min Sketch with multiple hash seeds
    fn hash(seed: u64, s: &str) -> u64 {
        s.bytes().fold(seed, |h, b| {
            h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64
        })
    }
    
    pub struct CountMinSketch {
        table: Vec<Vec<u64>>, // depth rows × width cols
        seeds: Vec<u64>,
        width: usize,
        depth: usize,
    }
    
    impl CountMinSketch {
        pub fn new(width: usize, depth: usize) -> Self {
            let seeds = vec![31, 37, 41, 43, 47, 53, 59, 61, 67, 71];
            let depth_seeds: Vec<u64> = (0..depth).map(|i| seeds[i % seeds.len()]).collect();
            CountMinSketch {
                table: vec![vec![0u64; width]; depth],
                seeds: depth_seeds,
                width,
                depth,
            }
        }
    
        fn column(&self, row: usize, key: &str) -> usize {
            (hash(self.seeds[row], key) as usize) % self.width
        }
    
        /// Increment count for key by delta
        pub fn update(&mut self, key: &str, delta: u64) {
            for i in 0..self.depth {
                let col = self.column(i, key);
                self.table[i][col] += delta;
            }
        }
    
        /// Estimate frequency: minimum over all rows
        pub fn query(&self, key: &str) -> u64 {
            (0..self.depth)
                .map(|i| self.table[i][self.column(i, key)])
                .min()
                .unwrap_or(0)
        }
    
        /// Total events tracked (sum of row 0, approximate)
        pub fn total(&self) -> u64 {
            self.table[0].iter().sum()
        }
    }
    
    // Approach 2: Heavy Hitter tracking with Count-Min
    pub struct FrequencyTracker {
        sketch: CountMinSketch,
        total_events: u64,
    }
    
    impl FrequencyTracker {
        pub fn new(width: usize, depth: usize) -> Self {
            FrequencyTracker {
                sketch: CountMinSketch::new(width, depth),
                total_events: 0,
            }
        }
    
        pub fn add(&mut self, key: &str) {
            self.sketch.update(key, 1);
            self.total_events += 1;
        }
    
        pub fn estimate_count(&self, key: &str) -> u64 {
            self.sketch.query(key)
        }
    
        pub fn estimate_frequency(&self, key: &str) -> f64 {
            if self.total_events == 0 {
                return 0.0;
            }
            self.estimate_count(key) as f64 / self.total_events as f64
        }
    
        pub fn total(&self) -> u64 {
            self.total_events
        }
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_no_underestimate() {
            let mut sk = CountMinSketch::new(100, 5);
            for _ in 0..100 {
                sk.update("apple", 1);
            }
            for _ in 0..50 {
                sk.update("banana", 1);
            }
            for _ in 0..25 {
                sk.update("cherry", 1);
            }
    
            // Count-Min never underestimates
            assert!(sk.query("apple") >= 100);
            assert!(sk.query("banana") >= 50);
            assert!(sk.query("cherry") >= 25);
        }
    
        #[test]
        fn test_unseen_items_low() {
            let mut sk = CountMinSketch::new(1000, 5);
            for i in 0..100 {
                sk.update(&format!("item_{}", i), 1);
            }
    
            // Unseen items should have very low count
            let unseen = sk.query("completely_unseen_item_xyz");
            assert!(unseen < 5, "unseen item count too high: {}", unseen);
        }
    
        #[test]
        fn test_batch_update() {
            let mut sk = CountMinSketch::new(100, 4);
            sk.update("key", 100);
            assert!(sk.query("key") >= 100);
        }
    
        #[test]
        fn test_frequency_tracker() {
            let mut tracker = FrequencyTracker::new(200, 4);
            for _ in 0..900 {
                tracker.add("hot");
            }
            for _ in 0..100 {
                tracker.add("cold");
            }
    
            assert_eq!(tracker.total(), 1000);
            let hot_freq = tracker.estimate_frequency("hot");
            assert!(
                hot_freq >= 0.9,
                "hot frequency {:.3} should be >= 0.9",
                hot_freq
            );
            let cold_freq = tracker.estimate_frequency("cold");
            assert!(
                cold_freq >= 0.1,
                "cold frequency {:.3} should be >= 0.1",
                cold_freq
            );
        }
    
        #[test]
        fn test_multiple_deltas() {
            let mut sk = CountMinSketch::new(100, 4);
            sk.update("x", 50);
            sk.update("x", 50);
            assert!(sk.query("x") >= 100);
        }
    }

    Key Differences

    AspectRustOCaml
    Hash overflowwrapping_mul/add — defined behaviorInt64 wrapping or land max_int
    Counter typeu64 — no overflow for most countsint — 63-bit; risk for very large counts
    min over rows.min().unwrap_or(0)Array.fold_left min max_int ...
    Table allocationvec![vec![0u64; width]; depth]Array.init depth (fun _ -> Array.make ...)

    Count-Min Sketches are used in network traffic analysis, database query optimization, and streaming analytics. They trade exactness for bounded space — ideal when the key space is too large for a full frequency map.

    OCaml Approach

    let hash seed s =
      String.fold_left (fun h b ->
        Int64.(to_int (logxor
          (add (mul (of_int h) (of_int seed)) (of_int (Char.code b)))
          (of_int (Char.code b))))
      ) (Int64.to_int seed) s
    
    type t = {
      table: int array array;
      seeds: int array;
      width: int;
      depth: int;
    }
    
    let create width depth =
      let seeds = [|31;37;41;43;47;53;59;61;67;71|] in
      { table = Array.init depth (fun _ -> Array.make width 0);
        seeds = Array.init depth (fun i -> seeds.(i mod 10));
        width; depth }
    
    let update cms key delta =
      Array.iteri (fun i seed ->
        let col = (abs (hash seed key)) mod cms.width in
        cms.table.(i).(col) <- cms.table.(i).(col) + delta
      ) cms.seeds
    
    let query cms key =
      Array.fold_left (fun acc (i, seed) ->
        min acc cms.table.(i).((abs (hash seed key)) mod cms.width)
      ) max_int (Array.mapi (fun i s -> (i, s)) cms.seeds)
    

    OCaml's Int64 arithmetic for hashing avoids the GC tag issue with overflow. The sketch structure is identical; only the hash implementation differs in verbosity.

    Full Source

    #![allow(clippy::all)]
    // 978: Count-Min Sketch
    // Frequency estimation: O(1) update/query, O(depth × width) space
    // Uses d hash functions × w counters; estimate = min over d rows
    
    // Approach 1: Count-Min Sketch with multiple hash seeds
    fn hash(seed: u64, s: &str) -> u64 {
        s.bytes().fold(seed, |h, b| {
            h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64
        })
    }
    
    pub struct CountMinSketch {
        table: Vec<Vec<u64>>, // depth rows × width cols
        seeds: Vec<u64>,
        width: usize,
        depth: usize,
    }
    
    impl CountMinSketch {
        pub fn new(width: usize, depth: usize) -> Self {
            let seeds = vec![31, 37, 41, 43, 47, 53, 59, 61, 67, 71];
            let depth_seeds: Vec<u64> = (0..depth).map(|i| seeds[i % seeds.len()]).collect();
            CountMinSketch {
                table: vec![vec![0u64; width]; depth],
                seeds: depth_seeds,
                width,
                depth,
            }
        }
    
        fn column(&self, row: usize, key: &str) -> usize {
            (hash(self.seeds[row], key) as usize) % self.width
        }
    
        /// Increment count for key by delta
        pub fn update(&mut self, key: &str, delta: u64) {
            for i in 0..self.depth {
                let col = self.column(i, key);
                self.table[i][col] += delta;
            }
        }
    
        /// Estimate frequency: minimum over all rows
        pub fn query(&self, key: &str) -> u64 {
            (0..self.depth)
                .map(|i| self.table[i][self.column(i, key)])
                .min()
                .unwrap_or(0)
        }
    
        /// Total events tracked (sum of row 0, approximate)
        pub fn total(&self) -> u64 {
            self.table[0].iter().sum()
        }
    }
    
    // Approach 2: Heavy Hitter tracking with Count-Min
    pub struct FrequencyTracker {
        sketch: CountMinSketch,
        total_events: u64,
    }
    
    impl FrequencyTracker {
        pub fn new(width: usize, depth: usize) -> Self {
            FrequencyTracker {
                sketch: CountMinSketch::new(width, depth),
                total_events: 0,
            }
        }
    
        pub fn add(&mut self, key: &str) {
            self.sketch.update(key, 1);
            self.total_events += 1;
        }
    
        pub fn estimate_count(&self, key: &str) -> u64 {
            self.sketch.query(key)
        }
    
        pub fn estimate_frequency(&self, key: &str) -> f64 {
            if self.total_events == 0 {
                return 0.0;
            }
            self.estimate_count(key) as f64 / self.total_events as f64
        }
    
        pub fn total(&self) -> u64 {
            self.total_events
        }
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_no_underestimate() {
            let mut sk = CountMinSketch::new(100, 5);
            for _ in 0..100 {
                sk.update("apple", 1);
            }
            for _ in 0..50 {
                sk.update("banana", 1);
            }
            for _ in 0..25 {
                sk.update("cherry", 1);
            }
    
            // Count-Min never underestimates
            assert!(sk.query("apple") >= 100);
            assert!(sk.query("banana") >= 50);
            assert!(sk.query("cherry") >= 25);
        }
    
        #[test]
        fn test_unseen_items_low() {
            let mut sk = CountMinSketch::new(1000, 5);
            for i in 0..100 {
                sk.update(&format!("item_{}", i), 1);
            }
    
            // Unseen items should have very low count
            let unseen = sk.query("completely_unseen_item_xyz");
            assert!(unseen < 5, "unseen item count too high: {}", unseen);
        }
    
        #[test]
        fn test_batch_update() {
            let mut sk = CountMinSketch::new(100, 4);
            sk.update("key", 100);
            assert!(sk.query("key") >= 100);
        }
    
        #[test]
        fn test_frequency_tracker() {
            let mut tracker = FrequencyTracker::new(200, 4);
            for _ in 0..900 {
                tracker.add("hot");
            }
            for _ in 0..100 {
                tracker.add("cold");
            }
    
            assert_eq!(tracker.total(), 1000);
            let hot_freq = tracker.estimate_frequency("hot");
            assert!(
                hot_freq >= 0.9,
                "hot frequency {:.3} should be >= 0.9",
                hot_freq
            );
            let cold_freq = tracker.estimate_frequency("cold");
            assert!(
                cold_freq >= 0.1,
                "cold frequency {:.3} should be >= 0.1",
                cold_freq
            );
        }
    
        #[test]
        fn test_multiple_deltas() {
            let mut sk = CountMinSketch::new(100, 4);
            sk.update("x", 50);
            sk.update("x", 50);
            assert!(sk.query("x") >= 100);
        }
    }
    ✓ Tests Rust test suite
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_no_underestimate() {
            let mut sk = CountMinSketch::new(100, 5);
            for _ in 0..100 {
                sk.update("apple", 1);
            }
            for _ in 0..50 {
                sk.update("banana", 1);
            }
            for _ in 0..25 {
                sk.update("cherry", 1);
            }
    
            // Count-Min never underestimates
            assert!(sk.query("apple") >= 100);
            assert!(sk.query("banana") >= 50);
            assert!(sk.query("cherry") >= 25);
        }
    
        #[test]
        fn test_unseen_items_low() {
            let mut sk = CountMinSketch::new(1000, 5);
            for i in 0..100 {
                sk.update(&format!("item_{}", i), 1);
            }
    
            // Unseen items should have very low count
            let unseen = sk.query("completely_unseen_item_xyz");
            assert!(unseen < 5, "unseen item count too high: {}", unseen);
        }
    
        #[test]
        fn test_batch_update() {
            let mut sk = CountMinSketch::new(100, 4);
            sk.update("key", 100);
            assert!(sk.query("key") >= 100);
        }
    
        #[test]
        fn test_frequency_tracker() {
            let mut tracker = FrequencyTracker::new(200, 4);
            for _ in 0..900 {
                tracker.add("hot");
            }
            for _ in 0..100 {
                tracker.add("cold");
            }
    
            assert_eq!(tracker.total(), 1000);
            let hot_freq = tracker.estimate_frequency("hot");
            assert!(
                hot_freq >= 0.9,
                "hot frequency {:.3} should be >= 0.9",
                hot_freq
            );
            let cold_freq = tracker.estimate_frequency("cold");
            assert!(
                cold_freq >= 0.1,
                "cold frequency {:.3} should be >= 0.1",
                cold_freq
            );
        }
    
        #[test]
        fn test_multiple_deltas() {
            let mut sk = CountMinSketch::new(100, 4);
            sk.update("x", 50);
            sk.update("x", 50);
            assert!(sk.query("x") >= 100);
        }
    }

    Deep Comparison

    Count-Min Sketch — Comparison

    Core Insight

    A Count-Min Sketch maintains a depth × width array of counters. update(key, delta) increments one counter per row (using d different hash functions). query(key) returns the minimum counter across all rows. The minimum gives the best estimate — other counters may be inflated by collisions. Guaranteed to never underestimate (hence "count-min"). Uses O(d × w) space for any number of distinct keys.

    OCaml Approach

  • Array.make_matrix depth width 0 — 2D counter array
  • make_hash seed returns a closure capturing the seed (higher-order function)
  • make_hashes d creates an array of d hash functions
  • String.fold_left (fun h c -> h * seed lxor Char.code c) seed s — polynomial hash
  • for i = 0 to sk.depth - 1 loop for update/query
  • min_count := min !min_count v — track minimum across rows
  • Rust Approach

  • Vec<Vec<u64>> — 2D counter array (depth rows, width columns)
  • seeds: Vec<u64> — seed values for each row's hash
  • s.bytes().fold(seed, |h, b| h.wrapping_mul(seed).wrapping_add(b as u64) ^ b as u64)
  • wrapping_mul/wrapping_add — explicit overflow handling (OCaml wraps silently)
  • (0..self.depth).map(...).min().unwrap_or(0) — functional min over rows
  • FrequencyTracker wraps sketch + total events for frequency estimation
  • Comparison Table

    AspectOCamlRust
    Tableint array arrayVec<Vec<u64>>
    Hash fnClosure fun s -> ...fn hash(seed, key) -> u64
    Hash accumulateString.fold_left (fun h c -> h * seed lxor code c) seed ss.bytes().fold(seed, \|h,b\| wrapping_mul.wrapping_add^b)
    OverflowSilent wrapwrapping_mul, wrapping_add
    Min queryfor loop + ref min_count.map().min().unwrap_or(0)
    Multiple hashes(string -> int) arrayseeds: Vec<u64>
    Counter typeint (63-bit)u64 (64-bit)
    SpaceO(d × w × 8 bytes)O(d × w × 8 bytes)
    Error boundε = e/w, δ = e^(-d)Same theoretical guarantee

    Exercises

  • Implement merge(other: &CountMinSketch) that adds two sketches of equal dimensions (element-wise sum).
  • Add decay(&mut self, factor: f64) that multiplies all counters by factor for time-decayed frequency estimation.
  • Verify the error bound experimentally: insert 10,000 items, query all of them, measure maximum overcount.
  • Implement a heavy hitter detector: items whose estimated frequency exceeds N/k are "heavy hitters".
  • Compare memory usage of CountMinSketch vs HashMap<String, u64> for 100,000 distinct keys.
  • Open Source Repos