376 Advanced

376: Bloom Filter

Functional Programming

Tutorial Video

Text description (accessibility)

This video demonstrates the "376: Bloom Filter" functional Rust example. Difficulty level: Advanced. Key concepts covered: Functional Programming. Checking exact membership in a large set requires storing all elements, which is expensive when the set contains billions of items (URLs, email addresses, IP addresses). Key difference from OCaml: 1. **Bit manipulation**: Rust uses `Vec<u64>` with explicit bitwise indexing (`i / 64`, `1u64 << (i % 64)`); OCaml would use `Bytes` with `Bytes.get_uint8`/`lor`/`land` operations.

Tutorial

The Problem

Checking exact membership in a large set requires storing all elements, which is expensive when the set contains billions of items (URLs, email addresses, IP addresses). Burton Howard Bloom introduced a probabilistic approach in 1970: use a bit array and multiple hash functions to answer "is this element in the set?" with guaranteed no false negatives but a tunable false positive rate. A Bloom filter can represent 1 billion URLs using only ~1.2 GB at 1% false positive rate — orders of magnitude less than a hash set.

Bloom filters appear in database join optimization (BigTable, Cassandra use them to avoid disk reads), network packet routing, malware URL databases (Chrome's Safe Browsing), spell checkers, and distributed cache pre-screening.

🎯 Learning Outcomes

• Understand the mathematical relationship between bit array size, hash function count, and false positive rate

• Learn how to implement multiple independent hash functions from a single hash function using seeding

• Understand the no-false-negative guarantee and why deletion is not supported in basic Bloom filters

• See how bit manipulation with u64 arrays efficiently represents the bit array in Rust

• Understand how to compute optimal parameters from desired capacity and false positive rate

Code Example

#![allow(clippy::all)]
//! Bloom Filter
//!
//! Probabilistic data structure for set membership with no false negatives.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A Bloom filter for probabilistic set membership testing
pub struct BloomFilter {
    bits: Vec<u64>,
    m: usize,
    k: usize,
    count: usize,
}

impl BloomFilter {
    /// Create a new Bloom filter for given capacity and false positive rate
    pub fn new(capacity: usize, fp_rate: f64) -> Self {
        let m = (-(capacity as f64 * fp_rate.ln()) / (2f64.ln().powi(2))).ceil() as usize;
        let k = ((m as f64 / capacity as f64) * 2f64.ln()).ceil() as usize;
        let m = m.max(64);
        let k = k.max(1);
        Self {
            bits: vec![0u64; m.div_ceil(64)],
            m,
            k,
            count: 0,
        }
    }

    fn hash_val<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() as usize) % self.m
    }

    fn set_bit(&mut self, i: usize) {
        self.bits[i / 64] |= 1u64 << (i % 64);
    }

    fn get_bit(&self, i: usize) -> bool {
        (self.bits[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Insert an item
    pub fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..self.k as u64 {
            let idx = self.hash_val(item, seed);
            self.set_bit(idx);
        }
        self.count += 1;
    }

    /// Check if item might be in the set (may have false positives)
    pub fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.k as u64).all(|seed| self.get_bit(self.hash_val(item, seed)))
    }

    /// Get number of items inserted
    pub fn len(&self) -> usize {
        self.count
    }

    /// Check if empty
    pub fn is_empty(&self) -> bool {
        self.count == 0
    }

    /// Get number of bits
    pub fn bits(&self) -> usize {
        self.m
    }

    /// Get number of hash functions
    pub fn hash_count(&self) -> usize {
        self.k
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_insert_contains() {
        let mut bf = BloomFilter::new(100, 0.01);
        bf.insert(&"hello");
        bf.insert(&"world");
        assert!(bf.contains(&"hello"));
        assert!(bf.contains(&"world"));
    }

    #[test]
    fn test_not_contains() {
        let bf = BloomFilter::new(100, 0.01);
        // Empty filter should not contain anything
        assert!(!bf.contains(&"missing"));
    }

    #[test]
    fn test_false_positive_rate() {
        let mut bf = BloomFilter::new(1000, 0.01);
        for i in 0..500 {
            bf.insert(&i);
        }
        // Check false positives for items not inserted
        let mut fp_count = 0;
        for i in 1000..2000 {
            if bf.contains(&i) {
                fp_count += 1;
            }
        }
        // Should be roughly 1% = 10 false positives out of 1000
        assert!(fp_count < 50); // allow some variance
    }

    #[test]
    fn test_len() {
        let mut bf = BloomFilter::new(100, 0.01);
        assert_eq!(bf.len(), 0);
        bf.insert(&1);
        bf.insert(&2);
        assert_eq!(bf.len(), 2);
    }

    #[test]
    fn test_parameters() {
        let bf = BloomFilter::new(1000, 0.01);
        assert!(bf.bits() > 0);
        assert!(bf.hash_count() > 0);
    }
}

(* OCaml: bloom filter with bitarray *)

let m = 1024  (* bits *)
let bits = Bytes.make (m/8) '\000'

let set_bit i = Bytes.set bits (i/8) (Char.chr (Char.code (Bytes.get bits (i/8)) lor (1 lsl (i mod 8))))
let get_bit i = (Char.code (Bytes.get bits (i/8)) lsr (i mod 8)) land 1 = 1

let hash1 s = (Hashtbl.hash s) mod m
let hash2 s = (Hashtbl.hash (s ^ "salt")) mod m
let hash3 s = (Hashtbl.hash ("x" ^ s)) mod m

let add s = set_bit (hash1 s); set_bit (hash2 s); set_bit (hash3 s)
let might_contain s = get_bit (hash1 s) && get_bit (hash2 s) && get_bit (hash3 s)

let () =
  List.iter add ["alice";"bob";"charlie"];
  Printf.printf "alice: %b (should be true)\n" (might_contain "alice");
  Printf.printf "dave: %b (should be false, maybe fp)\n" (might_contain "dave")

Key Differences

Bit manipulation: Rust uses Vec<u64> with explicit bitwise indexing (i / 64, 1u64 << (i % 64)); OCaml would use Bytes with Bytes.get_uint8/lor/land operations.

Hash seeding: Rust seeds DefaultHasher with an integer via the Hash trait; OCaml uses Hashtbl.seeded_hash which accepts a seed directly.

Generic bounds: Rust requires T: Hash at compile time; OCaml uses polymorphic hashing (Hashtbl.hash : 'a -> int) without trait bounds.

Memory layout: Rust's Vec<u64> is contiguous stack-friendly memory; OCaml's Bytes is a heap-allocated boxed value with GC overhead.

OCaml Approach

In OCaml, a Bloom filter would use a Bytes.t or Bigarray for the bit array and the standard Hashtbl hashing infrastructure. OCaml's Hashtbl.hash is seeded via Hashtbl.seeded_hash. The functional style would express the k-hash check as List.for_all or a fold over hash seeds, keeping the core logic declarative.

Full Source

#![allow(clippy::all)]
//! Bloom Filter
//!
//! Probabilistic data structure for set membership with no false negatives.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A Bloom filter for probabilistic set membership testing
pub struct BloomFilter {
    bits: Vec<u64>,
    m: usize,
    k: usize,
    count: usize,
}

impl BloomFilter {
    /// Create a new Bloom filter for given capacity and false positive rate
    pub fn new(capacity: usize, fp_rate: f64) -> Self {
        let m = (-(capacity as f64 * fp_rate.ln()) / (2f64.ln().powi(2))).ceil() as usize;
        let k = ((m as f64 / capacity as f64) * 2f64.ln()).ceil() as usize;
        let m = m.max(64);
        let k = k.max(1);
        Self {
            bits: vec![0u64; m.div_ceil(64)],
            m,
            k,
            count: 0,
        }
    }

    fn hash_val<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() as usize) % self.m
    }

    fn set_bit(&mut self, i: usize) {
        self.bits[i / 64] |= 1u64 << (i % 64);
    }

    fn get_bit(&self, i: usize) -> bool {
        (self.bits[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Insert an item
    pub fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..self.k as u64 {
            let idx = self.hash_val(item, seed);
            self.set_bit(idx);
        }
        self.count += 1;
    }

    /// Check if item might be in the set (may have false positives)
    pub fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.k as u64).all(|seed| self.get_bit(self.hash_val(item, seed)))
    }

    /// Get number of items inserted
    pub fn len(&self) -> usize {
        self.count
    }

    /// Check if empty
    pub fn is_empty(&self) -> bool {
        self.count == 0
    }

    /// Get number of bits
    pub fn bits(&self) -> usize {
        self.m
    }

    /// Get number of hash functions
    pub fn hash_count(&self) -> usize {
        self.k
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_insert_contains() {
        let mut bf = BloomFilter::new(100, 0.01);
        bf.insert(&"hello");
        bf.insert(&"world");
        assert!(bf.contains(&"hello"));
        assert!(bf.contains(&"world"));
    }

    #[test]
    fn test_not_contains() {
        let bf = BloomFilter::new(100, 0.01);
        // Empty filter should not contain anything
        assert!(!bf.contains(&"missing"));
    }

    #[test]
    fn test_false_positive_rate() {
        let mut bf = BloomFilter::new(1000, 0.01);
        for i in 0..500 {
            bf.insert(&i);
        }
        // Check false positives for items not inserted
        let mut fp_count = 0;
        for i in 1000..2000 {
            if bf.contains(&i) {
                fp_count += 1;
            }
        }
        // Should be roughly 1% = 10 false positives out of 1000
        assert!(fp_count < 50); // allow some variance
    }

    #[test]
    fn test_len() {
        let mut bf = BloomFilter::new(100, 0.01);
        assert_eq!(bf.len(), 0);
        bf.insert(&1);
        bf.insert(&2);
        assert_eq!(bf.len(), 2);
    }

    #[test]
    fn test_parameters() {
        let bf = BloomFilter::new(1000, 0.01);
        assert!(bf.bits() > 0);
        assert!(bf.hash_count() > 0);
    }
}

(* OCaml: bloom filter with bitarray *)

let m = 1024  (* bits *)
let bits = Bytes.make (m/8) '\000'

let set_bit i = Bytes.set bits (i/8) (Char.chr (Char.code (Bytes.get bits (i/8)) lor (1 lsl (i mod 8))))
let get_bit i = (Char.code (Bytes.get bits (i/8)) lsr (i mod 8)) land 1 = 1

let hash1 s = (Hashtbl.hash s) mod m
let hash2 s = (Hashtbl.hash (s ^ "salt")) mod m
let hash3 s = (Hashtbl.hash ("x" ^ s)) mod m

let add s = set_bit (hash1 s); set_bit (hash2 s); set_bit (hash3 s)
let might_contain s = get_bit (hash1 s) && get_bit (hash2 s) && get_bit (hash3 s)

let () =
  List.iter add ["alice";"bob";"charlie"];
  Printf.printf "alice: %b (should be true)\n" (might_contain "alice");
  Printf.printf "dave: %b (should be false, maybe fp)\n" (might_contain "dave")

✓ Tests Rust test suite

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_insert_contains() {
        let mut bf = BloomFilter::new(100, 0.01);
        bf.insert(&"hello");
        bf.insert(&"world");
        assert!(bf.contains(&"hello"));
        assert!(bf.contains(&"world"));
    }

    #[test]
    fn test_not_contains() {
        let bf = BloomFilter::new(100, 0.01);
        // Empty filter should not contain anything
        assert!(!bf.contains(&"missing"));
    }

    #[test]
    fn test_false_positive_rate() {
        let mut bf = BloomFilter::new(1000, 0.01);
        for i in 0..500 {
            bf.insert(&i);
        }
        // Check false positives for items not inserted
        let mut fp_count = 0;
        for i in 1000..2000 {
            if bf.contains(&i) {
                fp_count += 1;
            }
        }
        // Should be roughly 1% = 10 false positives out of 1000
        assert!(fp_count < 50); // allow some variance
    }

    #[test]
    fn test_len() {
        let mut bf = BloomFilter::new(100, 0.01);
        assert_eq!(bf.len(), 0);
        bf.insert(&1);
        bf.insert(&2);
        assert_eq!(bf.len(), 2);
    }

    #[test]
    fn test_parameters() {
        let bf = BloomFilter::new(1000, 0.01);
        assert!(bf.bits() > 0);
        assert!(bf.hash_count() > 0);
    }
}

Deep Comparison

OCaml vs Rust: Bloom Filter

Both use bit arrays with multiple hash functions.

Exercises

Counting Bloom filter: Add a counter array alongside the bit array so items can be deleted. Each insert increments counters; delete decrements them and clears bits when count reaches zero.

False positive measurement: Insert N random strings, then query N different random strings and count actual false positives. Plot how the rate changes with fp_rate parameter values of 0.01, 0.05, and 0.10.

Scalable Bloom filter: Implement a filter that grows by adding new sub-filters when the current one exceeds capacity, maintaining the overall false positive guarantee by tightening each sub-filter's target rate.

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.

Rust