376: Bloom Filter
Tutorial Video
Text description (accessibility)
This video demonstrates the "376: Bloom Filter" functional Rust example. Difficulty level: Advanced. Key concepts covered: Functional Programming. Checking exact membership in a large set requires storing all elements, which is expensive when the set contains billions of items (URLs, email addresses, IP addresses). Key difference from OCaml: 1. **Bit manipulation**: Rust uses `Vec<u64>` with explicit bitwise indexing (`i / 64`, `1u64 << (i % 64)`); OCaml would use `Bytes` with `Bytes.get_uint8`/`lor`/`land` operations.
Tutorial
The Problem
Checking exact membership in a large set requires storing all elements, which is expensive when the set contains billions of items (URLs, email addresses, IP addresses). Burton Howard Bloom introduced a probabilistic approach in 1970: use a bit array and multiple hash functions to answer "is this element in the set?" with guaranteed no false negatives but a tunable false positive rate. A Bloom filter can represent 1 billion URLs using only ~1.2 GB at 1% false positive rate — orders of magnitude less than a hash set.
Bloom filters appear in database join optimization (BigTable, Cassandra use them to avoid disk reads), network packet routing, malware URL databases (Chrome's Safe Browsing), spell checkers, and distributed cache pre-screening.
🎯 Learning Outcomes
u64 arrays efficiently represents the bit array in RustCode Example
#![allow(clippy::all)]
//! Bloom Filter
//!
//! Probabilistic data structure for set membership with no false negatives.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
/// A Bloom filter for probabilistic set membership testing
pub struct BloomFilter {
bits: Vec<u64>,
m: usize,
k: usize,
count: usize,
}
impl BloomFilter {
/// Create a new Bloom filter for given capacity and false positive rate
pub fn new(capacity: usize, fp_rate: f64) -> Self {
let m = (-(capacity as f64 * fp_rate.ln()) / (2f64.ln().powi(2))).ceil() as usize;
let k = ((m as f64 / capacity as f64) * 2f64.ln()).ceil() as usize;
let m = m.max(64);
let k = k.max(1);
Self {
bits: vec![0u64; m.div_ceil(64)],
m,
k,
count: 0,
}
}
fn hash_val<T: Hash>(&self, item: &T, seed: u64) -> usize {
let mut hasher = DefaultHasher::new();
seed.hash(&mut hasher);
item.hash(&mut hasher);
(hasher.finish() as usize) % self.m
}
fn set_bit(&mut self, i: usize) {
self.bits[i / 64] |= 1u64 << (i % 64);
}
fn get_bit(&self, i: usize) -> bool {
(self.bits[i / 64] >> (i % 64)) & 1 == 1
}
/// Insert an item
pub fn insert<T: Hash>(&mut self, item: &T) {
for seed in 0..self.k as u64 {
let idx = self.hash_val(item, seed);
self.set_bit(idx);
}
self.count += 1;
}
/// Check if item might be in the set (may have false positives)
pub fn contains<T: Hash>(&self, item: &T) -> bool {
(0..self.k as u64).all(|seed| self.get_bit(self.hash_val(item, seed)))
}
/// Get number of items inserted
pub fn len(&self) -> usize {
self.count
}
/// Check if empty
pub fn is_empty(&self) -> bool {
self.count == 0
}
/// Get number of bits
pub fn bits(&self) -> usize {
self.m
}
/// Get number of hash functions
pub fn hash_count(&self) -> usize {
self.k
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_insert_contains() {
let mut bf = BloomFilter::new(100, 0.01);
bf.insert(&"hello");
bf.insert(&"world");
assert!(bf.contains(&"hello"));
assert!(bf.contains(&"world"));
}
#[test]
fn test_not_contains() {
let bf = BloomFilter::new(100, 0.01);
// Empty filter should not contain anything
assert!(!bf.contains(&"missing"));
}
#[test]
fn test_false_positive_rate() {
let mut bf = BloomFilter::new(1000, 0.01);
for i in 0..500 {
bf.insert(&i);
}
// Check false positives for items not inserted
let mut fp_count = 0;
for i in 1000..2000 {
if bf.contains(&i) {
fp_count += 1;
}
}
// Should be roughly 1% = 10 false positives out of 1000
assert!(fp_count < 50); // allow some variance
}
#[test]
fn test_len() {
let mut bf = BloomFilter::new(100, 0.01);
assert_eq!(bf.len(), 0);
bf.insert(&1);
bf.insert(&2);
assert_eq!(bf.len(), 2);
}
#[test]
fn test_parameters() {
let bf = BloomFilter::new(1000, 0.01);
assert!(bf.bits() > 0);
assert!(bf.hash_count() > 0);
}
}Key Differences
Vec<u64> with explicit bitwise indexing (i / 64, 1u64 << (i % 64)); OCaml would use Bytes with Bytes.get_uint8/lor/land operations.DefaultHasher with an integer via the Hash trait; OCaml uses Hashtbl.seeded_hash which accepts a seed directly.T: Hash at compile time; OCaml uses polymorphic hashing (Hashtbl.hash : 'a -> int) without trait bounds.Vec<u64> is contiguous stack-friendly memory; OCaml's Bytes is a heap-allocated boxed value with GC overhead.OCaml Approach
In OCaml, a Bloom filter would use a Bytes.t or Bigarray for the bit array and the standard Hashtbl hashing infrastructure. OCaml's Hashtbl.hash is seeded via Hashtbl.seeded_hash. The functional style would express the k-hash check as List.for_all or a fold over hash seeds, keeping the core logic declarative.
Full Source
#![allow(clippy::all)]
//! Bloom Filter
//!
//! Probabilistic data structure for set membership with no false negatives.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
/// A Bloom filter for probabilistic set membership testing
pub struct BloomFilter {
bits: Vec<u64>,
m: usize,
k: usize,
count: usize,
}
impl BloomFilter {
/// Create a new Bloom filter for given capacity and false positive rate
pub fn new(capacity: usize, fp_rate: f64) -> Self {
let m = (-(capacity as f64 * fp_rate.ln()) / (2f64.ln().powi(2))).ceil() as usize;
let k = ((m as f64 / capacity as f64) * 2f64.ln()).ceil() as usize;
let m = m.max(64);
let k = k.max(1);
Self {
bits: vec![0u64; m.div_ceil(64)],
m,
k,
count: 0,
}
}
fn hash_val<T: Hash>(&self, item: &T, seed: u64) -> usize {
let mut hasher = DefaultHasher::new();
seed.hash(&mut hasher);
item.hash(&mut hasher);
(hasher.finish() as usize) % self.m
}
fn set_bit(&mut self, i: usize) {
self.bits[i / 64] |= 1u64 << (i % 64);
}
fn get_bit(&self, i: usize) -> bool {
(self.bits[i / 64] >> (i % 64)) & 1 == 1
}
/// Insert an item
pub fn insert<T: Hash>(&mut self, item: &T) {
for seed in 0..self.k as u64 {
let idx = self.hash_val(item, seed);
self.set_bit(idx);
}
self.count += 1;
}
/// Check if item might be in the set (may have false positives)
pub fn contains<T: Hash>(&self, item: &T) -> bool {
(0..self.k as u64).all(|seed| self.get_bit(self.hash_val(item, seed)))
}
/// Get number of items inserted
pub fn len(&self) -> usize {
self.count
}
/// Check if empty
pub fn is_empty(&self) -> bool {
self.count == 0
}
/// Get number of bits
pub fn bits(&self) -> usize {
self.m
}
/// Get number of hash functions
pub fn hash_count(&self) -> usize {
self.k
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_insert_contains() {
let mut bf = BloomFilter::new(100, 0.01);
bf.insert(&"hello");
bf.insert(&"world");
assert!(bf.contains(&"hello"));
assert!(bf.contains(&"world"));
}
#[test]
fn test_not_contains() {
let bf = BloomFilter::new(100, 0.01);
// Empty filter should not contain anything
assert!(!bf.contains(&"missing"));
}
#[test]
fn test_false_positive_rate() {
let mut bf = BloomFilter::new(1000, 0.01);
for i in 0..500 {
bf.insert(&i);
}
// Check false positives for items not inserted
let mut fp_count = 0;
for i in 1000..2000 {
if bf.contains(&i) {
fp_count += 1;
}
}
// Should be roughly 1% = 10 false positives out of 1000
assert!(fp_count < 50); // allow some variance
}
#[test]
fn test_len() {
let mut bf = BloomFilter::new(100, 0.01);
assert_eq!(bf.len(), 0);
bf.insert(&1);
bf.insert(&2);
assert_eq!(bf.len(), 2);
}
#[test]
fn test_parameters() {
let bf = BloomFilter::new(1000, 0.01);
assert!(bf.bits() > 0);
assert!(bf.hash_count() > 0);
}
}#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_insert_contains() {
let mut bf = BloomFilter::new(100, 0.01);
bf.insert(&"hello");
bf.insert(&"world");
assert!(bf.contains(&"hello"));
assert!(bf.contains(&"world"));
}
#[test]
fn test_not_contains() {
let bf = BloomFilter::new(100, 0.01);
// Empty filter should not contain anything
assert!(!bf.contains(&"missing"));
}
#[test]
fn test_false_positive_rate() {
let mut bf = BloomFilter::new(1000, 0.01);
for i in 0..500 {
bf.insert(&i);
}
// Check false positives for items not inserted
let mut fp_count = 0;
for i in 1000..2000 {
if bf.contains(&i) {
fp_count += 1;
}
}
// Should be roughly 1% = 10 false positives out of 1000
assert!(fp_count < 50); // allow some variance
}
#[test]
fn test_len() {
let mut bf = BloomFilter::new(100, 0.01);
assert_eq!(bf.len(), 0);
bf.insert(&1);
bf.insert(&2);
assert_eq!(bf.len(), 2);
}
#[test]
fn test_parameters() {
let bf = BloomFilter::new(1000, 0.01);
assert!(bf.bits() > 0);
assert!(bf.hash_count() > 0);
}
}
Deep Comparison
OCaml vs Rust: Bloom Filter
Both use bit arrays with multiple hash functions.
Exercises
fp_rate parameter values of 0.01, 0.05, and 0.10.