963 Bloom Filter
Tutorial
The Problem
Implement a Bloom filter — a space-efficient probabilistic data structure that tests set membership with no false negatives but controllable false positives. Use three independent hash functions and a bit array of u64 words. Implement insert (set k bits) and contains (check k bits). Demonstrate that contains never returns false for inserted items but may return true for non-inserted items.
🎯 Learning Outcomes
hash % num_bits → word index + bit offsetset_bit(pos) and test_bit(pos) using bitwise operations on u64 wordsinsert (set all k bit positions) and contains (test all k positions)contains can be wrong but insert makes contains always correct for inserted itemsCode Example
#![allow(clippy::all)]
// 963: Bloom Filter
// Probabilistic membership test: no false negatives, possible false positives
// Uses 3 independent hash functions + bit array (u64 words)
// Approach 1: Three hash functions (djb2, sdbm, fnv-like)
fn hash1(s: &str) -> usize {
s.bytes().fold(5381usize, |h, b| {
h.wrapping_mul(31).wrapping_add(b as usize)
})
}
fn hash2(s: &str) -> usize {
s.bytes().fold(0usize, |h, b| {
(b as usize)
.wrapping_add(h.wrapping_shl(6))
.wrapping_add(h.wrapping_shl(16))
.wrapping_sub(h)
})
}
fn hash3(s: &str) -> usize {
s.bytes()
.fold(0usize, |h, b| h.wrapping_mul(33) ^ (b as usize))
}
// Approach 2: Bloom filter with u64 bit array
pub struct BloomFilter {
bits: Vec<u64>,
num_bits: usize,
}
impl BloomFilter {
pub fn new(num_bits: usize) -> Self {
let words = num_bits.div_ceil(64);
BloomFilter {
bits: vec![0u64; words],
num_bits,
}
}
fn set_bit(&mut self, i: usize) {
let idx = i % self.num_bits;
let word = idx / 64;
let bit = idx % 64;
self.bits[word] |= 1u64 << bit;
}
fn get_bit(&self, i: usize) -> bool {
let idx = i % self.num_bits;
let word = idx / 64;
let bit = idx % 64;
(self.bits[word] >> bit) & 1 == 1
}
pub fn add(&mut self, s: &str) {
self.set_bit(hash1(s));
self.set_bit(hash2(s));
self.set_bit(hash3(s));
}
pub fn might_contain(&self, s: &str) -> bool {
self.get_bit(hash1(s)) && self.get_bit(hash2(s)) && self.get_bit(hash3(s))
}
pub fn count_set_bits(&self) -> u32 {
self.bits.iter().map(|w| w.count_ones()).sum()
}
/// Estimated false positive rate given n items inserted
pub fn false_positive_rate(&self, n: usize) -> f64 {
let k = 3.0_f64; // number of hash functions
let m = self.num_bits as f64;
(1.0 - (-k * n as f64 / m).exp()).powf(k)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_inserted_items_found() {
let mut bf = BloomFilter::new(1024);
let words = ["apple", "banana", "cherry", "dog", "elephant"];
for w in &words {
bf.add(w);
}
for w in &words {
assert!(bf.might_contain(w), "must contain '{}'", w);
}
}
#[test]
fn test_false_positive_rate_reasonable() {
let mut bf = BloomFilter::new(10_000);
for i in 0..100 {
bf.add(&format!("item_{}", i));
}
// Check 1000 non-inserted items, count false positives
let fp: usize = (0..1000)
.filter(|i| bf.might_contain(&format!("not_inserted_{}", i)))
.count();
// With 10000 bits, 3 hashes, 100 items → FP rate ~ 0.001
// Allow up to 5% false positives in test
assert!(fp < 50, "too many false positives: {}/1000", fp);
}
#[test]
fn test_bit_operations() {
let mut bf = BloomFilter::new(128);
bf.set_bit(0);
bf.set_bit(63);
bf.set_bit(64);
bf.set_bit(127);
assert!(bf.get_bit(0));
assert!(bf.get_bit(63));
assert!(bf.get_bit(64));
assert!(bf.get_bit(127));
assert!(!bf.get_bit(1));
assert!(!bf.get_bit(100));
}
#[test]
fn test_count_set_bits() {
let mut bf = BloomFilter::new(64);
assert_eq!(bf.count_set_bits(), 0);
bf.add("test");
assert!(bf.count_set_bits() > 0);
assert!(bf.count_set_bits() <= 3);
}
}Key Differences
| Aspect | Rust | OCaml |
|---|---|---|
| Bit array | Vec<u64> — 64 bits per word | bytes — 8 bits per byte |
| Overflow | wrapping_mul/add | land max_int or unchecked |
| Bit operations | \|=, &, 1u64 << | lor, land, lsl |
div_ceil | Built-in method | (n + 63) / 64 manually |
Bloom filters trade memory for exact membership. They cannot remove elements (clearing a bit might affect other entries). BloomFilter::contains returning true means "possibly inserted"; returning false means "definitely not inserted".
OCaml Approach
let hash1 s =
String.fold_left (fun h b ->
((h * 31) + Char.code b) land max_int
) 5381 s
type bloom_filter = {
mutable bits: bytes;
num_bits: int;
}
let create num_bits =
{ bits = Bytes.make ((num_bits + 7) / 8) '\000'; num_bits }
let set_bit bf pos =
let pos = pos mod bf.num_bits in
let byte_idx = pos / 8 in
let bit_off = pos mod 8 in
let current = Char.code (Bytes.get bf.bits byte_idx) in
Bytes.set bf.bits byte_idx (Char.chr (current lor (1 lsl bit_off)))
let test_bit bf pos =
let pos = pos mod bf.num_bits in
let byte_idx = pos / 8 in
let bit_off = pos mod 8 in
Char.code (Bytes.get bf.bits byte_idx) land (1 lsl bit_off) <> 0
OCaml's bytes type provides mutable byte arrays; Rust uses Vec<u64> with 64-bit words for 8× fewer memory operations per bit access. OCaml lacks wrapping_mul — land max_int truncates to 62-bit positive integers.
Full Source
#![allow(clippy::all)]
// 963: Bloom Filter
// Probabilistic membership test: no false negatives, possible false positives
// Uses 3 independent hash functions + bit array (u64 words)
// Approach 1: Three hash functions (djb2, sdbm, fnv-like)
fn hash1(s: &str) -> usize {
s.bytes().fold(5381usize, |h, b| {
h.wrapping_mul(31).wrapping_add(b as usize)
})
}
fn hash2(s: &str) -> usize {
s.bytes().fold(0usize, |h, b| {
(b as usize)
.wrapping_add(h.wrapping_shl(6))
.wrapping_add(h.wrapping_shl(16))
.wrapping_sub(h)
})
}
fn hash3(s: &str) -> usize {
s.bytes()
.fold(0usize, |h, b| h.wrapping_mul(33) ^ (b as usize))
}
// Approach 2: Bloom filter with u64 bit array
pub struct BloomFilter {
bits: Vec<u64>,
num_bits: usize,
}
impl BloomFilter {
pub fn new(num_bits: usize) -> Self {
let words = num_bits.div_ceil(64);
BloomFilter {
bits: vec![0u64; words],
num_bits,
}
}
fn set_bit(&mut self, i: usize) {
let idx = i % self.num_bits;
let word = idx / 64;
let bit = idx % 64;
self.bits[word] |= 1u64 << bit;
}
fn get_bit(&self, i: usize) -> bool {
let idx = i % self.num_bits;
let word = idx / 64;
let bit = idx % 64;
(self.bits[word] >> bit) & 1 == 1
}
pub fn add(&mut self, s: &str) {
self.set_bit(hash1(s));
self.set_bit(hash2(s));
self.set_bit(hash3(s));
}
pub fn might_contain(&self, s: &str) -> bool {
self.get_bit(hash1(s)) && self.get_bit(hash2(s)) && self.get_bit(hash3(s))
}
pub fn count_set_bits(&self) -> u32 {
self.bits.iter().map(|w| w.count_ones()).sum()
}
/// Estimated false positive rate given n items inserted
pub fn false_positive_rate(&self, n: usize) -> f64 {
let k = 3.0_f64; // number of hash functions
let m = self.num_bits as f64;
(1.0 - (-k * n as f64 / m).exp()).powf(k)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_inserted_items_found() {
let mut bf = BloomFilter::new(1024);
let words = ["apple", "banana", "cherry", "dog", "elephant"];
for w in &words {
bf.add(w);
}
for w in &words {
assert!(bf.might_contain(w), "must contain '{}'", w);
}
}
#[test]
fn test_false_positive_rate_reasonable() {
let mut bf = BloomFilter::new(10_000);
for i in 0..100 {
bf.add(&format!("item_{}", i));
}
// Check 1000 non-inserted items, count false positives
let fp: usize = (0..1000)
.filter(|i| bf.might_contain(&format!("not_inserted_{}", i)))
.count();
// With 10000 bits, 3 hashes, 100 items → FP rate ~ 0.001
// Allow up to 5% false positives in test
assert!(fp < 50, "too many false positives: {}/1000", fp);
}
#[test]
fn test_bit_operations() {
let mut bf = BloomFilter::new(128);
bf.set_bit(0);
bf.set_bit(63);
bf.set_bit(64);
bf.set_bit(127);
assert!(bf.get_bit(0));
assert!(bf.get_bit(63));
assert!(bf.get_bit(64));
assert!(bf.get_bit(127));
assert!(!bf.get_bit(1));
assert!(!bf.get_bit(100));
}
#[test]
fn test_count_set_bits() {
let mut bf = BloomFilter::new(64);
assert_eq!(bf.count_set_bits(), 0);
bf.add("test");
assert!(bf.count_set_bits() > 0);
assert!(bf.count_set_bits() <= 3);
}
}#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_inserted_items_found() {
let mut bf = BloomFilter::new(1024);
let words = ["apple", "banana", "cherry", "dog", "elephant"];
for w in &words {
bf.add(w);
}
for w in &words {
assert!(bf.might_contain(w), "must contain '{}'", w);
}
}
#[test]
fn test_false_positive_rate_reasonable() {
let mut bf = BloomFilter::new(10_000);
for i in 0..100 {
bf.add(&format!("item_{}", i));
}
// Check 1000 non-inserted items, count false positives
let fp: usize = (0..1000)
.filter(|i| bf.might_contain(&format!("not_inserted_{}", i)))
.count();
// With 10000 bits, 3 hashes, 100 items → FP rate ~ 0.001
// Allow up to 5% false positives in test
assert!(fp < 50, "too many false positives: {}/1000", fp);
}
#[test]
fn test_bit_operations() {
let mut bf = BloomFilter::new(128);
bf.set_bit(0);
bf.set_bit(63);
bf.set_bit(64);
bf.set_bit(127);
assert!(bf.get_bit(0));
assert!(bf.get_bit(63));
assert!(bf.get_bit(64));
assert!(bf.get_bit(127));
assert!(!bf.get_bit(1));
assert!(!bf.get_bit(100));
}
#[test]
fn test_count_set_bits() {
let mut bf = BloomFilter::new(64);
assert_eq!(bf.count_set_bits(), 0);
bf.add("test");
assert!(bf.count_set_bits() > 0);
assert!(bf.count_set_bits() <= 3);
}
}
Deep Comparison
Bloom Filter — Comparison
Core Insight
A Bloom filter is a bit array + k hash functions. add(x) sets k bits; might_contain(x) checks all k bits are set. Both languages use the same math. The difference: OCaml uses a bool array (simple, 1 byte per bit) while Rust uses Vec<u64> (8x more space-efficient via bitwise operations). Rust also requires explicit wrapping_* arithmetic to avoid overflow panics.
OCaml Approach
bool array — one bool per bit slot (simple but memory-wasteful)String.fold_left for hash accumulationabs h mod bf.size for index calculationbf.bits.(i) <- true for bit settinglsl, lor, lxor, lsr — OCaml bitwise operatorsRust Approach
Vec<u64> — 64 bits per word, 8x more compactwrapping_mul, wrapping_add, wrapping_shl, wrapping_sub — explicit overflow handlingbits[word] |= 1u64 << bit — set a bit in a u64 word(bits[word] >> bit) & 1 == 1 — test a bitcount_ones() on u64 — popcount (hardware-accelerated)(1 - e^(-k*n/m))^kComparison Table
| Aspect | OCaml | Rust |
|---|---|---|
| Bit storage | bool array (1 byte/bit) | Vec<u64> (1 bit/bit) |
| Hash fold | String.fold_left (fun h c -> h*31 + code c) | s.bytes().fold(h, \|h, b\| h.wrapping_mul(31).wrapping_add(b)) |
| Overflow | Silent (OCaml int wraps) | Explicit wrapping_* methods |
| Bit set | arr.(i) <- true | arr[word] \|= 1u64 << bit |
| Bit test | arr.(i) | (arr[word] >> bit) & 1 == 1 |
| Popcount | Manual count | u64::count_ones() (hardware) |
| Bitwise ops | lsl, lor, lxor | <<, \|, ^, >> |
Exercises
union of two same-size Bloom filters by ORing their bit arrays.BloomFilter::with_error_rate(n_items, error_rate) that computes optimal bit count and hash count.Vec<u8> counters instead of bits, supporting delete operations.HashSet::contains for 10,000 items — measure memory usage of each.