ExamplesBy LevelBy TopicLearning Paths
963 Advanced

963 Bloom Filter

Functional Programming

Tutorial

The Problem

Implement a Bloom filter — a space-efficient probabilistic data structure that tests set membership with no false negatives but controllable false positives. Use three independent hash functions and a bit array of u64 words. Implement insert (set k bits) and contains (check k bits). Demonstrate that contains never returns false for inserted items but may return true for non-inserted items.

🎯 Learning Outcomes

  • • Implement three independent hash functions (djb2, sdbm, fnv-like) using wrapping arithmetic
  • • Map hash values to bit positions: hash % num_bits → word index + bit offset
  • • Implement set_bit(pos) and test_bit(pos) using bitwise operations on u64 words
  • • Implement insert (set all k bit positions) and contains (test all k positions)
  • • Understand the false-positive probability and why contains can be wrong but insert makes contains always correct for inserted items
  • Code Example

    #![allow(clippy::all)]
    // 963: Bloom Filter
    // Probabilistic membership test: no false negatives, possible false positives
    // Uses 3 independent hash functions + bit array (u64 words)
    
    // Approach 1: Three hash functions (djb2, sdbm, fnv-like)
    fn hash1(s: &str) -> usize {
        s.bytes().fold(5381usize, |h, b| {
            h.wrapping_mul(31).wrapping_add(b as usize)
        })
    }
    
    fn hash2(s: &str) -> usize {
        s.bytes().fold(0usize, |h, b| {
            (b as usize)
                .wrapping_add(h.wrapping_shl(6))
                .wrapping_add(h.wrapping_shl(16))
                .wrapping_sub(h)
        })
    }
    
    fn hash3(s: &str) -> usize {
        s.bytes()
            .fold(0usize, |h, b| h.wrapping_mul(33) ^ (b as usize))
    }
    
    // Approach 2: Bloom filter with u64 bit array
    pub struct BloomFilter {
        bits: Vec<u64>,
        num_bits: usize,
    }
    
    impl BloomFilter {
        pub fn new(num_bits: usize) -> Self {
            let words = num_bits.div_ceil(64);
            BloomFilter {
                bits: vec![0u64; words],
                num_bits,
            }
        }
    
        fn set_bit(&mut self, i: usize) {
            let idx = i % self.num_bits;
            let word = idx / 64;
            let bit = idx % 64;
            self.bits[word] |= 1u64 << bit;
        }
    
        fn get_bit(&self, i: usize) -> bool {
            let idx = i % self.num_bits;
            let word = idx / 64;
            let bit = idx % 64;
            (self.bits[word] >> bit) & 1 == 1
        }
    
        pub fn add(&mut self, s: &str) {
            self.set_bit(hash1(s));
            self.set_bit(hash2(s));
            self.set_bit(hash3(s));
        }
    
        pub fn might_contain(&self, s: &str) -> bool {
            self.get_bit(hash1(s)) && self.get_bit(hash2(s)) && self.get_bit(hash3(s))
        }
    
        pub fn count_set_bits(&self) -> u32 {
            self.bits.iter().map(|w| w.count_ones()).sum()
        }
    
        /// Estimated false positive rate given n items inserted
        pub fn false_positive_rate(&self, n: usize) -> f64 {
            let k = 3.0_f64; // number of hash functions
            let m = self.num_bits as f64;
            (1.0 - (-k * n as f64 / m).exp()).powf(k)
        }
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_inserted_items_found() {
            let mut bf = BloomFilter::new(1024);
            let words = ["apple", "banana", "cherry", "dog", "elephant"];
            for w in &words {
                bf.add(w);
            }
            for w in &words {
                assert!(bf.might_contain(w), "must contain '{}'", w);
            }
        }
    
        #[test]
        fn test_false_positive_rate_reasonable() {
            let mut bf = BloomFilter::new(10_000);
            for i in 0..100 {
                bf.add(&format!("item_{}", i));
            }
            // Check 1000 non-inserted items, count false positives
            let fp: usize = (0..1000)
                .filter(|i| bf.might_contain(&format!("not_inserted_{}", i)))
                .count();
            // With 10000 bits, 3 hashes, 100 items → FP rate ~ 0.001
            // Allow up to 5% false positives in test
            assert!(fp < 50, "too many false positives: {}/1000", fp);
        }
    
        #[test]
        fn test_bit_operations() {
            let mut bf = BloomFilter::new(128);
            bf.set_bit(0);
            bf.set_bit(63);
            bf.set_bit(64);
            bf.set_bit(127);
            assert!(bf.get_bit(0));
            assert!(bf.get_bit(63));
            assert!(bf.get_bit(64));
            assert!(bf.get_bit(127));
            assert!(!bf.get_bit(1));
            assert!(!bf.get_bit(100));
        }
    
        #[test]
        fn test_count_set_bits() {
            let mut bf = BloomFilter::new(64);
            assert_eq!(bf.count_set_bits(), 0);
            bf.add("test");
            assert!(bf.count_set_bits() > 0);
            assert!(bf.count_set_bits() <= 3);
        }
    }

    Key Differences

    AspectRustOCaml
    Bit arrayVec<u64> — 64 bits per wordbytes — 8 bits per byte
    Overflowwrapping_mul/addland max_int or unchecked
    Bit operations\|=, &, 1u64 <<lor, land, lsl
    div_ceilBuilt-in method(n + 63) / 64 manually

    Bloom filters trade memory for exact membership. They cannot remove elements (clearing a bit might affect other entries). BloomFilter::contains returning true means "possibly inserted"; returning false means "definitely not inserted".

    OCaml Approach

    let hash1 s =
      String.fold_left (fun h b ->
        ((h * 31) + Char.code b) land max_int
      ) 5381 s
    
    type bloom_filter = {
      mutable bits: bytes;
      num_bits: int;
    }
    
    let create num_bits =
      { bits = Bytes.make ((num_bits + 7) / 8) '\000'; num_bits }
    
    let set_bit bf pos =
      let pos = pos mod bf.num_bits in
      let byte_idx = pos / 8 in
      let bit_off  = pos mod 8 in
      let current  = Char.code (Bytes.get bf.bits byte_idx) in
      Bytes.set bf.bits byte_idx (Char.chr (current lor (1 lsl bit_off)))
    
    let test_bit bf pos =
      let pos = pos mod bf.num_bits in
      let byte_idx = pos / 8 in
      let bit_off  = pos mod 8 in
      Char.code (Bytes.get bf.bits byte_idx) land (1 lsl bit_off) <> 0
    

    OCaml's bytes type provides mutable byte arrays; Rust uses Vec<u64> with 64-bit words for 8× fewer memory operations per bit access. OCaml lacks wrapping_mulland max_int truncates to 62-bit positive integers.

    Full Source

    #![allow(clippy::all)]
    // 963: Bloom Filter
    // Probabilistic membership test: no false negatives, possible false positives
    // Uses 3 independent hash functions + bit array (u64 words)
    
    // Approach 1: Three hash functions (djb2, sdbm, fnv-like)
    fn hash1(s: &str) -> usize {
        s.bytes().fold(5381usize, |h, b| {
            h.wrapping_mul(31).wrapping_add(b as usize)
        })
    }
    
    fn hash2(s: &str) -> usize {
        s.bytes().fold(0usize, |h, b| {
            (b as usize)
                .wrapping_add(h.wrapping_shl(6))
                .wrapping_add(h.wrapping_shl(16))
                .wrapping_sub(h)
        })
    }
    
    fn hash3(s: &str) -> usize {
        s.bytes()
            .fold(0usize, |h, b| h.wrapping_mul(33) ^ (b as usize))
    }
    
    // Approach 2: Bloom filter with u64 bit array
    pub struct BloomFilter {
        bits: Vec<u64>,
        num_bits: usize,
    }
    
    impl BloomFilter {
        pub fn new(num_bits: usize) -> Self {
            let words = num_bits.div_ceil(64);
            BloomFilter {
                bits: vec![0u64; words],
                num_bits,
            }
        }
    
        fn set_bit(&mut self, i: usize) {
            let idx = i % self.num_bits;
            let word = idx / 64;
            let bit = idx % 64;
            self.bits[word] |= 1u64 << bit;
        }
    
        fn get_bit(&self, i: usize) -> bool {
            let idx = i % self.num_bits;
            let word = idx / 64;
            let bit = idx % 64;
            (self.bits[word] >> bit) & 1 == 1
        }
    
        pub fn add(&mut self, s: &str) {
            self.set_bit(hash1(s));
            self.set_bit(hash2(s));
            self.set_bit(hash3(s));
        }
    
        pub fn might_contain(&self, s: &str) -> bool {
            self.get_bit(hash1(s)) && self.get_bit(hash2(s)) && self.get_bit(hash3(s))
        }
    
        pub fn count_set_bits(&self) -> u32 {
            self.bits.iter().map(|w| w.count_ones()).sum()
        }
    
        /// Estimated false positive rate given n items inserted
        pub fn false_positive_rate(&self, n: usize) -> f64 {
            let k = 3.0_f64; // number of hash functions
            let m = self.num_bits as f64;
            (1.0 - (-k * n as f64 / m).exp()).powf(k)
        }
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_inserted_items_found() {
            let mut bf = BloomFilter::new(1024);
            let words = ["apple", "banana", "cherry", "dog", "elephant"];
            for w in &words {
                bf.add(w);
            }
            for w in &words {
                assert!(bf.might_contain(w), "must contain '{}'", w);
            }
        }
    
        #[test]
        fn test_false_positive_rate_reasonable() {
            let mut bf = BloomFilter::new(10_000);
            for i in 0..100 {
                bf.add(&format!("item_{}", i));
            }
            // Check 1000 non-inserted items, count false positives
            let fp: usize = (0..1000)
                .filter(|i| bf.might_contain(&format!("not_inserted_{}", i)))
                .count();
            // With 10000 bits, 3 hashes, 100 items → FP rate ~ 0.001
            // Allow up to 5% false positives in test
            assert!(fp < 50, "too many false positives: {}/1000", fp);
        }
    
        #[test]
        fn test_bit_operations() {
            let mut bf = BloomFilter::new(128);
            bf.set_bit(0);
            bf.set_bit(63);
            bf.set_bit(64);
            bf.set_bit(127);
            assert!(bf.get_bit(0));
            assert!(bf.get_bit(63));
            assert!(bf.get_bit(64));
            assert!(bf.get_bit(127));
            assert!(!bf.get_bit(1));
            assert!(!bf.get_bit(100));
        }
    
        #[test]
        fn test_count_set_bits() {
            let mut bf = BloomFilter::new(64);
            assert_eq!(bf.count_set_bits(), 0);
            bf.add("test");
            assert!(bf.count_set_bits() > 0);
            assert!(bf.count_set_bits() <= 3);
        }
    }
    ✓ Tests Rust test suite
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_inserted_items_found() {
            let mut bf = BloomFilter::new(1024);
            let words = ["apple", "banana", "cherry", "dog", "elephant"];
            for w in &words {
                bf.add(w);
            }
            for w in &words {
                assert!(bf.might_contain(w), "must contain '{}'", w);
            }
        }
    
        #[test]
        fn test_false_positive_rate_reasonable() {
            let mut bf = BloomFilter::new(10_000);
            for i in 0..100 {
                bf.add(&format!("item_{}", i));
            }
            // Check 1000 non-inserted items, count false positives
            let fp: usize = (0..1000)
                .filter(|i| bf.might_contain(&format!("not_inserted_{}", i)))
                .count();
            // With 10000 bits, 3 hashes, 100 items → FP rate ~ 0.001
            // Allow up to 5% false positives in test
            assert!(fp < 50, "too many false positives: {}/1000", fp);
        }
    
        #[test]
        fn test_bit_operations() {
            let mut bf = BloomFilter::new(128);
            bf.set_bit(0);
            bf.set_bit(63);
            bf.set_bit(64);
            bf.set_bit(127);
            assert!(bf.get_bit(0));
            assert!(bf.get_bit(63));
            assert!(bf.get_bit(64));
            assert!(bf.get_bit(127));
            assert!(!bf.get_bit(1));
            assert!(!bf.get_bit(100));
        }
    
        #[test]
        fn test_count_set_bits() {
            let mut bf = BloomFilter::new(64);
            assert_eq!(bf.count_set_bits(), 0);
            bf.add("test");
            assert!(bf.count_set_bits() > 0);
            assert!(bf.count_set_bits() <= 3);
        }
    }

    Deep Comparison

    Bloom Filter — Comparison

    Core Insight

    A Bloom filter is a bit array + k hash functions. add(x) sets k bits; might_contain(x) checks all k bits are set. Both languages use the same math. The difference: OCaml uses a bool array (simple, 1 byte per bit) while Rust uses Vec<u64> (8x more space-efficient via bitwise operations). Rust also requires explicit wrapping_* arithmetic to avoid overflow panics.

    OCaml Approach

  • bool array — one bool per bit slot (simple but memory-wasteful)
  • String.fold_left for hash accumulation
  • abs h mod bf.size for index calculation
  • bf.bits.(i) <- true for bit setting
  • • Integer arrays for compact bit representation (manual bit ops)
  • lsl, lor, lxor, lsr — OCaml bitwise operators
  • Rust Approach

  • Vec<u64> — 64 bits per word, 8x more compact
  • wrapping_mul, wrapping_add, wrapping_shl, wrapping_sub — explicit overflow handling
  • bits[word] |= 1u64 << bit — set a bit in a u64 word
  • (bits[word] >> bit) & 1 == 1 — test a bit
  • count_ones() on u64 — popcount (hardware-accelerated)
  • • FP rate formula: (1 - e^(-k*n/m))^k
  • Comparison Table

    AspectOCamlRust
    Bit storagebool array (1 byte/bit)Vec<u64> (1 bit/bit)
    Hash foldString.fold_left (fun h c -> h*31 + code c)s.bytes().fold(h, \|h, b\| h.wrapping_mul(31).wrapping_add(b))
    OverflowSilent (OCaml int wraps)Explicit wrapping_* methods
    Bit setarr.(i) <- truearr[word] \|= 1u64 << bit
    Bit testarr.(i)(arr[word] >> bit) & 1 == 1
    PopcountManual countu64::count_ones() (hardware)
    Bitwise opslsl, lor, lxor<<, \|, ^, >>

    Exercises

  • Compute the false-positive rate for your filter: insert 100 items, test 10,000 random strings, count false positives.
  • Implement a union of two same-size Bloom filters by ORing their bit arrays.
  • Add a constructor BloomFilter::with_error_rate(n_items, error_rate) that computes optimal bit count and hash count.
  • Implement a Counting Bloom Filter using Vec<u8> counters instead of bits, supporting delete operations.
  • Benchmark Bloom filter lookup vs HashSet::contains for 10,000 items — measure memory usage of each.
  • Open Source Repos