Bloom Filters

The probabilistic bouncer

Prerequisites

graph TD
  Prob[Probability basics] --> Bloom[Bloom Filters]
  Hash[Hash functions] --> Bloom
  Bloom --> CMS[Count-Min Sketch]
  Bloom --> HLL[HyperLogLog]

Quickstart

cd topics/BLOOM_FILTERS
python bloom_filter.py
# Optional: pytest test_bloom_parity.py -q
cargo test --manifest-path ../website/wasm/bloom_filter/Cargo.toml  # after WASM crate exists

Simple Fundamental Explanation

Imagine you’re running an exclusive club. You have a list of millions of VIP members. Checking the giant guestbook every time someone arrives takes too long. Instead, you use a clever system:

When someone becomes a VIP, you tell three different bouncers their name. Each bouncer has their own unique way of turning a name into a spot on a giant shared chalkboard, and each makes a checkmark at their spot. When someone tries to enter, you ask the three bouncers, “Is your spot for this name checked?”

If any bouncer says “No,” you know with 100% certainty the person is NOT a VIP.
If all bouncers say “Yes,” they are probably a VIP.

There’s a tiny chance the bouncers confused them with other people whose checkmarks overlapped, but you’ll never accidentally turn away a true VIP.

This is a Bloom Filter: a space-efficient probabilistic data structure that tests whether an element is a member of a set. It gives you two possible answers:

Definitely Not in the set (No false negatives)
Probably in the set (Possible false positives)

Deep Dive: How It Works Under the Hood

A Bloom Filter consists of two main parts:

A Bit Array of size $m$, initialized to all zeros.
A set of $k$ different Hash Functions. Each function maps an input value to one of the $m$ array positions, ideally uniformly at random and independently of the other functions.

Insertion

When you want to add an item to the Bloom Filter:

Feed the item to all $k$ hash functions.
Get $k$ array indices.
Set the bits at all these indices to 1.

Querying (Membership Test)

When you want to check if an item exists:

Feed the item to the same $k$ hash functions.
Check the bits at the resulting indices.
If any bit is 0, the item is definitely not in the set.
If all bits are 1, the item is probably in the set.
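
The insertion and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's bloom_filter.py: the class name `BloomFilter` and method names `add` and `might_contain` are assumptions made for this example, and it uses the salted-MD5 scheme this page describes for the Python code.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k salted MD5 hashes."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity rather than compactness

    def _indices(self, item: str):
        # Append the hash number i to the key to simulate k hash functions.
        for i in range(self.k):
            digest = hashlib.md5(f"{item}{i}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # A single zero bit proves the item was never added.
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=1000, k=3)
for name in ["alice", "bob", "carol"]:
    bf.add(name)

print(bf.might_contain("alice"))    # True: inserted items are always reported
print(bf.might_contain("mallory"))  # usually False; True would be a false positive
```

A production filter would pack the bits eight to a byte; the bytearray here trades memory for readability.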

Visual Diagram

Initial State (m=10 bits, k=2 hash functions):
[ 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 ]

Insert "Apple":
Hash1("Apple") = 2
Hash2("Apple") = 7
[ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 ]
          ^                   ^

Insert "Banana":
Hash1("Banana") = 5
Hash2("Banana") = 7  <-- Collision with Apple!
[ 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 ]
                      ^       ^

Check "Cherry" (Not inserted):
Hash1("Cherry") = 2
Hash2("Cherry") = 5
Bits 2 and 5 are both 1 (set by Apple and Banana), so even though
"Cherry" was never inserted, the filter says "Probably Present" -> False Positive!

Check "Orange" (Not inserted):
Hash1("Orange") = 4
Hash2("Orange") = 7
Bit 4 is 0.
The filter says "Definitely Not Present" -> True Negative!
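
The walkthrough above can be replayed in code. In this sketch the hash values are hard-coded lookup tables copied from the diagram (an assumption for illustration; a real filter computes them), so the output matches the diagram exactly.

```python
# Toy hash tables copied from the diagram; real filters compute these values.
HASH1 = {"Apple": 2, "Banana": 5, "Cherry": 2, "Orange": 4}
HASH2 = {"Apple": 7, "Banana": 7, "Cherry": 5, "Orange": 7}

bits = [0] * 10  # m = 10, k = 2


def insert(item: str) -> None:
    bits[HASH1[item]] = 1
    bits[HASH2[item]] = 1


def query(item: str) -> bool:
    # "Probably present" only if both hashed bits are set.
    return bits[HASH1[item]] == 1 and bits[HASH2[item]] == 1


insert("Apple")
insert("Banana")
print(bits)             # [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
print(query("Cherry"))  # True  -> false positive (never inserted)
print(query("Orange"))  # False -> true negative (bit 4 is 0)
```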

Practical Example: Malicious URL Checker

Browsers such as Google Chrome have used Bloom Filters to check whether a URL is malicious before loading it.

Storing every known malicious URL locally could take gigabytes of RAM. Instead, the browser downloads a compact Bloom Filter (a few megabytes) representing all known malicious URLs.

When you click a link:

Chrome checks the URL against the local Bloom Filter.
If it says “Definitely Not”, the page loads without any network check. This is the outcome for the vast majority of URLs.
If it says “Probably Yes”, Chrome makes a slower network request to Google’s servers to get the exact answer (resolving the false positive).
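
The two-tier flow above might look like the following sketch. The function `is_malicious`, the stand-in filter, the stand-in remote lookup, and the example URLs are all hypothetical; they illustrate the control flow only and do not reflect any browser's actual API.

```python
from typing import Callable


def is_malicious(
    url: str,
    maybe_in_filter: Callable[[str], bool],
    exact_lookup: Callable[[str], bool],
) -> bool:
    """Two-tier check: cheap local filter first, exact lookup only on 'maybe'."""
    if not maybe_in_filter(url):
        return False  # definitely not listed, so no network round trip
    return exact_lookup(url)  # settles true hits and false positives alike


# Stand-ins for illustration only (hypothetical data, not a real blocklist).
BLOCKLIST = {"http://bad.example/phish"}
FILTER_HITS = BLOCKLIST | {"http://ok.example/home"}  # one simulated false positive
remote_calls = []


def fake_filter(url: str) -> bool:
    return url in FILTER_HITS


def fake_remote(url: str) -> bool:
    remote_calls.append(url)  # record each slow-path lookup
    return url in BLOCKLIST


print(is_malicious("http://fine.example/", fake_filter, fake_remote))      # False, no remote call
print(is_malicious("http://ok.example/home", fake_filter, fake_remote))    # False, after a remote call
print(is_malicious("http://bad.example/phish", fake_filter, fake_remote))  # True
print(len(remote_calls))                                                   # 2
```

Only URLs that the filter flags pay for the slow lookup, which is why a low false positive rate matters here.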

The Mathematics: Equations and In-Depth Analysis

The core mathematical challenge of a Bloom Filter is minimizing the False Positive Probability ( $P$ ).

The probability of a false positive depends on three variables:

$m$ : Number of bits in the array
$n$ : Number of items inserted
$k$ : Number of hash functions used

1. Probability a bit is still 0

When inserting a single item, a specific hash function sets one bit. The probability it does not set a specific bit is: $1 - \frac{1}{m}$

If we have $k$ hash functions, the probability that a specific bit is still 0 after one insertion is: $(1 - \frac{1}{m})^{k}$

After inserting $n$ items, the probability that a specific bit is still 0 is: $(1 - \frac{1}{m})^{k n}$

2. Probability a bit is 1

Therefore, the probability that a specific bit is 1 after $n$ insertions is: $1 - (1 - \frac{1}{m})^{k n}$

For large $m$, $(1 - \frac{1}{m})^{k n} \approx e^{- k n / m}$ (from the first-order approximation $1 - x \approx e^{- x}$ with $x = 1/m$), so the probability that a bit is 1 is approximately: $1 - e^{- k n / m}$
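
A quick numeric check shows how close the exponential approximation is to the exact expression. The values of `m`, `k`, and `n` below are arbitrary examples chosen for illustration.

```python
import math

m, k, n = 1024, 7, 100  # example values, not taken from the text

exact = 1 - (1 - 1 / m) ** (k * n)   # exact probability that a bit is 1
approx = 1 - math.exp(-k * n / m)    # exponential approximation

print(f"exact  = {exact:.6f}")
print(f"approx = {approx:.6f}")
print(f"abs difference = {abs(exact - approx):.2e}")
```

The gap shrinks further as $m$ grows, which is why the approximate form is used in the rest of the analysis.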

3. False Positive Rate ( $P$ )

A false positive occurs if, when checking an item, all $k$ hashed bits happen to be 1. The probability of this is the probability that one bit is 1, raised to the power of $k$ :

$$P \approx (1 - e^{- k n / m})^{k}$$
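
As a sanity check, the formula can be evaluated directly. The helper name `false_positive_rate` is an assumption for this sketch; the inputs are the Lab defaults shown later on this page (40 bits, 10 keys, 2 hash functions).

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


# Lab defaults from this page: m = 40 bits, n = 10 keys, k = 2 hashes.
print(f"{false_positive_rate(m=40, n=10, k=2):.2%}")  # 15.48%
```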

Optimizing $k$

To minimize the false positive rate for a given $m$ and $n$ , the optimal number of hash functions $k$ is:

$$k = \frac{m}{n} \ln 2$$

This means that in an optimal Bloom Filter, roughly 50% of the bits are set to 1.
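
A short derivation sketch shows where the optimum and the 50% fill come from. Here $p$ (a symbol introduced only for this sketch) is the probability that a given bit is still 0.

```latex
\begin{aligned}
p &= e^{-kn/m} \quad\Rightarrow\quad k = -\tfrac{m}{n}\,\ln p \\
\ln P &\approx k \,\ln(1 - p) = -\tfrac{m}{n}\,\ln p \,\ln(1 - p) \\
&\text{This is symmetric in } p \text{ and } 1 - p \text{ and is minimized at } p = \tfrac{1}{2}. \\
k_{\text{opt}} &= \tfrac{m}{n}\,\ln 2, \qquad
P_{\min} \approx \left(\tfrac{1}{2}\right)^{k_{\text{opt}}}
         = \left(\tfrac{1}{2}\right)^{(m/n)\ln 2}
         \approx 0.6185^{\,m/n}
\end{aligned}
```

Since $p = \tfrac{1}{2}$ at the optimum, about half the bits are 0 and half are 1. In practice $k$ must be an integer, so the computed value is rounded to a nearby whole number.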

Why It Scales

The space required, $m$, scales linearly with $n$. To maintain a 1% false positive rate ( $P = 0.01$ ) with the optimal $k$, you need roughly $m = 9.6 n$ bits. This is less than 10 bits per item, regardless of the size of the item! Storing a billion 100-byte strings verbatim takes 100 GB. A Bloom filter with a 1% error rate takes about 1.2 GB.
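
The sizing arithmetic above can be reproduced with a small helper. Substituting the optimal $k$ into the false positive formula gives the standard result $m = -n \ln P / (\ln 2)^2$; the function name `size_for` is an assumption made for this sketch.

```python
import math


def size_for(n: int, p: float) -> tuple[int, int]:
    """Return (m bits, k hashes) for n items at target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))  # optimal k, rounded to an integer
    return m, k


n = 1_000_000_000
m, k = size_for(n=n, p=0.01)
print(f"bits per item: {m / n:.2f}")   # 9.59
print(f"hash functions: {k}")          # 7
print(f"size: {m / 8 / 1e9:.2f} GB")   # 1.20 GB
```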

Hash Functions under the Hood (Implementation Parity)

There is a subtle, high-performance optimization in how Bloom Filters are built in real-world environments compared to simple simulations:

Independent Salted Hashing (Python & JS Sim): In the Python code (bloom_filter.py), $k$ independent hash functions are simulated by appending a seed integer $i$ to the input key string and hashing the result with MD5: $\mathrm{Hash}_{i}(x) = \mathrm{MD5}(x + \mathrm{str}(i)) \bmod m$. This is simple to implement, but running MD5 $k$ times for every stream event can become a computational bottleneck.
Double Hashing (Rust Port): The Rust implementation (bloom_filter.rs) uses the Kirsch-Mitzenmacher optimization (double hashing). Kirsch and Mitzenmacher showed that $k$ hash functions can be simulated from just two base hash values ( $h_{1}$ and $h_{2}$ ) combined linearly, without increasing the asymptotic false positive rate: $\mathrm{Hash}_{i}(x) = (h_{1}(x) + i \cdot h_{2}(x)) \bmod m$. In our Rust code, $h_{1}$ and $h_{2}$ are generated using the standard DefaultHasher (SipHash) with two different seeds. This requires only two hashing passes instead of $k$, which cuts the hashing work per operation.
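
For readers following the Python code, double hashing can be sketched as below. This is not a port of bloom_filter.rs: the function name `double_hash_indices` is hypothetical, and a single SHA-256 digest split into two 64-bit halves stands in for the two seeded SipHash values, so the indices will not match the Rust output.

```python
import hashlib


def double_hash_indices(item: str, k: int, m: int) -> list[int]:
    """Derive k indices from two base hashes: (h1 + i * h2) mod m."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")    # first 64 bits
    h2 = int.from_bytes(digest[8:16], "big")  # next 64 bits
    return [(h1 + i * h2) % m for i in range(k)]


print(double_hash_indices("apple", k=7, m=1024))
```

One design caveat: if $h_{2}$ happens to be a multiple of $m$, all $k$ indices collapse to the same bit. Some implementations guard against this, for example by forcing $h_{2}$ to be odd when $m$ is a power of two.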

[!NOTE] Because the hashing schemes differ, a key inserted in Python or JavaScript sets different bit indices than the same key in Rust. Both filters have the same expected false positive rate, but their individual bit arrays will differ.

Benchmarks

Scenario	m	n	k	Notes
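Baseline	1024	500	7	k = 7 is the optimum for a 1% FP target

Note: with m = 1024 and n = 500 (about 2 bits per key), the formula $P \approx (1 - e^{- k n / m})^{k}$ predicts a false positive rate near 79%, not 1%. Reaching 1% at n = 500 requires roughly m = 4,800 bits.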

Reproduce (Python): pytest topics/BLOOM_FILTERS/test_bloom_parity.py --benchmark-only (after adding pytest-benchmark).
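
Independently of the pytest benchmark, the false positive rate for the baseline row can be measured with a short self-contained script. This is a sketch under stated assumptions: it re-implements the salted-MD5 recipe described on this page rather than importing bloom_filter.py, and the key names `member-*` and `absent-*` are made up for the experiment.

```python
import hashlib
import math

M, N, K = 1024, 500, 7  # baseline row: m, n, k
PROBES = 10_000


def indices(key: str):
    # Salted MD5: append the hash number i to the key, then reduce mod M.
    for i in range(K):
        digest = hashlib.md5(f"{key}{i}".encode("utf-8")).digest()
        yield int.from_bytes(digest, "big") % M


bits = bytearray(M)
for j in range(N):
    for idx in indices(f"member-{j}"):
        bits[idx] = 1

# Probe keys that were never inserted and count how many look present.
false_positives = sum(
    all(bits[idx] for idx in indices(f"absent-{j}")) for j in range(PROBES)
)
measured = false_positives / PROBES
predicted = (1 - math.exp(-K * N / M)) ** K  # about 0.79 for these parameters

print(f"measured  FP rate: {measured:.3f}")
print(f"predicted FP rate: {predicted:.3f}")
```

Changing `M` to a value near 4800 should bring both numbers close to the 1% target.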

Interactive version: Bloom filters (web) — update hostname after Cloudflare custom domain is set.

Lab

A Bloom filter can say definitely not or maybe yes, but it never gives a false negative. Insert too many keys or use too few bits, and absent keys start looking present.

Default settings:
m (bit array size): 40 bits
k (hash functions): 2
n (inserted keys): 10, using MD5-based hashes with the same recipe as the Python topic code
Optimal k for the current fill: about 3 (try matching it)
Legend: bit set, probe path, empty

With 10 keys in 40 bits and k = 2, the estimated false positive rate is 15.48%. Probing "bluff": the Bloom filter answers maybe yes; the exact set answers not a member.

Probe a key

Before checking: for key "bluff", will the Bloom Filter report it as maybe present or definitely not present?
