Bloom Filter Strategy¶
Design decisions for bloom filters in nori-sstable.
Key Decisions¶
- 10 bits per key (0.9% false positive rate)
- xxHash64 with double hashing
- Whole-file bloom (not per-block)
- Loaded on open (always in RAM)
Why 10 Bits Per Key?¶
False positive rate formula:
Trade-off analysis:
100K key SSTable:
10 bits/key:
Size: 125KB
FP rate: 0.9%
Wasted reads: 900 per 100K misses
12 bits/key:
Size: 150KB (+25KB = +20%)
FP rate: 0.3%
Wasted reads: 300 per 100K misses
Savings: 600 wasted reads
Cost: 25KB more memory
Value: 600 * 100µs = 60ms saved
Verdict: Not worth 20% more memory for 60ms
Decision: 10 bits/key is the sweet spot.
Why xxHash64?¶
Requirements: - Fast (billions of keys to hash) - Good distribution (low collision) - Non-cryptographic OK (not for security)
Comparison:
| Hash | Speed | Quality | Choice |
|---|---|---|---|
| CRC32 | 10 GB/s | Medium | Too simple |
| xxHash64 | 10 GB/s | Excellent | Chosen |
| SipHash | 3 GB/s | Excellent | Too slow |
| SHA-256 | 0.5 GB/s | Excellent | Overkill |
xxHash64 wins: Speed of CRC32, quality of cryptographic hashes.
Double Hashing Trick¶
Problem: Need k=7 hashes per key.
Naive: Compute 7 independent hashes (slow)
Double hashing: Generate k hashes from 2:
Speed: - Compute 2 hashes: 2 * 100ns = 200ns - Generate 7 positions: 7 * 5ns = 35ns - Total: 235ns (vs 700ns for 7 independent hashes)
Trade-off: Slightly higher collision probability (negligible in practice).
Whole-File vs Per-Block Bloom¶
Alternatives:
1. Whole-File Bloom (Chosen)¶
Pros: - Simple implementation - Lower FP rate (larger filter) - Always in memory (no lookup needed)
Cons: - Must load entire bloom on open - Memory proportional to SSTable size
2. Per-Block Bloom¶
One bloom filter per 4KB block
Size: 10 bits * keys_per_block * num_blocks
Load: On-demand per block
Pros: - Only load needed blooms - Better locality
Cons: - Higher FP rate (smaller filters) - More complex implementation - Need to load bloom before checking
Decision: Whole-file for simplicity. Bloom size (125KB per 100K keys) is acceptable memory overhead.
Load on Open vs On-Demand¶
Always load on open:
impl SSTableReader {
pub async fn open(path: PathBuf) -> Result<Self> {
// Load bloom filter into memory
let bloom = read_bloom_filter(&file).await?;
Ok(Self { bloom, ... })
}
}
Benefits: - ~67ns check (in-memory) - No lazy loading complexity - Predictable memory usage
Cost: Initial load time (~1ms per MB of bloom)
Alternative: Lazy load bloom on first get() - More complex - Unpredictable first-query latency - Bloom small enough to always load
Summary¶
Design choices:
10 bits/key (0.9% FP, 125KB per 100K keys) xxHash64 (10 GB/s, excellent distribution) Double hashing (2 hashes → 7 positions) Whole-file bloom (simple, low FP rate) Loaded on open (always in RAM, ~67ns checks)
Result: ~67ns checks prevent 100µs disk reads with <1% false positives.