Block-Based Storage¶

Why SSTables organize data into fixed-size blocks and how this design enables efficient I/O and caching.

The Core Design¶

SSTables divide data into fixed-size blocks (typically 4KB). Each block contains multiple key-value entries, compressed independently, and serves as the atomic unit of I/O and caching.

SSTable File:
┌──────────────┐
│  Block 0     │ ← 4KB (compressed)
│  Block 1     │ ← 4KB (compressed)
│  Block 2     │ ← 4KB (compressed)
│     ...      │
│  Block N     │ ← 4KB (compressed)
├──────────────┤
│  Index       │ ← Maps keys → block numbers
│  Bloom       │
│  Footer      │
└──────────────┘

Why Blocks?¶

Alternative 1: Variable-Length Entries (No Blocks)¶

File: [entry1][entry2][entry3]...[entryN]

Problems: - Read key 1000 → Must scan from beginning - Random access requires full file index (memory overhead) - Cache entire file or nothing (no granularity)

Alternative 2: One Block Per Entry¶

File: [block:entry1][block:entry2]...[block:entryN]

Problems: - Massive overhead (footer per entry) - Poor compression (can't compress across entries) - Index size explodes (one entry per key)

Solution: Fixed-Size Blocks¶

Block: [entry1][entry2]...[entryK] ← ~4KB total

Benefits: - Random access via block index (log B lookups) - Cache at block granularity (balance memory vs I/O) - Good compression (compress ~100 entries together) - Reasonable index size (one entry per block)

Block Size Trade-Offs¶

Small Blocks (1KB-2KB)¶

Advantages: - Fine-grained caching (less wasted memory) - Lower read amplification per key - Faster decompression per block

Disadvantages: - More index entries (larger index) - Poor compression ratio (less data to compress together) - More overhead (footer, checksum per block)

Large Blocks (16KB-64KB)¶

Advantages: - Better compression ratio (more context) - Smaller index (fewer blocks) - Less overhead per byte

Disadvantages: - Coarse-grained caching (waste memory on unused entries) - Higher read amplification (read 64KB to get 1 key) - Slower decompression (more data to decompress)

The Sweet Spot: 4KB¶

nori-sstable uses 4KB blocks by default:

pub const DEFAULT_BLOCK_SIZE: usize = 4096;

Why 4KB? - Matches OS page size (aligned I/O) - Contains ~50-200 entries (good compression) - Small enough for efficient caching - Large enough to amortize overhead

Performance: - Compression ratio: 2-3x (LZ4) - Decompress time: < 1µs (4KB @ 3.9 GB/s) - Cache overhead: Acceptable for 64MB cache (~16K blocks)

Block Internal Structure¶

Entry Layout Within Block¶

Block (4KB):
┌────────────────────────────────────────┐
│ Restart Points Array                   │
│   restart[0] = offset 0                │
│   restart[1] = offset 512              │
│   restart[2] = offset 1024             │
│   ...                                  │
├────────────────────────────────────────┤
│ Entries (prefix-compressed)            │
│                                        │
│  Entry 0: shared=0, unshared=10, val=5 │
│           key="user:alice", val="..."  │
│                                        │
│  Entry 1: shared=5, unshared=3, val=5  │
│           key="bob" (→ "user:bob")     │
│                                        │
│  Entry 2: shared=5, unshared=7, val=5  │
│           key="charlie" (→ "user:charlie")│
│  ...                                   │
│                                        │
│  Entry 16: shared=0, unshared=11, val=5│ ← Restart point
│            key="user:david", val="..." │
│  ...                                   │
└────────────────────────────────────────┘

Restart Points¶

Purpose: Enable binary search within compressed blocks.

// Every 16 entries = restart point (full key)
pub const RESTART_INTERVAL: usize = 16;

Why needed? - Entries are prefix-compressed - To decode entry N, need entry N-1 - Restart points have full keys (no dependency) - Binary search jumps to restart points

Trade-off: - More restarts = faster search, worse compression - Fewer restarts = better compression, slower search - Default 16 = good balance

Block-Level I/O¶

Reading a Single Key¶

1. User: sst.get(b"user:bob")
2. Check bloom filter (~67ns)
3. Search index → Block 5 contains range ["user:alice" ... "user:david"]
4. Read block 5 (4KB)
   - Check cache first
   - If miss: read from disk
5. Decompress block 5 (<1µs)
6. Binary search within block
7. Return value

Key insight: Block is the atomic unit of I/O. Even if you need 1 entry, you read the whole 4KB block.

Read Amplification¶

Key size: 20 bytes
Value size: 100 bytes
Entry size: ~120 bytes

Useful data: 120 bytes
Actual read: 4096 bytes

Read amplification: 4096 / 120 = 34x

Mitigation: Caching amortizes this over multiple reads.

Block-Level Caching¶

Why Cache Blocks (Not Entries)?¶

// Good: Cache at block granularity
cache.insert(block_id, decompressed_block);

// Bad: Cache individual entries
cache.insert(key, value);  //

Reasons:

Spatial locality: If you read "user:bob", likely to read "user:alice" next
Decompression cost: Decompress once, use many times
Memory efficiency: 4KB block contains ~50-200 entries
Simpler eviction: LRU at block level

Cache Hit Patterns¶

Cold cache:
  get("key1") → cache miss → read block 0 → decompress
  get("key2") → cache miss → read block 0 again → decompress

Warm cache:
  get("key1") → cache miss → read block 0 → decompress → cache
  get("key2") → cache HIT → return from cache (~100ns)
  get("key3") → cache HIT (if in same block)

80/20 rule: 20% of blocks contain 80% of keys accessed.

Block-Level Compression¶

Why Compress Per-Block?¶

Alternative 1: Compress whole file
  - Read 1 key → decompress entire file (slow!)

Alternative 2: Compress per-entry
  - Poor ratio (no cross-entry patterns)
  - More overhead

Block-level:
  - Read 1 key → decompress 4KB (fast)
  - Good ratio (compress ~100 entries together)

Compression + Caching Synergy¶

Disk: Block 5 (compressed, 1.2KB LZ4)
        ↓ read from disk
RAM: Block 5 (decompressed, 4KB in cache)
        ↓ many reads
Users get 4KB of decompressed data at <100ns latency

Key insight: Cache stores decompressed blocks. Decompress once, serve many times.

Example: LZ4 on 4KB Block¶

Uncompressed block: 4096 bytes
  Entry 1: "user:alice" → "data..."
  Entry 2: "user:bob"   → "data..."
  ...
  Entry 50: "user:zach" → "data..."

LZ4 compressed: ~1400 bytes (2.9x ratio)
  - Common prefix "user:" compressed
  - Repeated value patterns compressed
  - Dictionary built across all entries

Decompress time: ~1µs (4KB @ 3.9 GB/s)

Index Structure¶

Block Index¶

Index (in SSTable footer region):
┌────────────────────────────────────┐
│ Block 0: first_key="aaa", offset=0    │
│ Block 1: first_key="bbb", offset=4096 │
│ Block 2: first_key="ccc", offset=8192 │
│ ...                                │
│ Block N: first_key="zzz", offset=...  │
└────────────────────────────────────┘

Search algorithm:

fn find_block(key: &[u8]) -> BlockId {
    // Binary search in index
    index.binary_search(|block| block.first_key.cmp(key))
}

Complexity: O(log B) where B = number of blocks

Example: - 1MB SSTable / 4KB blocks = 256 blocks - Binary search: log2(256) = 8 comparisons

Comparison: Entry-Level Index¶

Alternative: Index every key
  - 100K entries → 100K index entries
  - Index size: ~10MB (key + offset per entry)
  - Memory overhead: 10x data size!

Block-level index:
  - 100K entries / 50 per block = 2000 blocks
  - Index size: ~200KB
  - Memory overhead: 2% of data size Yes

Real-World Example¶

Scenario: 1GB SSTable¶

Configuration:
  - Block size: 4KB
  - Average entry: 120 bytes
  - Entries per block: ~34
  - Compression (LZ4): 2.5x

Math:
  - Uncompressed blocks: 1GB / 4KB = 256K blocks
  - Compressed on disk: 1GB / 2.5 = 400MB
  - Index size: 256K * 32 bytes = 8MB
  - Total file size: 400MB + 8MB = 408MB

Read Performance¶

Point lookup (cache miss):
  1. Bloom filter check: ~67ns
  2. Index binary search: log2(256K) = 18 comparisons (~500ns)
  3. Read compressed block from SSD: ~100µs
  4. Decompress 4KB block: ~1µs
  5. Binary search in block: ~200ns
  Total: ~102µs

Point lookup (cache hit):
  1. Bloom filter: ~67ns
  2. Index search: ~500ns
  3. Cache lookup: ~100ns
  4. Binary search in cached block: ~200ns
  Total: ~900ns (100x faster!)

Design Alternatives Considered¶

1. Variable-Size Blocks¶

Idea: Pack blocks until ~4KB, but allow variance.

Pros: - No entry splitting across blocks - Potentially better compression

Cons: - Complex index (size per block) - Unpredictable cache sizing - Harder to align with OS pages

Decision: Fixed 4KB for simplicity and predictability.

2. Nested Blocks (2-Level)¶

Idea: 64KB "mega-blocks" containing 16x 4KB sub-blocks.

Pros: - Even better compression (more context) - Smaller index (fewer mega-blocks)

Cons: - Higher read amplification - More complex implementation - 64KB cache entries wasteful

Decision: Single-level 4KB blocks for simplicity.

3. Adaptive Block Sizing¶

Idea: Small blocks for hot data, large for cold.

Pros: - Optimize for access patterns

Cons: - Very complex - Unknown access patterns at write time - Index becomes complex

Decision: Fixed 4KB, let caching handle hot/cold.

Performance Impact¶

Benchmark: Block Size Comparison¶

Block Size	Index Size	Compression Ratio	Cache Hit Latency	Miss Latency
1KB	32MB	2.0x	400ns	80µs
4KB	8MB	2.5x	900ns	102µs
16KB	2MB	3.0x	2.5µs	150µs
64KB	512KB	3.5x	8µs	280µs

Optimal: 4KB balances all factors.

Implementation Details¶

Block Building¶

pub struct BlockBuilder {
    buffer: Vec<u8>,
    restarts: Vec<u32>,
    counter: usize,
    last_key: Bytes,
}

impl BlockBuilder {
    pub fn add(&mut self, entry: &Entry) {
        if self.counter % RESTART_INTERVAL == 0 {
            // Restart point: full key
            self.restarts.push(self.buffer.len() as u32);
            self.write_entry(entry, 0); // shared_len = 0
        } else {
            // Compressed entry
            let shared = common_prefix_len(&self.last_key, &entry.key);
            self.write_entry(entry, shared);
        }
        self.counter += 1;
        self.last_key = entry.key.clone();
    }

    pub fn finish(self) -> Vec<u8> {
        let mut block = self.buffer;
        // Append restart points
        for offset in self.restarts {
            block.extend_from_slice(&offset.to_le_bytes());
        }
        block.extend_from_slice(&(self.restarts.len() as u32).to_le_bytes());
        block
    }
}

Best Practices¶

1. Use Default 4KB¶

// Recommended
SSTableConfig {
    block_size: 4096,  // Default
    ..Default::default()
}

// Only change if:
// - Ultra-small keys/values (try 2KB)
// - Ultra-large values (try 8KB)
// - Profiling shows bottleneck

2. Size Cache Based on Working Set¶

Working set = hot blocks * 4KB
Cache size ≈ 1.5x working set

Example:
  10K hot keys / 50 per block = 200 hot blocks
  200 blocks * 4KB = 800KB working set
  Cache size: 1-2MB sufficient

3. Monitor Block Cache Hit Rate¶

let hit_rate = cache_hits / (cache_hits + cache_misses);

// Target: >80% for hot workloads
if hit_rate < 0.8 {
    // Increase cache size
}

Summary¶

Block-based storage is fundamental to SSTable design:

Efficient I/O - 4KB aligned with OS pages Good compression - Compress ~50-200 entries together Practical caching - Cache at block granularity Fast lookups - Small block index (O(log B)) Simple design - Fixed size, no complex logic

Key insight: Blocks are the quantum of I/O. Choose block size to balance compression ratio, cache efficiency, and read amplification.

Next Steps¶

Understand the file format: See How It Works: File Format for detailed format specification.

Learn about caching: Check Caching for how the LRU cache works with blocks.

Optimize compression: Read Compression for how blocks are compressed.