What is an SSTable?¶

Understanding the fundamental concept behind Sorted String Tables.

Definition¶

An SSTable (Sorted String Table) is an immutable, on-disk data structure that stores a sorted collection of key-value pairs. SSTables are a core component of LSM-tree (Log-Structured Merge-tree) storage engines, used in databases like LevelDB, RocksDB, Cassandra, and now NoriKV.

Key Properties¶

Sorted: Keys are stored in lexicographic order
Immutable: Once written, the file never changes
Self-contained: Includes data, index, bloom filter, and metadata
Block-based: Data organized into fixed-size blocks (typically 4KB)

Why SSTables Exist¶

The Problem: Random Writes are Slow¶

Traditional B-tree databases perform random writes directly to disk: - Each write requires reading a page, modifying it, writing it back - Causes disk head seeks (slow on HDDs) - Even on SSDs, random writes have higher write amplification

The Solution: Sequential Writes¶

SSTables enable a write-optimized storage approach:

Writes go to memory first (memtable)
When memory fills, flush sorted data to a new SSTable
All writes to an SSTable are sequential (fast!)
Reads check multiple SSTables and merge results

Result: 10-100x faster write throughput

How SSTables Work¶

Write Path¶

User writes → Memtable (in-memory sorted tree)
                  ↓ (when full)
              SSTableBuilder
                  ↓
    ┌─────────────┴─────────────┐
    │  1. Sort all entries      │
    │  2. Build blocks (4KB)    │
    │  3. Build bloom filter    │
    │  4. Build index           │
    │  5. Write footer          │
    └───────────────────────────┘
                  ↓
         Immutable SSTable file

Read Path¶

User reads key "user:42"
    ↓
Check bloom filter (67ns)
    ↓ (probably exists)
Search index (binary search)
    ↓ (find block 15)
Read block 15 from disk or cache
    ↓ (decompress if needed)
Binary search within block
    ↓
Return value

Typical read latency: ~5-10µs (cache hit), ~100µs-1ms (disk read)

SSTable File Structure¶

┌────────────────────────────────────┐
│ Data Blocks (sorted key-value)    │
│                                    │
│  Block 0: [aaa...azz] (4KB)       │
│  Block 1: [baa...bzz] (4KB)       │
│  Block 2: [caa...czz] (4KB)       │
│  ...                               │
│  Block N: [yaa...zzz] (4KB)       │
│                                    │
├────────────────────────────────────┤
│ Index (points to blocks)           │
│  "aaa" → Block 0, offset 0         │
│  "baa" → Block 1, offset 4096      │
│  "caa" → Block 2, offset 8192      │
│  ...                               │
├────────────────────────────────────┤
│ Bloom Filter                       │
│  (probabilistic set membership)    │
├────────────────────────────────────┤
│ Footer (64 bytes)                  │
│  - Magic number (validation)       │
│  - Index offset/size               │
│  - Bloom filter offset/size        │
│  - Compression type                │
│  - CRC32C checksum                 │
└────────────────────────────────────┘

Size: Typically 1MB-256MB per SSTable

SSTables in LSM-Trees¶

SSTables power LSM-tree storage engines by enabling:

Leveled Compaction¶

Level 0: [SST1] [SST2] [SST3]  ← Recent, may overlap
            ↓ compact
Level 1: [SST4 ─────] [SST5 ─────]  ← No overlaps, 10x size
            ↓ compact
Level 2: [SST6 ──────────────]  ← No overlaps, 100x size

Level 0: Fresh SSTables from memtable flushes
Level 1+: Merged, non-overlapping SSTables
Compaction: Background process that merges SSTables

Point Reads¶

// Check newest to oldest
if let Some(val) = memtable.get(key) { return val; }
for sst in level_0.iter() {
    if sst.bloom_contains(key) && sst.get(key).is_some() {
        return val;
    }
}
for sst in level_1.iter() { /* ... */ }
// ...

Range Scans¶

// Merge iterators from all levels
let mut heap = MergeHeap::new();
heap.push(memtable.iter());
for sst in all_sstables {
    heap.push(sst.iter_range(start, end));
}

while let Some((key, value)) = heap.pop() {
    // Return newest version of each key
    yield (key, value);
}

Comparison to B-Trees¶

Aspect	SSTable (LSM)	B-Tree
Write pattern	Sequential	Random
Write throughput	High (10-100K/sec)	Medium (1-10K/sec)
Read throughput	Medium (need to check multiple files)	High (single lookup)
Space amplification	Medium (1.1-1.5x)	Low (1.0-1.1x)
Compaction	Background cost	None
Use case	Write-heavy, append-mostly	Read-heavy, updates

Key insight: SSTables trade read complexity for write performance.

Real-World Analogies¶

Library Card Catalog¶

Think of each SSTable as a card catalog drawer:

Each drawer contains cards (key-value pairs) in alphabetical order
When you add new books, you create a new drawer (don't modify old ones)
Periodically, you merge drawers to consolidate and remove duplicates
To find a book, you check drawers newest to oldest

Git Commits¶

SSTables are like git commits:

Each commit (SSTable) is immutable
New commits don't change old ones
Occasionally, you rebase/compact to clean up history
To get current state, you merge all commits

Benefits of Immutability¶

SSTables are write-once, never modified. This provides:

1. Simple Concurrency¶

// Multiple readers, no locks needed!
let reader = Arc::new(SSTableReader::open("data.sst").await?);

tokio::spawn({
    let r = reader.clone();
    async move { r.get(b"key1").await }
});
tokio::spawn({
    let r = reader.clone();
    async move { r.get(b"key2").await }
});

No locks, no coordination, no reader-writer conflicts.

2. Easy Caching¶

Block 15 in cache? Yes
├─ It will NEVER change
└─ Cache it forever (until evicted)

3. Simple Crash Recovery¶

SSTable file exists?
├─ Yes Footer checksum valid → File is good
└─ No Checksum invalid → Discard (write failed mid-creation)

4. Snapshot Isolation¶

Snapshot = "Read from SSTable set at time T"
├─ SSTables never change
├─ Just keep references alive
└─ Perfect MVCC (multi-version concurrency control)

Common Misconceptions¶

"SSTables are slow for reads"¶

Reality: With bloom filters and caching, SSTables can achieve < 10µs reads for hot data. Bloom filters skip 99%+ of unnecessary disk reads.

"SSTables waste space with duplicates"¶

Reality: Compaction merges SSTables and removes old versions. Space amplification is typically only 1.1-1.5x.

"Compaction causes unpredictable latency"¶

Reality: Modern LSM engines use rate limiting and tiered compaction to keep p99 latency under SLO.

When to Use SSTables¶

Great Fit¶

Write-heavy workloads (logging, time-series, events)
Append-mostly data (few updates/deletes)
Range scans (sequential iteration)
Hot key patterns (caching helps a lot)
Large datasets that don't fit in memory

Not the Right Tool¶

Ultra-low latency (< 1µs) requirements
Random updates to same keys (causes many tombstones)
Tiny datasets (< 1MB) where overhead dominates
No compaction budget (need background I/O for merges)

Next Steps¶

Understand immutability: Read about Immutability and why it's central to SSTable design.

Learn about organization: See Block-Based Storage for how data is structured.

Optimize reads: Check out Bloom Filters to understand how we avoid disk I/O.

Dive into implementation: Jump to How It Works for file format details.