Immutability¶
Why SSTables are write-once, immutable files and the profound benefits this provides.
The Core Principle¶
Once an SSTable is written, it never changes.
This single design decision cascades into numerous benefits: - Lock-free concurrent reads - Simple caching semantics - Crash-safe by design - Perfect snapshot isolation - Zero write amplification on reads
Why Immutability?¶
Problem: Mutable Files Require Coordination¶
In a traditional B-tree database:
Writer updates page 42
↓
1. Read page 42 from disk
2. Modify in memory
3. Write back to disk
4. Flush to ensure durability
Issues: - Readers need locks (can't read page being written) - Cache invalidation (did page change?) - Crash during write = corrupted page - Write amplification (read-modify-write)
Solution: Write New, Never Modify¶
In an SSTable-based system:
New data arrives
↓
1. Accumulate in memtable (RAM)
2. When full, write NEW SSTable
3. Old SSTables unchanged
4. Background: merge old SSTables → new SSTable
Benefits: - Readers never block (old files still valid) - Cache forever (data never changes) - Crash = incomplete file, just delete it - Pure sequential writes
Concurrency Benefits¶
Lock-Free Reads¶
// Thread 1
let reader1 = Arc::new(SSTableReader::open("v1.sst").await?);
reader1.get(b"key").await?;
// Thread 2
let reader2 = reader1.clone(); // Just clone Arc
reader2.get(b"other_key").await?; // No locks!
// Thread 3
let reader3 = reader1.clone();
reader3.iter().await?; // Still no locks!
No mutexes, no RwLocks, no coordination overhead.
Simple Atomic Updates¶
// Update "database" by atomically swapping file list
struct DB {
sstables: Arc<RwLock<Vec<Arc<SSTableReader>>>>,
}
impl DB {
async fn compact(&self) {
// 1. Create new merged SSTable
let new_sst = compact([old1, old2, old3]).await?;
// 2. Atomically update list
let mut tables = self.sstables.write();
tables.retain(|sst| sst.path != old1.path && ...);
tables.push(Arc::new(new_sst));
// 3. Old readers still see old files (Arc keeps them alive)
// 4. New readers see new files
}
}
Caching Benefits¶
Forever Caching¶
// Cache a block from SSTable
cache.insert(
key: (sst_id, block_index),
value: decompressed_block,
ttl: FOREVER // Block will NEVER change!
);
No cache invalidation logic needed. Block 15 from file_42.sst will always contain the same data.
Multi-Level Caching¶
All levels can cache aggressively because data never changes.
Crash Safety¶
Atomic File Creation¶
Write SSTable:
1. Write data blocks → temp_file.sst.tmp
2. Write index → temp_file.sst.tmp
3. Write bloom filter → temp_file.sst.tmp
4. Write footer + CRC → temp_file.sst.tmp
5. fsync()
6. rename() to final.sst ← Atomic!
If crash occurs: - Before rename: temp file deleted, no corruption - After rename: file complete and valid
Validation on Open¶
impl SSTableReader {
pub async fn open(path: PathBuf) -> Result<Self> {
let file = File::open(path).await?;
// Read footer
let footer = read_footer(&file).await?;
// Validate CRC
if !footer.validate_crc() {
return Err(Error::CorruptedFile);
}
// File is valid!
Ok(Self { file, footer, ... })
}
}
If corruption detected, just delete the file. No recovery needed.
Snapshot Isolation (MVCC)¶
Perfect Snapshots¶
struct Snapshot {
sstables: Vec<Arc<SSTableReader>>, // Immutable!
timestamp: u64,
}
// Create snapshot
let snap = db.snapshot();
// Later... database changed, but snapshot unchanged
snap.get(b"key").await?; // Sees data as of snapshot time
SSTables are immutable → snapshot is just a list of Arc references.
Time-Travel Queries¶
// Read data as of 1 hour ago
let snapshot = db.snapshot_at(now - 3600);
snapshot.scan(b"user:", b"user:~").await?;
Possible because old SSTables are kept until no longer referenced.
Deletion Handling: Tombstones¶
Problem: How to Delete if Immutable?¶
Solution: Write a tombstone (delete marker)
// Delete "user:42"
memtable.insert(b"user:42", Entry::tombstone());
// When memtable flushes
sstable.write(Entry {
key: b"user:42",
value: b"",
tombstone: true, // ← Marker
});
Tombstone Lifecycle¶
1. Delete request → Tombstone in memtable
2. Flush → Tombstone in L0 SSTable
3. Read "user:42" → Sees tombstone first (newest) → Return None
4. Compaction → Merges tombstone with older live value → Removed
Tombstone shadows all older versions during reads.
Compaction: Making Immutability Practical¶
The Accumulation Problem¶
Day 1: SST1 (1MB)
Day 2: SST1, SST2 (2MB total)
Day 3: SST1, SST2, SST3 (3MB total)
...
Day 100: SST1...SST100 (100MB total)
Problem: Reads must check 100 files!
Solution: Background Compaction¶
// Merge multiple SSTables into one
async fn compact(inputs: Vec<SSTableReader>) -> SSTable {
let mut builder = SSTableBuilder::new(config).await?;
// Merge-sort all inputs
let mut heap = MergeHeap::new();
for sst in inputs {
heap.push(sst.iter());
}
while let Some((key, value, tombstone)) = heap.pop() {
// Take newest version of each key
if !already_seen(key) {
builder.add(&Entry { key, value, tombstone }).await?;
}
}
builder.finish().await
}
Creates new SSTable with merged data. Old SSTables deleted once unreferenced.
Update Handling¶
Problem: How to Update a Key?¶
Solution: Write the new value. Newest version wins.
Time 0: Write "user:42" = "alice" → SST1
Time 1: Write "user:42" = "bob" → SST2
Read "user:42":
Check SST2 (newer) → "bob" Yes (return this)
Check SST1 (older) → "alice" (ignored)
Space Amplification¶
Same key updated N times = N copies on disk
↓
Compaction merges → Keep only newest
↓
Space amplification = (data_on_disk / live_data)
Typical: 1.1-1.5x with regular compaction
Comparison: Mutable vs Immutable¶
| Aspect | Mutable (B-tree) | Immutable (SSTable) |
|---|---|---|
| Concurrency | Locks required | Lock-free reads |
| Caching | Invalidation logic | Cache forever |
| Crash recovery | Complex (WAL replay) | Simple (validate CRC) |
| Snapshots | Copy-on-write | Free (just Arc refs) |
| Write pattern | Random (slow) | Sequential (fast) |
| Space overhead | Low | Medium (until compaction) |
| Compaction | Not needed | Required |
Trade-Offs¶
Advantages¶
- Write throughput: 10-100x faster than random writes
- Concurrency: Lock-free, scales to many readers
- Crash safety: Atomic, no corruption possible
- Snapshots: Zero-cost MVCC
- Caching: Aggressive caching with no invalidation
Disadvantages¶
- Space amplification: Old versions accumulate until compaction
- Read amplification: Must check multiple files
- Compaction cost: Background I/O and CPU
- Latency spikes: During compaction (can be rate-limited)
Best Practices¶
1. Regular Compaction¶
// Schedule compaction to keep read amplification low
tokio::spawn(async {
loop {
tokio::time::sleep(Duration::from_secs(3600)).await;
db.compact().await?;
}
});
2. Reference Counting for Deletion¶
// Only delete SSTables when all readers done
struct SSTable {
reader: Arc<SSTableReader>, // Ref-counted
}
// When compaction creates new file:
// 1. Add new SSTable to list
// 2. Remove old SSTables from list
// 3. Old files deleted when Arc count → 0
3. Bounded Retention¶
// Don't keep infinite history
db.compact_with_policy(CompactionPolicy {
max_versions: 1, // Keep only latest version
tombstone_ttl: Duration::from_days(7), // Drop old tombstones
});
Real-World Analogy¶
Think of version control (like Git):
- Each commit is immutable
- New changes = new commit (don't modify old ones)
- To see current state, merge all commits
- Periodically, you rebase/squash to clean up history
- Old commits stay around until unreferenced
SSTables are like commits for your database.
Next Steps¶
Learn about organization: See Block-Based Storage for how immutable data is structured.
Explore implementation: Jump to File Format to see how immutability is enforced.