Append-Only Architecture¶
Why WALs are append-only, what that means for your application, and the trade-offs involved.
Table of contents¶
What Does "Append-Only" Mean?¶
An append-only data structure is one where you can only add new data at the end. You cannot:
- Modify existing data in place
- Delete data in place
- Insert data in the middle
You can only append to the end.
┌─────────────────────────────────────────┐
│ [Record 1][Record 2][Record 3] │ ← Can only append here
└─────────────────────────────────────────┘
↑
Cannot modify or delete these!
In nori-wal, this means: - New records are always written at the end of the current segment - Old records are never modified - Deletions are represented as tombstone records (append a delete marker) - Updates are represented as new versions (append a new PUT for the same key)
Why Append-Only?¶
Append-only seems limiting at first. Why not just update records in place?
Reason 1: Sequential I/O is Fast¶
Modern storage (SSDs, HDDs, NVMe) is orders of magnitude faster for sequential writes than random writes.
Hard Disk Drives (HDD): - Sequential write: ~100-200 MB/s - Random write: ~1-5 MB/s (40x slower!)
Solid State Drives (SSD): - Sequential write: ~500-3000 MB/s - Random write: ~50-300 MB/s (10x slower) - Plus: Random writes cause write amplification, shortening SSD lifespan
NVMe SSD: - Sequential write: ~3000-7000 MB/s - Random write: ~500-2000 MB/s (still slower!)
By appending to a log, we only do sequential writes. This is the fastest possible I/O pattern.
graph LR
A[Application] -->|append| B[Log File]
B -->|sequential writes| C[Disk]
style C fill:#90EE90
Reason 2: No Partial Write Corruption¶
When you write to a file in place, you can corrupt data if you crash mid-write:
In-Place Update (BAD):
Before: [Record A: "user:1=alice"]
Writing: [Record A: "user:1=bob--"] ← CRASH!
After: [Record A: "user:1=bob--"] ← Corrupted!
You now have a partially written record. Is the user "bob" or "alice"? You don't know.
Append-Only (GOOD):
Before: [Record 1: "user:1=alice"][empty space]
Writing: [Record 1: "user:1=alice"][Record 2: "user:1=bob--"] ← CRASH!
After: [Record 1: "user:1=alice"][partial garbage]
Recovery: Scan log, validate CRCs
- Record 1: Yes Valid CRC → Keep
- Record 2: No Invalid CRC → Truncate
Final: [Record 1: "user:1=alice"]
With append-only, you can never corrupt old data. The worst that can happen is you have a partial record at the tail, which you detect with CRC32C and truncate.
Reason 3: Simple and Correct¶
Append-only is conceptually simple. There's no complex logic for:
- Managing free space within a file
- Defragmenting after deletes
- Handling variable-length records
- Ensuring atomic in-place updates
The code is straightforward:
// Append-only: Simple!
async fn append(&mut self, record: &Record) -> Result<Position> {
let position = self.current_offset;
let bytes = serialize(record)?;
self.file.write_all(&bytes).await?;
self.current_offset += bytes.len() as u64;
Ok(position)
}
Compare this to the complexity of in-place updates:
// In-place update: Complex!
async fn update_in_place(&mut self, position: u64, new_record: &Record) -> Result<()> {
let new_bytes = serialize(new_record)?;
let old_bytes = self.read_at(position).await?;
// What if new record is larger than old?
if new_bytes.len() > old_bytes.len() {
// Need to relocate! But where?
// Need to update all pointers to this record!
// What if we crash during relocation?
// ...this is getting complicated
}
// What if new record is smaller?
if new_bytes.len() < old_bytes.len() {
// Fragmentation! Need a free list? Compaction?
}
// Atomic update: need double-buffering or copy-on-write?
// Error handling is now much harder...
}
Simple code is correct code.
Reason 4: Crash Recovery is Easier¶
With append-only, recovery is a simple scan:
async fn recover(&mut self) -> Result<RecoveryInfo> {
let mut position = 0;
let mut valid_records = 0;
loop {
// Try to read a record at this position
match self.read_record_at(position).await {
Ok((record, size)) => {
// Valid record, keep it
valid_records += 1;
position += size;
}
Err(_) => {
// Invalid record (partial write or corruption)
// Truncate everything after this point
self.truncate(position).await?;
break;
}
}
}
Ok(RecoveryInfo {
valid_records,
bytes_truncated: self.file_size - position,
})
}
This is called "prefix-valid" recovery: keep all valid records from the beginning until we hit corruption, then truncate the rest.
With in-place updates, recovery is much harder: - Which version of a record is correct? - How do you know if an update completed? - Do you need a transaction log? (Isn't that just... a WAL?)
How Append-Only Affects Your Application¶
If nori-wal is append-only, how do you handle updates and deletes?
Updates: Write a New Version¶
When you update a key, you append a new record with the same key:
// Initial write
let record = Record::put(b"user:1", b"alice@example.com");
wal.append(&record).await?;
// Update: append a new version
let record = Record::put(b"user:1", b"bob@example.com"); // Same key!
wal.append(&record).await?;
On disk:
During recovery, you keep the latest version:
let mut state = HashMap::new();
while let Some((record, _)) = reader.next_record().await? {
// Later records shadow earlier ones
state.insert(record.key, record.value);
}
// Result: user:1 → bob@example.com
Deletes: Write a Tombstone¶
When you delete a key, you append a tombstone record:
A tombstone is a record with the tombstone flag set and no value:
[Record 1: user:1=alice@example.com]
[Record 2: user:1=bob@example.com]
[Record 3: user:1=(tombstone)] ← Deleted
During recovery:
while let Some((record, _)) = reader.next_record().await? {
if record.tombstone {
state.remove(&record.key); // Delete from state
} else {
state.insert(record.key, record.value);
}
}
// Result: user:1 does not exist
The Cost: Garbage Collection¶
Append-only has a major downside: the log grows forever.
Every update and delete appends a new record, but the old records are still on disk:
[user:1=v1][user:1=v2][user:1=v3][user:2=v1][user:1=(deleted)]
↑ ↑ ↑ ↑
Old Old Old Tombstone
Actual state: user:2=v1
Wasted space: 80% of the log!
You need a garbage collection (GC) strategy to reclaim space.
Strategy 1: Delete Old Segments¶
If you only care about recent data (e.g., event logs with a retention policy), you can delete entire old segments:
// Keep only last 7 days of data
let cutoff = Position {
segment_id: old_segment_id,
offset: 0,
};
let deleted = wal.delete_segments_before(cutoff).await?;
This is simple and works well for append-only workloads (logs, events, time-series).
Strategy 2: Compaction¶
If you need to keep all keys but remove old versions, you need compaction:
Before compaction:
Segment 0: [k1=v1][k2=v1][k1=v2][k3=v1]
Segment 1: [k2=v2][k1=v3][k4=v1]
After compaction:
Segment 2 (compacted): [k1=v3][k2=v2][k3=v1][k4=v1]
Segments 0, 1: deleted
Compaction: 1. Reads multiple old segments 2. Merges records, keeping only the latest version of each key 3. Writes a new compacted segment 4. Deletes the old segments
This is what LSM-tree databases (RocksDB, Cassandra, nori-lsm) do.
Important: nori-wal itself does not implement compaction. It provides the primitives (delete_segments_before()) but you decide when and how to compact.
See the Recipes section for implementation examples.
Strategy 3: Snapshotting¶
If your state is relatively small, you can:
- Periodically write a snapshot of your entire state to a separate file
- Delete all WAL segments before the snapshot
- On recovery: Load snapshot, then replay WAL from the snapshot point
Snapshot at position 1000: {k1=v3, k2=v2, k3=v1}
WAL segments before position 1000: DELETE
WAL segments after position 1000: KEEP (replay these on recovery)
This is what databases like Redis and PostgreSQL do.
Append-Only in Practice¶
Let's see how append-only affects a real application.
Example: Building a Simple KV Store¶
use std::collections::HashMap;
use nori_wal::{Wal, WalConfig, Record, Position};
struct SimpleKV {
wal: Wal,
memtable: HashMap<Vec<u8>, Vec<u8>>,
}
impl SimpleKV {
async fn open(config: WalConfig) -> Result<Self> {
let (wal, recovery_info) = Wal::open(config).await?;
// Replay WAL to build in-memory state
let mut memtable = HashMap::new();
let mut reader = wal.read_from(Position { segment_id: 0, offset: 0 }).await?;
while let Some((record, _)) = reader.next_record().await? {
if record.tombstone {
memtable.remove(&record.key);
} else {
memtable.insert(record.key, record.value);
}
}
Ok(Self { wal, memtable })
}
async fn put(&mut self, key: &[u8], value: &[u8]) -> Result<()> {
// Write to WAL first (durability)
let record = Record::put(key, value);
self.wal.append(&record).await?;
self.wal.sync().await?;
// Update in-memory state
self.memtable.insert(key.to_vec(), value.to_vec());
Ok(())
}
async fn delete(&mut self, key: &[u8]) -> Result<()> {
// Write tombstone to WAL
let record = Record::delete(key);
self.wal.append(&record).await?;
self.wal.sync().await?;
// Update in-memory state
self.memtable.remove(key);
Ok(())
}
fn get(&self, key: &[u8]) -> Option<&[u8]> {
// Read from in-memory state (NOT the WAL!)
self.memtable.get(key).map(|v| v.as_slice())
}
}
Key points: 1. Writes append to the WAL, then update the memtable 2. Deletes append a tombstone, then remove from memtable 3. Reads come from the memtable, not the WAL 4. Recovery replays the WAL to reconstruct the memtable
This is essentially how LevelDB, RocksDB, and nori-lsm work!
Growing Disk Usage¶
As you use this KV store, the WAL grows:
After 1000 writes:
wal/000000.wal: 50 KB
After 10000 writes:
wal/000000.wal: 500 KB
After 100000 writes:
wal/000000.wal: 5 MB
After 1M writes (many updates to same keys):
wal/000000.wal: 50 MB (but only 10K unique keys!)
At some point, you need to compact. For example:
impl SimpleKV {
async fn compact(&mut self) -> Result<()> {
// 1. Write a snapshot of current state
let snapshot_file = File::create("snapshot.dat").await?;
for (key, value) in &self.memtable {
snapshot_file.write_all(key).await?;
snapshot_file.write_all(value).await?;
}
snapshot_file.sync_all().await?;
// 2. Create a new WAL, abandoning the old one
let new_config = WalConfig {
dir: PathBuf::from("wal-new"),
..Default::default()
};
let (new_wal, _) = Wal::open(new_config).await?;
// 3. Swap to new WAL
self.wal = new_wal;
// 4. Delete old WAL directory
std::fs::remove_dir_all("wal")?;
std::fs::rename("wal-new", "wal")?;
Ok(())
}
}
This is a simple snapshot-based compaction strategy.
Append-Only vs. In-Place Updates: The Trade-Off¶
Let's compare the two approaches:
| Aspect | Append-Only | In-Place Updates |
|---|---|---|
| Write performance | ⭐⭐⭐⭐⭐ Fast (sequential I/O) | ⭐⭐⭐ Slower (random I/O) |
| Read performance | ⭐⭐⭐⭐ Fast (if indexed) | ⭐⭐⭐⭐ Fast |
| Crash safety | ⭐⭐⭐⭐⭐ Excellent (prefix-valid recovery) | ⭐⭐ Complex (need transactions) |
| Disk usage | ⭐⭐ Grows forever (need GC) | ⭐⭐⭐⭐ Stable |
| Code complexity | ⭐⭐⭐⭐⭐ Simple | ⭐⭐ Complex |
| SSD wear | ⭐⭐⭐⭐⭐ Minimal (sequential) | ⭐⭐ Higher (random writes) |
Append-only is better for: - High write throughput - Simplicity and correctness - Crash recovery - SSD longevity
In-place updates are better for: - Stable disk usage - Workloads with very few updates
For most modern applications, append-only wins.
Common Misconceptions¶
"Append-only wastes disk space"¶
Response: Only if you don't implement garbage collection. With compaction or snapshotting, disk usage is proportional to your actual data size, not write volume.
"Append-only is slow because you have to scan the entire log"¶
Response: You don't read from the WAL during normal operation. You read from an in-memory index (memtable, B-tree, hash table). The WAL is only scanned during recovery.
"Tombstones are leaked forever"¶
Response: No. During compaction, you can drop tombstones that shadow older versions. Once there are no older versions below a tombstone, it's safe to remove.
Before compaction:
Level 0: [k1=(tombstone)]
Level 1: [k1=v1]
After compaction to Level 2:
Level 2: (k1 completely removed)
Key Takeaways¶
- Append-only means you only add data at the end, never modify in place
- Updates → append a new version
-
Deletes → append a tombstone
-
Append-only is fast and simple
- Sequential I/O is 10-40x faster than random I/O
- No partial write corruption
-
Simple recovery: scan and validate CRCs
-
The cost is growing disk usage
- You need a garbage collection strategy
-
Options: delete old segments, compaction, snapshotting
-
Don't read from the WAL during normal operation
- Use an in-memory index (memtable, hash table, B-tree)
-
WAL is for durability and recovery, not queries
-
Append-only is the foundation of modern storage systems
- LSM-trees (RocksDB, LevelDB, Cassandra, nori-lsm)
- Event sourcing
- Distributed logs (Kafka, Raft)
What's Next?¶
Now that you understand append-only architecture, explore:
- Fsync Policies - How to balance durability and performance
- Recovery Guarantees - What happens after a crash
- When to Use a WAL - Scenarios where WALs shine
Or dive into implementation details in How It Works.