What is a Write-Ahead Log?¶
The fundamental concept behind WALs, why they exist, and how they power modern systems.
Table of contents¶
The Core Idea¶
A write-ahead log (WAL) is a simple but powerful concept:
Write changes to a durable log before modifying your main data structures.
That's it. That's the entire concept.
But this simple idea solves one of the hardest problems in computing: making changes durable without sacrificing performance.
Why WALs Exist¶
Imagine you're building a database. You have a nice in-memory data structure (maybe a B-tree or hash table) that's blazingly fast. But there's a problem:
What happens when the power goes out?
Your in-memory structure is gone. All data lost.
The Naive Solution: Sync After Every Change¶
// Don't do this!
fn set(key: &str, value: &str) {
data_structure.insert(key, value);
// Write the ENTIRE data structure to disk
write_entire_tree_to_disk(&data_structure);
fsync(); // Force to physical disk
}
Problems: - Slow: Writing a 10GB B-tree to disk for a 10-byte update is absurd - Complex: How do you write a tree to disk atomically? What if you crash mid-write? - Fragmented: Random updates → random disk writes → terrible performance
The Better Solution: Write-Ahead Logging¶
// Much better!
fn set(key: &str, value: &str) {
// 1. Append change to log (sequential, fast)
wal.append(Record::put(key, value));
wal.sync(); // Just sync the log (cheap!)
// 2. Update in-memory structure (fast)
data_structure.insert(key, value);
// Done! If we crash, we can replay the log to rebuild the data structure.
}
Benefits:
- Fast: Appending to a log is sequential I/O (the fastest kind)
- Simple: No complex tree serialization
- Durable: After sync(), the change is on disk, guaranteed
- Recoverable: Replay the log after a crash to rebuild state
This is the insight that powers virtually every database you've ever used.
Historical Context¶
The write-ahead log is one of the oldest ideas in database systems. It dates back to the 1970s when researchers were figuring out how to make databases reliable.
Key Milestones¶
1976 - IBM System R introduces WAL for ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) - First formal recovery algorithm using WAL - Still the foundation for many modern databases
1980s - WAL becomes standard in ACID databases - PostgreSQL, Oracle, DB2 all adopt WAL-based recovery - "Write-ahead" becomes a fundamental durability guarantee
1990s-2000s - WAL spreads beyond traditional databases - Journaling filesystems (ext3, NTFS) use WAL concepts - Message queues (Kafka) built entirely on logs
2010s-Present - "Log is the database" - Event sourcing: the log is the primary data structure - Distributed consensus (Raft, Multi-Paxos) built on replicated logs - Modern KV stores (RocksDB, LevelDB) use WAL + LSM trees
How WALs Work (High Level)¶
Here's the lifecycle of a write in a WAL-based system:
sequenceDiagram
participant App as Application
participant Mem as In-Memory State
participant WAL as Write-Ahead Log
participant Disk as Disk Storage
App->>WAL: 1. Append record
WAL->>Disk: 2. Write to log file
WAL->>Disk: 3. fsync() - force to disk
WAL-->>App: 4. Return success
App->>Mem: 5. Update in-memory state
Note over WAL,Disk: Change is now durable!
Note over Mem: In-memory update happens AFTER durability
The Key Invariant¶
At all times, one of these is true:
- Change is in the log on disk → Can recover it
- Change hasn't been acknowledged to the user → User will retry
This is how WALs provide durability while keeping writes fast.
WAL in Different Systems¶
The WAL pattern appears everywhere, but with different flavors:
Traditional Databases (PostgreSQL, MySQL)¶
[WAL Segment 1] [WAL Segment 2] [WAL Segment 3]
↓
On checkpoint: Flush pages to main database files
After flush: Delete old WAL segments
Purpose: Recover in-memory page cache after crash
Key operations:
- INSERT/UPDATE/DELETE → Write to WAL first, then modify pages
- Crash → Replay WAL to reconstruct page cache
- Checkpoint → Flush dirty pages, truncate old WAL
Log-Structured Storage (RocksDB, Cassandra, nori-wal)¶
Purpose: Durability for in-memory write buffer (memtable)
Key operations: - Write → Append to WAL + insert to memtable - Memtable full → Flush to SSTable on disk - Crash → Replay WAL to reconstruct memtable
Message Queues (Kafka)¶
[Log Segment 0] [Log Segment 1] [Log Segment 2] ...
↓ ↓ ↓
Consumers read from offsets (positions in log)
Purpose: The log is the primary data structure
Key operations: - Produce → Append to log - Consume → Read from offset, advance offset - Retention → Delete old segments after time/size threshold
Distributed Consensus (Raft, Multi-Paxos)¶
Leader: [Entry 1] [Entry 2] [Entry 3] [Entry 4]
Yes Yes Yes (uncommitted)
Replicate entries to followers →
When majority has entry → Commit it
Purpose: Replicate state machine transitions
Key operations: - Command → Leader appends to log - Replicate → Send to followers - Commit → When majority has it - Apply → Execute against state machine
The "Write-Ahead" Guarantee¶
The term "write-ahead" is crucial. It means:
The log must be written to durable storage BEFORE acknowledging the operation.
Example: Bank Transfer¶
// CORRECT: Write-ahead
async fn transfer(from: u64, to: u64, amount: u64) -> Result<()> {
// 1. Log the operation FIRST
let record = Record::put(
b"transfer",
format!("{}→{}: ${}", from, to, amount).as_bytes()
);
wal.append(&record).await?;
wal.sync().await?; // CRITICAL: Must sync before continuing!
// 2. Now update in-memory balances
balances.get_mut(&from).unwrap() -= amount;
balances.get_mut(&to).unwrap() += amount;
// 3. Acknowledge to user
Ok(())
}
If we crash after step 1 (log sync) but before step 2 (in-memory update), that's fine! We can replay the log on recovery and redo the transfer.
Violation: Write-Behind (WRONG)¶
// WRONG: Write-behind (don't do this!)
async fn transfer_wrong(from: u64, to: u64, amount: u64) -> Result<()> {
// 1. Update in-memory first
balances.get_mut(&from).unwrap() -= amount;
balances.get_mut(&to).unwrap() += amount;
// 2. Acknowledge to user
println!("Transfer complete!");
// 3. Log it later (WRONG!)
wal.append(&record).await?;
// If we crash here, the transfer is LOST!
// But we already told the user it succeeded!
}
Never do this. Always write to the log before acknowledging success.
What WALs Are NOT¶
It's helpful to clarify what WALs are not:
NOT a Replacement for Your Main Data Structure¶
The WAL is not where you read from during normal operation. It's a recovery mechanism.
// BAD: Scanning the entire WAL to find a key
async fn get_bad(key: &[u8]) -> Option<Vec<u8>> {
let mut reader = wal.read_from(Position::start()).await?;
let mut latest = None;
// This is O(log size) - terrible!
while let Some((record, _)) = reader.next_record().await? {
if record.key == key {
latest = Some(record.value);
}
}
latest
}
// GOOD: Use an index (memtable, SSTable, B-tree, etc.)
async fn get_good(key: &[u8]) -> Option<Vec<u8>> {
// O(log n) or O(1) lookup
memtable.get(key).or_else(|| sstables.get(key))
}
Use the WAL for: - Durability - Recovery - Replication
Don't use the WAL for: - Point reads - Range queries - Indexes
NOT Always the Fastest Storage¶
For some workloads, a WAL might be slower than direct writes:
WAL is fast when: - Sequential writes (appending) - Small writes (kilobytes) - Need durability guarantees
WAL might be slower when: - Need random reads (use an index) - Large batch writes (direct file write might be faster) - Don't care about durability (can skip fsync)
NOT a Silver Bullet¶
WALs solve durability, but they don't solve:
- Concurrency control (you need locks, MVCC, or STM)
- Query optimization (you need indexes)
- Distributed consensus (you need Raft/Paxos on top of WAL)
- Schema migrations (you need application-level logic)
A WAL is one tool in your toolbox, not a complete database.
Why nori-wal Exists¶
There are many WAL implementations out there. Why build another one?
Existing Options¶
RocksDB WAL: Great, but tightly coupled to RocksDB. Can't use it standalone.
Kafka: Designed for message queues, not general-purpose storage.
PostgreSQL WAL: Internal to PostgreSQL, not reusable.
Custom implementation: Most projects roll their own, often poorly.
What nori-wal Provides¶
Standalone: Use it in any Rust project, no dependencies on a larger system
Production-ready: Comprehensive error handling, recovery, observability
Flexible: Choose your durability vs. performance trade-off (fsync policies)
Well-documented: You're reading it!
Tested: 37 tests including property tests for crash scenarios
Fast: 110K writes/sec with batched fsync (see benchmarks)
Real-World Use Cases¶
Here's where you'd actually use a WAL:
1. Building a Key-Value Store¶
struct KvStore {
memtable: BTreeMap<Vec<u8>, Vec<u8>>,
wal: Wal,
}
impl KvStore {
async fn put(&mut self, key: &[u8], value: &[u8]) -> Result<()> {
// Write-ahead: log first
let record = Record::put(key, value);
self.wal.append(&record).await?;
self.wal.sync().await?;
// Then update memtable
self.memtable.insert(key.to_vec(), value.to_vec());
Ok(())
}
async fn recover(&mut self) -> Result<()> {
// Replay WAL to reconstruct memtable
let mut reader = self.wal.read_from(Position::start()).await?;
while let Some((record, _)) = reader.next_record().await? {
if record.tombstone {
self.memtable.remove(&record.key);
} else {
self.memtable.insert(record.key, record.value);
}
}
Ok(())
}
}
See Recipes: Key-Value Store for a complete example.
2. Event Sourcing¶
struct EventStore {
wal: Wal,
}
impl EventStore {
async fn append_event(&self, event: &Event) -> Result<Position> {
let record = Record::put(
event.aggregate_id.as_bytes(),
serde_json::to_vec(event)?
);
let pos = self.wal.append(&record).await?;
Ok(pos)
}
async fn replay(&self, aggregate_id: &str) -> Result<Vec<Event>> {
let mut events = vec![];
let mut reader = self.wal.read_from(Position::start()).await?;
while let Some((record, _)) = reader.next_record().await? {
if record.key == aggregate_id.as_bytes() {
let event: Event = serde_json::from_slice(&record.value)?;
events.push(event);
}
}
Ok(events)
}
}
See Recipes: Event Sourcing for details.
3. Replication¶
struct ReplicatedLog {
wal: Wal,
}
impl ReplicatedLog {
async fn replicate_to_follower(&self, follower: &mut TcpStream) -> Result<()> {
let mut reader = self.wal.read_from(Position::start()).await?;
// Stream entire log to follower
while let Some((record, position)) = reader.next_record().await? {
follower.write_all(&serialize(&record)?).await?;
}
Ok(())
}
}
This is how Raft and other consensus protocols work under the hood!
Key Takeaways¶
Before moving on, make sure you understand these core points:
- WALs provide durability without sacrificing write performance
- Sequential appends are fast
-
Only need to sync the log, not the entire data structure
-
Write-ahead means log BEFORE acknowledgment
- Never acknowledge success until the log is synced
-
If you crash after logging, you can recover
-
The log is not your primary data structure
- Use it for recovery, not for reads
-
Build indexes (memtables, B-trees, hash tables) for queries
-
WALs are everywhere
- Databases, filesystems, message queues, consensus protocols
-
Understanding WALs helps you understand modern distributed systems
-
nori-wal is a production-ready, standalone WAL implementation
- Use it to add durability to your Rust projects
- Flexible, tested, documented
What's Next?¶
Now that you understand what a WAL is, let's explore the key design decisions:
- Append-Only Architecture - Why WALs are append-only and what that means
- Fsync Policies - How to balance durability and performance
- Recovery Guarantees - What happens after a crash
- When to Use a WAL - Scenarios where WALs shine
Or jump ahead to see how nori-wal works internally in How It Works.