Recovery Process¶
Detailed walkthrough of the recovery algorithm: scanning, validation, truncation, and rebuild.
Table of contents¶
Recovery Overview¶
When you open a WAL, nori-wal automatically recovers by scanning all segment files, validating records, and truncating any corruption.
This process implements a prefix-valid recovery strategy:
Keep all valid records from the beginning until corruption is detected, then truncate.
graph TD
A[Open WAL] --> B[Discover Segment Files]
B --> C[Sort Segments by ID]
C --> D[For Each Segment]
D --> E[Read Entire Segment into Memory]
E --> F[Scan Records Sequentially]
F --> G{Valid Record?}
G -->|Yes| H[Count as Valid]
H --> I{More Data?}
I -->|Yes| F
I -->|No| J[Segment Complete]
G -->|No - CRC Mismatch| K[Truncate at This Position]
G -->|No - Incomplete| K
K --> L[Emit Corruption Event]
L --> J
J --> M{More Segments?}
M -->|Yes| D
M -->|No| N[Return RecoveryInfo]
style H fill:#90EE90
style K fill:#FFB6C1
style N fill:#90EE90
RecoveryInfo Structure¶
Recovery returns comprehensive information about what was found:
pub struct RecoveryInfo {
/// Total number of valid records recovered.
pub valid_records: u64,
/// Number of segments scanned.
pub segments_scanned: u64,
/// Total bytes truncated due to corruption.
pub bytes_truncated: u64,
/// Position of the last valid record.
pub last_valid_position: Option<Position>,
/// Whether any corruption was detected and truncated.
pub corruption_detected: bool,
}
Example:
let (wal, recovery_info) = Wal::open(config).await?;
println!("Recovered {} records from {} segments",
recovery_info.valid_records,
recovery_info.segments_scanned);
if recovery_info.corruption_detected {
log::warn!("Corruption detected! Truncated {} bytes",
recovery_info.bytes_truncated);
}
Step-by-Step Recovery Algorithm¶
Step 1: Discover Segment Files¶
Scan the WAL directory for .wal files:
async fn find_all_segments(dir: &Path) -> Result<Vec<u64>, SegmentError> {
let mut entries = tokio::fs::read_dir(dir).await?;
let mut segment_ids = Vec::new();
while let Some(entry) = entries.next_entry().await? {
if let Some(id) = parse_segment_id_from_path(&entry.path()) {
segment_ids.push(id);
}
}
Ok(segment_ids)
}
fn parse_segment_id_from_path(path: &Path) -> Option<u64> {
// Check extension is "wal"
if path.extension()?.to_str()? != "wal" {
return None;
}
// Parse filename stem as u64
let stem_str = path.file_stem()?.to_str()?;
stem_str.parse::<u64>().ok()
}
Example:
wal/
000000.wal → segment_id = 0
000001.wal → segment_id = 1
000005.wal → segment_id = 5
temp.txt → Ignored (not .wal)
README.md → Ignored
Discovered segments: [0, 1, 5]
Step 2: Sort Segments by ID¶
Process segments in order (oldest to newest):
Why sort? - Records must be replayed in order (append-only guarantee) - Later records shadow earlier ones (last write wins) - Recovery must preserve this ordering
Step 3: Recover Each Segment¶
For each segment, validate all records and truncate corruption:
async fn recover_segment(
wal_dir: &Path,
segment_id: u64,
meter: Arc<dyn Meter>,
node_id: u32,
) -> Result<SegmentRecoveryInfo, SegmentError> {
let path = segment_path(wal_dir, segment_id);
let mut file = File::open(&path).await?;
// Read entire segment into memory for fast validation
let file_size = file.metadata().await?.len();
let mut buffer = vec![0u8; file_size as usize];
file.read_exact(&mut buffer).await?;
// Scan for valid records
let (valid_records, last_valid_offset) = scan_valid_records(&buffer, file_size);
let bytes_truncated = file_size - last_valid_offset;
// Truncate if needed
if bytes_truncated > 0 {
truncate_segment_atomically(&path, &buffer, last_valid_offset).await?;
// Emit observability event
meter.emit(VizEvent::Wal(WalEvt {
node: node_id,
seg: segment_id,
kind: WalKind::CorruptionTruncated,
}));
}
Ok(SegmentRecoveryInfo {
valid_records,
bytes_truncated,
last_valid_position: if valid_records > 0 {
Some(Position { segment_id, offset: last_valid_offset })
} else {
None
},
})
}
Step 4: Scan and Validate Records¶
Read records sequentially, validating CRC32C for each:
fn scan_valid_records(buffer: &[u8], file_size: u64) -> (u64, u64) {
let mut offset = 0u64;
let mut valid_records = 0u64;
let mut last_valid_offset = 0u64;
while offset < file_size {
let remaining = &buffer[offset as usize..];
match Record::decode(remaining) {
Ok((_record, size)) => {
// Valid record - keep going
valid_records += 1;
offset += size as u64;
last_valid_offset = offset;
}
Err(RecordError::Incomplete) => {
// Partial record at tail - expected during crash
break;
}
Err(RecordError::CrcMismatch { .. }) | Err(_) => {
// Corruption detected - truncate here
break;
}
}
}
(valid_records, last_valid_offset)
}
Example:
Segment buffer (100 bytes):
[0..20] Record 1 (CRC valid) → valid_records = 1, last_valid_offset = 20
[20..45] Record 2 (CRC valid) → valid_records = 2, last_valid_offset = 45
[45..70] Record 3 (CRC valid) → valid_records = 3, last_valid_offset = 70
[70..85] Record 4 (CRC INVALID) → STOP, truncate at offset 70
[85..100] Garbage → Discarded
Result: (valid_records=3, last_valid_offset=70)
Bytes truncated: 100 - 70 = 30 bytes
Step 5: Atomic Truncation¶
If corruption is detected, truncate the segment atomically:
async fn truncate_segment_atomically(
path: &PathBuf,
buffer: &[u8],
last_valid_offset: u64,
) -> Result<(), SegmentError> {
let temp_path = path.with_extension("wal.tmp");
// 1. Write valid data to temp file
let mut temp_file = OpenOptions::new()
.create(true)
.write(true)
.truncate(true)
.open(&temp_path)
.await?;
temp_file.write_all(&buffer[..last_valid_offset as usize]).await?;
temp_file.sync_all().await?;
drop(temp_file);
// 2. Atomic rename (replaces original file)
tokio::fs::rename(&temp_path, path).await?;
Ok(())
}
Why atomic?
If we crash during truncation: - Without atomic rename: Original file may be partially overwritten (corrupted) - With atomic rename: Either old file survives (untouched) or new file is complete
Implementation:
1. Write valid data to 000000.wal.tmp
2. Sync temp file to disk
3. Rename 000000.wal.tmp → 000000.wal (atomic on all platforms)
Recovery Scenarios¶
Scenario 1: Clean Recovery (No Corruption)¶
// Before crash: wrote 10 records, synced all
[Record 1][Record 2]...[Record 10]
// On recovery:
Scan: All 10 records have valid CRCs
Result: RecoveryInfo {
valid_records: 10,
segments_scanned: 1,
bytes_truncated: 0,
corruption_detected: false,
}
All data recovered
Scenario 2: Partial Write at Tail¶
// Before crash: wrote 10 records, synced 9, writing 10th
[R1][R2]...[R9][R10: partial, only 15 of 30 bytes]
// On recovery:
Scan records 1-9: Valid CRCs → Keep
Scan record 10: RecordError::Incomplete → Truncate here
Result: RecoveryInfo {
valid_records: 9,
bytes_truncated: 15,
corruption_detected: true,
}
Lost record 10 (was never fully written)
Scenario 3: Corrupted Data (CRC Mismatch)¶
// Before crash: corruption in the middle of a record
[R1][R2][R3: corrupted data, CRC mismatch][R4][R5]
// On recovery:
Scan R1-R2: Valid → Keep
Scan R3: CrcMismatch → Truncate here
R4-R5: Never scanned (truncated)
Result: RecoveryInfo {
valid_records: 2,
bytes_truncated: size_of(R3 + R4 + R5),
corruption_detected: true,
}
Lost R3, R4, R5 (everything from first corruption onward)
Scenario 4: Empty Segment¶
// Segment file exists but contains no data
File size: 0 bytes
// On recovery:
No records to scan
Result: RecoveryInfo {
valid_records: 0,
bytes_truncated: 0,
corruption_detected: false,
}
Valid (no data lost, segment is just empty)
Scenario 5: Multiple Segments with Mixed Corruption¶
Segment 0: [R1][R2][R3] (all valid)
Segment 1: [R4][R5: partial]
Segment 2: [R6][R7] (never scanned, segment 1 stopped at R5)
Recovery:
Segment 0: 3 valid records, 0 bytes truncated
Segment 1: 1 valid record (R4), truncate at R5
Segment 2: 2 valid records, 0 bytes truncated
Result: RecoveryInfo {
valid_records: 6 (R1-R4 from seg 0-1, R6-R7 from seg 2),
segments_scanned: 3,
bytes_truncated: size_of(partial R5),
corruption_detected: true,
}
Recovered 6 out of 7 records
Error Handling¶
RecordError Types¶
During recovery, Record::decode() may return:
| Error | Meaning | Action |
|---|---|---|
Ok((record, size)) |
Valid record | Keep scanning |
Err(Incomplete) |
Not enough bytes (partial write) | Truncate here |
Err(CrcMismatch) |
Data corruption | Truncate here |
Err(InvalidCompression) |
Unknown compression type | Truncate here |
Err(DecompressionFailed) |
Corrupted compressed data | Truncate here |
All errors result in truncation (prefix-valid recovery).
I/O Errors¶
match File::open(&path).await {
Ok(file) => { /* proceed */ }
Err(e) if e.kind() == ErrorKind::NotFound => {
// Segment was deleted - skip it
}
Err(e) => {
// Other I/O error - fail recovery
return Err(SegmentError::Io(e));
}
}
Common I/O errors: - NotFound: Segment was deleted (skip it) - PermissionDenied: No read access (fail) - IOError: Disk failure (fail)
Performance Characteristics¶
Recovery Time¶
Recovery speed depends on: 1. Number of segments 2. Total data size 3. Disk speed
Typical performance (SSD):
| Scenario | Segments | Data Size | Recovery Time |
|---|---|---|---|
| Small | 1 | 1 MB | ~5 ms |
| Medium | 10 | 100 MB | ~150 ms |
| Large | 100 | 10 GB | ~15 s |
| Very Large | 1000 | 100 GB | ~150 s |
Bottleneck: Reading data from disk (I/O bound)
Memory Usage¶
Recovery reads entire segments into memory:
Why read entire segment? - Faster than multiple small reads - Simplifies code (no buffering complexity) - 128 MB is acceptable for modern systems
Alternative: Streaming validation (not implemented) - Would use less memory (~1 MB buffer) - Would require more complex code - Would be slower (more syscalls)
Optimization: Parallel Segment Validation¶
Current: Segments validated sequentially
Possible optimization: Validate segments in parallel
// Hypothetical parallel recovery
let tasks: Vec<_> = segments.into_iter()
.map(|id| tokio::spawn(recover_segment(dir, id, meter, node_id)))
.collect();
let results = futures::future::join_all(tasks).await;
Benefits: - 2-4x faster for many segments - Better utilization of I/O bandwidth
Trade-offs: - More complex code - Higher memory usage (multiple segments in RAM) - Not needed for most use cases (recovery is already fast)
Testing Recovery¶
Test 1: Clean Recovery¶
#[tokio::test]
async fn test_recovery_clean() {
// Write 10 records, sync all
let (wal, _) = Wal::open(config).await?;
for i in 0..10 {
wal.append(&Record::put(format!("key{}", i), b"value")).await?;
}
wal.sync().await?;
drop(wal);
// Reopen and verify recovery
let (wal, info) = Wal::open(config).await?;
assert_eq!(info.valid_records, 10);
assert_eq!(info.bytes_truncated, 0);
assert!(!info.corruption_detected);
}
Test 2: Partial Write¶
#[tokio::test]
async fn test_recovery_partial_write() {
// Write 5 records, sync
let (wal, _) = Wal::open(config).await?;
for i in 0..5 {
wal.append(&Record::put(format!("key{}", i), b"value")).await?;
}
wal.sync().await?;
drop(wal);
// Append garbage (simulate crash mid-write)
let seg_path = dir.join("000000.wal");
let mut file = OpenOptions::new().append(true).open(&seg_path).await?;
file.write_all(b"PARTIAL_GARBAGE").await?;
file.sync_all().await?;
// Recover - should truncate garbage
let (wal, info) = Wal::open(config).await?;
assert_eq!(info.valid_records, 5);
assert!(info.bytes_truncated > 0);
assert!(info.corruption_detected);
}
Test 3: CRC Mismatch¶
#[tokio::test]
async fn test_recovery_crc_mismatch() {
// Write and sync 5 records
// ... (same as above)
// Corrupt data in last record
let seg_path = dir.join("000000.wal");
let mut data = tokio::fs::read(&seg_path).await?;
data[data.len() - 15] ^= 0xFF; // Flip bits in value
tokio::fs::write(&seg_path, &data).await?;
// Recover - should detect corruption and truncate
let (wal, info) = Wal::open(config).await?;
assert!(info.valid_records < 5); // Lost corrupted record
assert!(info.bytes_truncated > 0);
assert!(info.corruption_detected);
}
Observability¶
Recovery emits events for monitoring:
// Corruption detected and truncated
VizEvent::Wal(WalEvt {
node: node_id,
seg: segment_id,
kind: WalKind::CorruptionTruncated,
})
What to monitor: - Frequency of corruption events: Should be rare (only after crashes) - Bytes truncated: Large truncations indicate serious issues - Recovery time: Should be fast (< 1s for most workloads)
Alert on: - Corruption events during normal operation (not after restart) - Large amounts of truncated data (> 1% of total) - Slow recovery (> 10s)
Design Decisions¶
Why Prefix-Valid Recovery?¶
Alternatives: 1. Reject all if any corruption: Too strict (lose all data on single bad record) 2. Try to skip corrupted records: Unsafe (may violate ordering guarantees) 3. Prefix-valid: Keep all good data, truncate from first corruption
Prefix-valid is the right trade-off: - Maximizes data recovery - Simple to implement and reason about - Preserves ordering guarantees - Standard approach (used by PostgreSQL, MySQL, etc.)
Why Read Entire Segment?¶
Alternatives: 1. Stream validation (buffer 1 MB at a time) - Less memory - More complex code - Slower (more syscalls)
- Read entire segment
- Simple code
- Fast (one large read)
- Higher memory (128 MB)
Current choice: Read entire segment - 128 MB is acceptable on modern systems - Simplicity > micro-optimization - Can change later if needed (not a breaking change)
Why Atomic Truncation?¶
Without atomic rename:
1. Open segment for writing
2. Truncate to valid offset
3. CRASH HERE → File may be corrupted or empty
4. Sync
With atomic rename:
1. Write valid data to temp file
2. Sync temp file
3. CRASH HERE → Original file untouched
4. Rename temp → original (atomic)
5. CRASH HERE → New file is complete
Atomic rename ensures either old file or new file exists, never a partial state.
Key Takeaways¶
- Recovery is automatic on WAL open
- No manual intervention required
-
Scans all segments, validates all records
-
Prefix-valid recovery strategy
- Keep all valid records from the beginning
-
Truncate from first corruption/incompleteness
-
CRC32C provides strong validation
- Detects all corruption during recovery
-
Hardware-accelerated for speed
-
Truncation is atomic
- Temp file + rename pattern
-
Safe even if crash during recovery
-
RecoveryInfo tells you what happened
- Check
corruption_detectedandbytes_truncated -
Log warnings if corruption found
-
Recovery is fast
- ~150ms for 100 MB
- ~15s for 10 GB
- I/O bound, not CPU bound
What's Next?¶
Now that you understand recovery, explore:
- Concurrency Model - How concurrent access is handled
- Recovery Guarantees - High-level guarantees
Or dive into the implementation in crates/nori-wal/src/recovery.rs on GitHub.