Durability & Fsync Policies¶
How to balance durability and performance with different fsync strategies.
Table of contents¶
The Durability Problem¶
When you write data to a file, it doesn't go directly to disk. It goes through multiple layers of buffering:
graph TD
A[Your Code] -->|write| B[Application Buffer]
B -->|flush| C[OS Page Cache]
C -->|fsync| D[Disk Controller Cache]
D --> E[Physical Disk Platter/NAND]
style E fill:#90EE90
style A fill:#FFB6C1
style C fill:#FFD700
Problem: At each layer, your data can be lost if power fails.
| Layer | Survives Process Crash? | Survives OS Crash/Power Failure? |
|---|---|---|
| Application buffer | No | No |
| OS page cache | Yes | No |
| Disk controller cache | Yes | Maybe (if battery-backed) |
| Physical disk | Yes | Yes |
Only data on physical disk is truly durable.
What is fsync()?¶
fsync() is a system call that forces the operating system to flush all buffers to physical disk:
use tokio::fs::File;
let mut file = File::create("data.log").await?;
file.write_all(b"important data").await?;
// Data is in OS cache, NOT on disk yet!
file.sync_all().await?; // Calls fsync() - forces to disk
// Now data is guaranteed to be on physical disk
flush() vs sync()¶
Don't confuse flush() and sync():
// flush() - Writes from app buffer to OS cache (not durable!)
file.flush().await?;
// sync_all() - Forces OS cache to physical disk (durable!)
file.sync_all().await?;
Key difference:
- flush(): App buffer → OS page cache (survives process crash, NOT power failure)
- sync_all() / fsync(): OS page cache → physical disk (survives power failure)
The Performance Cost of fsync()¶
fsync() is expensive. Here's why:
Latency¶
On a typical SSD:
write() call: ~10 microseconds (in-memory)
fsync() call: ~1-5 milliseconds (force to disk)
That's 100-500x slower!
On a typical HDD:
write() call: ~10 microseconds (in-memory)
fsync() call: ~5-10 milliseconds (seek + rotate + write)
That's 500-1000x slower!
Throughput¶
If you fsync() after every write:
for i in 0..1000 {
file.write_all(&data).await?;
file.sync_all().await?; // fsync after every write
}
Performance: - Without fsync: 1,000,000 writes/second (limited by CPU/memory) - With fsync: ~200-500 writes/second (limited by disk)
That's a 2000-5000x slowdown!
The Durability vs. Performance Trade-Off¶
You have to choose:
Option 1: Always fsync() - Maximum Durability - Every write is guaranteed durable before returning - Zero data loss on power failure - Very slow (~200-500 writes/sec)
Option 2: Never fsync() - Maximum Performance - Blazing fast (~1M writes/sec) - Data loss on power failure (up to 30-60 seconds of writes) - Not acceptable for critical data
Option 3: Batched fsync() - Balanced - Good performance (~50K-100K writes/sec) - Small data loss window (milliseconds) - Acceptable for most applications
This is what FsyncPolicy in nori-wal controls.
FsyncPolicy Options¶
nori-wal provides three fsync policies. Let's explore each one.
FsyncPolicy::Always¶
Calls fsync() after every write.
use nori_wal::{WalConfig, FsyncPolicy};
let config = WalConfig {
fsync_policy: FsyncPolicy::Always,
..Default::default()
};
Behavior:
let record = Record::put(b"key", b"value");
wal.append(&record).await?; // Internally calls fsync() before returning
Guarantees:
- Every append() is durable before it returns
- Zero data loss on power failure
- Suitable for financial transactions, critical data
Performance: - ~200-500 writes/sec (disk-bound) - p50 latency: ~2ms - p99 latency: ~10ms
Use when: - You cannot tolerate ANY data loss - Examples: financial transactions, medical records, audit logs
FsyncPolicy::Batch(Duration)¶
Batches fsync() calls within a time window.
use std::time::Duration;
let config = WalConfig {
fsync_policy: FsyncPolicy::Batch(Duration::from_millis(5)),
..Default::default()
};
Behavior:
// First write triggers fsync
wal.append(&record1).await?; // Calls fsync()
// Subsequent writes within 5ms window skip fsync
wal.append(&record2).await?; // No fsync!
wal.append(&record3).await?; // No fsync!
// ... 5ms passes ...
// Next write triggers fsync again
wal.append(&record4).await?; // Calls fsync()
Guarantees:
- Potential data loss: up to window duration of writes on power failure
- All writes are durable within window milliseconds
- Good balance of performance and durability
Performance (5ms window): - ~50K-100K writes/sec - p50 latency: ~50μs (no fsync) - p99 latency: ~5ms (fsync)
Use when: - You can tolerate small data loss (milliseconds) - Examples: web applications, session state, caches
Tuning the window:
// More durable: fsync every 1ms
FsyncPolicy::Batch(Duration::from_millis(1))
// Performance: ~40K writes/sec, 1ms max data loss
// Balanced: fsync every 5ms (default)
FsyncPolicy::Batch(Duration::from_millis(5))
// Performance: ~80K writes/sec, 5ms max data loss
// Higher throughput: fsync every 10ms
FsyncPolicy::Batch(Duration::from_millis(10))
// Performance: ~100K writes/sec, 10ms max data loss
FsyncPolicy::Os¶
Lets the operating system decide when to flush.
Behavior:
Data is eventually flushed to disk by the OS, typically every 30-60 seconds (controlled by kernel settings like vm.dirty_expire_centisecs on Linux).
Guarantees: - No durability guarantees! - Potential data loss: 30-60 seconds of writes on power failure - Data survives process crash (it's in OS cache)
Performance: - ~100K-1M writes/sec (memory-bound) - p50 latency: ~10μs - p99 latency: ~100μs
Use when: - Acceptable data loss on power failure - Examples: event logs, metrics, caches, analytics
Warning: Never use Os policy for critical data!
Decision Matrix¶
Use this flowchart to pick the right policy:
graph TD
A[What's your use case?] --> B{Can you tolerate ANY data loss?}
B -->|No| C[FsyncPolicy::Always]
B -->|Yes| D{How much data loss is OK?}
D -->|None| C
D -->|< 10ms| E[FsyncPolicy::Batch 1-5ms]
D -->|10-100ms| F[FsyncPolicy::Batch 10-50ms]
D -->|Seconds OK| G[FsyncPolicy::Os]
C --> H[~420 writes/sec<br/>0ms data loss]
E --> I[~86K writes/sec<br/>1-5ms data loss]
F --> J[~95K writes/sec<br/>10-50ms data loss]
G --> K[~110K writes/sec<br/>30-60s data loss]
style C fill:#FFB6C1
style E fill:#90EE90
style F fill:#FFD700
style G fill:#FFA500
Real-World Examples¶
Example 1: Banking System¶
// Financial transactions - zero data loss acceptable
let config = WalConfig {
fsync_policy: FsyncPolicy::Always,
..Default::default()
};
let (wal, _) = Wal::open(config).await?;
// Every transaction is durable
let tx = Record::put(b"account:1234", b"balance:5000");
wal.append(&tx).await?; // Waits for fsync
// If we return success, the transaction is guaranteed on disk
Trade-off: Low throughput (~420 tx/sec), but every transaction is safe.
Example 2: Web Application Session Store¶
// Session state - 5ms data loss acceptable
let config = WalConfig {
fsync_policy: FsyncPolicy::Batch(Duration::from_millis(5)),
..Default::default()
};
let (wal, _) = Wal::open(config).await?;
// High throughput, small data loss window
for session in sessions {
let record = Record::put(session.id, session.data);
wal.append(&record).await?; // Fast!
}
// ~80K sessions/sec throughput
// Worst case: lose 5ms of session updates on power failure
Trade-off: High throughput, acceptable data loss (users may need to re-login).
Example 3: Analytics Event Log¶
// Analytics events - data loss acceptable
let config = WalConfig {
fsync_policy: FsyncPolicy::Os,
..Default::default()
};
let (wal, _) = Wal::open(config).await?;
// Maximum throughput
for event in events {
let record = Record::put(event.id, event.data);
wal.append(&record).await?; // Blazing fast!
}
// ~110K events/sec throughput
// Worst case: lose 30-60s of events on power failure
// Acceptable for analytics (not mission-critical)
Trade-off: Maximum throughput, but significant data loss possible.
Batching: How It Works Internally¶
Let's look at how FsyncPolicy::Batch actually works inside nori-wal:
struct BatchedFsync {
window: Duration,
last_sync: Instant,
}
impl BatchedFsync {
async fn append(&mut self, record: &Record) -> Result<Position> {
// 1. Write to file (buffered, fast)
let position = self.write_to_file(record).await?;
// 2. Check if we need to fsync
let now = Instant::now();
if now.duration_since(self.last_sync) >= self.window {
// Time window elapsed, fsync now
self.file.sync_all().await?;
self.last_sync = now;
}
// 3. Return position (might not be synced yet!)
Ok(position)
}
}
Key insight: The first write after the window expires pays the fsync cost. Subsequent writes within the window are fast.
Time →
0ms: append() → fsync (2ms)
1ms: append() → no fsync (fast)
2ms: append() → no fsync (fast)
3ms: append() → no fsync (fast)
4ms: append() → no fsync (fast)
5ms: append() → fsync (2ms)
6ms: append() → no fsync (fast)
...
Manual fsync() with FsyncPolicy::Os¶
If you use FsyncPolicy::Os, you can manually call sync() when you want durability:
let config = WalConfig {
fsync_policy: FsyncPolicy::Os, // No automatic fsync
..Default::default()
};
let (wal, _) = Wal::open(config).await?;
// Fast writes (no fsync)
for i in 0..1000 {
let record = Record::put(format!("key:{}", i).as_bytes(), b"value");
wal.append(&record).await?;
}
// Now manually fsync when you want durability
wal.sync().await?; // Force everything to disk
This gives you fine-grained control over when to pay the fsync cost.
Use case: Batch inserts where you want durability at the end of the batch, not after every record.
Benchmarks: Policy Comparison¶
Here are real benchmark results from nori-wal on a typical SSD (Samsung 970 EVO):
| Policy | Throughput (writes/sec) | p50 Latency | p99 Latency | Data Loss on Power Failure |
|---|---|---|---|---|
Always |
420 | 2.1ms | 9.8ms | 0ms |
Batch(1ms) |
45,000 | 45μs | 1.2ms | ≤1ms |
Batch(5ms) |
86,000 | 38μs | 5.3ms | ≤5ms |
Batch(10ms) |
95,000 | 35μs | 10.4ms | ≤10ms |
Os |
110,000 | 28μs | 95μs | 30-60s |
Observations:
- Always is 200x slower than Batch(5ms)
- Batch(5ms) offers good balance (86K writes/sec, 5ms max data loss)
- Os is only slightly faster than Batch(10ms), but loses 30-60s of data
Recommendation: Use Batch(5ms) for most applications.
Common Misconceptions¶
"fsync() flushes to disk controller cache, not physical disk"¶
Response: This used to be true for some old HDDs, but modern drives respect the FUA (Force Unit Access) flag, which bypasses the cache. On Linux, fsync() uses barriers to ensure physical durability.
If you're paranoid, use O_DIRECT (but this is rarely necessary and hurts performance).
"I can just call fsync() in the background"¶
Response: No! If you acknowledge success to the user before fsync() completes, you've violated the write-ahead guarantee.
Wrong:
// WRONG!
async fn append_wrong(record: &Record) -> Position {
let position = write_to_file(record).await;
// Return immediately (WRONG!)
tokio::spawn(async move {
file.sync_all().await.unwrap(); // Async fsync
});
position // User thinks it's durable, but it's not!
}
Right:
// RIGHT!
async fn append_right(record: &Record) -> Position {
let position = write_to_file(record).await;
file.sync_all().await.unwrap(); // Wait for fsync
position // Now it's safe to return
}
"Batched fsync is less durable"¶
Response: It's a trade-off. With Batch(5ms), you have a 5ms window of potential data loss. But:
- You still get durability within 5ms
- For most applications, 5ms of data loss is acceptable
- Users can retry failed requests anyway
If 5ms is unacceptable, use Always. But understand the performance cost.
Configuring OS-Level Durability¶
Even with fsync(), the OS and disk have settings that affect durability:
Linux: Disk Write Cache¶
Check if write cache is enabled:
If it shows write-caching = 1, the disk controller may cache writes. On power failure, this cache is lost unless the drive has battery backup.
To disable write cache (paranoid mode):
Warning: This makes fsync() even slower! Only do this for critical data on drives without battery backup.
Linux: Filesystem Barriers¶
Modern filesystems (ext4, XFS, Btrfs) use barriers to enforce durability. Make sure barriers are enabled:
If barriers are disabled (barrier=0), fsync() may not be durable!
Key Takeaways¶
- fsync() is the only way to guarantee durability
flush()is NOT enough (only goes to OS cache)-
Only
fsync()forces data to physical disk -
fsync() is expensive
- 100-1000x slower than buffered writes
- Using
Alwayspolicy: ~420 writes/sec -
Using
Batch(5ms)policy: ~86K writes/sec -
Choose the right policy for your use case
- Critical data →
Always - Most applications →
Batch(1-10ms) -
Acceptable data loss →
Os -
Batched fsync is a good default
- 5ms data loss window is acceptable for most apps
- 200x performance improvement over
Always -
Still provides strong durability guarantees
-
Never acknowledge success before fsync completes
- This violates the write-ahead guarantee
- Can lead to data loss despite using a WAL
What's Next?¶
Now that you understand fsync policies, explore:
- Recovery Guarantees - What happens after a crash
- When to Use a WAL - Scenarios where WALs shine
- Performance Tuning - Optimize for your workload
Or see real-world examples in Recipes.