Benchmarks¶
Detailed performance measurements for nori-wal.
Test Environment¶
All benchmarks were run on:
Hardware:
CPU: Apple M2 Pro (10 cores, 6 performance + 4 efficiency, 3.5 GHz)
RAM: 16 GB LPDDR5
Disk: 1 TB NVMe SSD (Apple integrated)
Software:
OS: macOS 14.0 (Sonoma)
Filesystem: APFS
Rust: 1.75.0
Tokio: 1.35.0
Benchmark Tool:
Criterion.rs with default settings
3-second warm-up
100 samples per benchmark
Note: Your results will vary based on hardware, especially disk type.
Write Performance¶
Single-Threaded Sequential Writes¶
Measures how fast individual append() calls are.
Configuration:
- Record: 1 KB key + value
- No batching (one append at a time)
- Various fsync policies
Results:
| Fsync Policy | Throughput | Latency p50 | Latency p99 | Notes |
|---|---|---|---|---|
| Os | 110K writes/sec | 9.1µs | 15µs | Best performance, no durability guarantee |
| Batch(5ms) | 86K writes/sec | 11.6µs | 5.5ms | Good balance |
| Batch(1ms) | 55K writes/sec | 18.2µs | 1.2ms | Conservative |
| Always | 420 writes/sec | 2.4ms | 3.1ms | Maximum durability |
Analysis:
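Batch(5ms) retains about 78% of the Os throughput (86K vs 110K writes/sec) while bounding potential data loss to the 5ms window; its p99 latency (5.5ms) tracks the batch interval. Always is roughly 260x slower than Os because every append pays for a full fsync (~2.4ms).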
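To make the trade-off between the three policies concrete, here is a minimal sketch of the decision each one implies. The `Policy` enum, `should_sync`, and `max_loss_window` are hypothetical stand-ins written for this page, not nori-wal's actual types or scheduling logic.

```rust
use std::time::Duration;

/// Local stand-in for the three benchmarked policies (NOT the crate's own type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Policy {
    Os,
    Batch(Duration),
    Always,
}

/// Decide whether an append should be followed by an fsync, given how long
/// ago the previous fsync happened.
fn should_sync(policy: Policy, since_last_sync: Duration) -> bool {
    match policy {
        Policy::Os => false, // leave flushing to the OS page cache
        Policy::Batch(window) => since_last_sync >= window,
        Policy::Always => true, // every append pays for an fsync
    }
}

/// Upper bound on how much recently written data a crash can lose,
/// or `None` when there is no bound.
fn max_loss_window(policy: Policy) -> Option<Duration> {
    match policy {
        Policy::Os => None,
        Policy::Batch(window) => Some(window),
        Policy::Always => Some(Duration::ZERO),
    }
}
```

Under this model, `Batch(5ms)` only pays for an fsync once per window, which is consistent with its median latency staying close to `Os` while its p99 sits near the window length.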
Throughput by Record Size:
| Record Size | Os | Batch(5ms) | Always |
|---|---|---|---|
| 100 bytes | 119K/sec (11 MiB/s) | 101K/sec (9 MiB/s) | 755/sec (74 KiB/s) |
| 1 KB | 110K/sec (108 MiB/s) | 86K/sec (84 MiB/s) | 420/sec (410 KiB/s) |
| 10 KB | 69K/sec (673 MiB/s) | 44K/sec (430 MiB/s) | 323/sec (3.2 MiB/s) |
Larger records achieve higher throughput (bytes/sec) but lower write rate (records/sec).
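The MiB/s figures in the table follow from the write rate and the record size. A small helper (the name `mib_per_sec` is ours, for illustration) shows the conversion:

```rust
/// Convert a write rate and a record size into MiB/s (1 MiB = 1,048,576 bytes).
fn mib_per_sec(records_per_sec: f64, record_bytes: f64) -> f64 {
    records_per_sec * record_bytes / (1024.0 * 1024.0)
}
```

For example, `mib_per_sec(69_000.0, 10_240.0)` gives roughly 674 MiB/s, which matches the 10 KB row under Os to within rounding of the reported write rate.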
Batched Writes¶
Measures throughput when batching multiple appends before syncing.
Configuration:
- Batch size: N records
- Record size: 1 KB
- Fsync after each batch
Results:
| Batch Size | Throughput | Total Time | Avg Latency/Record |
|---|---|---|---|
| 10 | 2.8 MiB/s | 3.5ms | 350µs |
| 100 | 22.5 MiB/s | 4.3ms | 43µs |
| 1,000 | 102 MiB/s | 9.5ms | 9.5µs |
| 10,000 | 198 MiB/s | 49ms | 4.9µs |
Analysis:
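Batching improves throughput by amortizing the fsync cost across many records: going from 10 to 1,000 records per batch raises throughput about 36x (2.8 to 102 MiB/s). Beyond that the gains flatten, since a further 10x increase in batch size only doubles throughput while the time per batch grows to 49ms. The sweet spot is 100-1000 records per batch.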
Code Example:

    // Slow: Sync after each record
    for record in records {
        wal.append(&record).await?;
        wal.sync().await?; // 2-3ms each!
    }

    // Fast: Batch sync
    for record in records {
        wal.append(&record).await?;
    }
    wal.sync().await?; // Single 2-3ms for all
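The same pattern can be tried without the WAL at all. The sketch below uses only the standard library and a plain file as a stand-in for the log; `append_batched` is a hypothetical helper, not part of nori-wal.

```rust
use std::fs::File;
use std::io::{self, Write};

/// Append every record to `file`, calling fsync once per `batch_size` records.
/// Returns the number of fsync calls made.
fn append_batched(file: &mut File, records: &[Vec<u8>], batch_size: usize) -> io::Result<usize> {
    assert!(batch_size > 0, "batch_size must be at least 1");
    let mut syncs = 0;
    for chunk in records.chunks(batch_size) {
        for record in chunk {
            file.write_all(record)?;
        }
        file.sync_all()?; // one durable point per batch
        syncs += 1;
    }
    Ok(syncs)
}
```

With 250 records and a batch size of 100 this issues 3 fsync calls instead of 250, which is where the per-record latency reduction in the table comes from.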
Concurrent Writes¶
Multiple async tasks writing concurrently.
Configuration:
- Concurrent tasks: 1, 2, 4, 8
- Each task: 100 writes of 1 KB records
- Fsync policy: Os
Results:
| Tasks | Throughput | Total Time | Scaling |
|---|---|---|---|
| 1 | 19.2 MiB/s | 5.1ms | 1.0x |
| 2 | 19.8 MiB/s | 9.9ms | 1.0x |
| 4 | 14.7 MiB/s | 26.7ms | 0.8x |
| 8 | 17.5 MiB/s | 44.6ms | 0.9x |
Analysis:
Adding tasks does not raise throughput: it stays flat or drops (14.7 to 19.8 MiB/s across 1 to 8 tasks) because writes are serialized by a mutex. For high-concurrency workloads, use batching instead.
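One way to combine many producers with batching is to funnel records through a channel to a single writer that flushes once per batch. The sketch below is an illustration under stated assumptions: it uses standard-library threads and an in-memory `Vec<u8>` rather than Tokio tasks and the real WAL, and `run_single_writer` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawn `producers` threads that each send `per_producer` 1 KB records to a
/// single writer. The writer appends them to an in-memory log and "flushes"
/// once per `batch` records. Returns (bytes_written, flush_count).
fn run_single_writer(producers: usize, per_producer: usize, batch: usize) -> (usize, usize) {
    assert!(batch > 0, "batch must be at least 1");
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    let handles: Vec<_> = (0..producers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..per_producer {
                    // 1 KB record filled with the producer id.
                    tx.send(vec![id as u8; 1024]).expect("writer hung up");
                }
            })
        })
        .collect();
    drop(tx); // the channel closes once every producer's clone is dropped

    let mut log: Vec<u8> = Vec::new();
    let mut pending = 0;
    let mut flushes = 0;
    for record in rx {
        log.extend_from_slice(&record);
        pending += 1;
        if pending == batch {
            flushes += 1; // a real WAL would fsync here
            pending = 0;
        }
    }
    if pending > 0 {
        flushes += 1; // flush the final partial batch
    }
    for handle in handles {
        handle.join().expect("producer panicked");
    }
    (log.len(), flushes)
}
```

Producers never contend on a write lock here; they only enqueue, and the number of flushes depends on the batch size rather than on the number of producers.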
Read Performance¶
Sequential Scan¶
Reading all records from beginning to end.
Configuration:
- Pre-written WAL with N records
- Record size: 1 KB
- Scan from start to end
Results:
| Record Count | Total Size | Time | Throughput |
|---|---|---|---|
| 100 | 100 KB | 1.9ms | 52 MiB/s |
| 1,000 | 1 MB | 18.6ms | 52 MiB/s |
| 10,000 | 10 MB | 182ms | 54 MiB/s |
Analysis:
Read throughput holds at 52-54 MiB/s regardless of record count. The bottleneck is decoding and CRC validation, not I/O (NVMe can sustain more than 3 GB/s).
Optimization opportunity: Batch CRC validation or use SIMD.
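For reference, this is what a scalar, per-record checksum loop looks like. This page does not say which CRC variant nori-wal uses, so the sketch assumes plain CRC-32 (IEEE) with a byte-at-a-time lookup table; the function names are ours. It is the kind of loop that batching or SIMD/hardware CRC instructions would replace.

```rust
/// Build the 256-entry lookup table for CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for i in 0..256u32 {
        let mut c = i;
        for _ in 0..8 {
            c = if c & 1 != 0 { 0xEDB8_8320 ^ (c >> 1) } else { c >> 1 };
        }
        table[i as usize] = c;
    }
    table
}

/// Scalar, byte-at-a-time CRC-32: one table lookup per input byte.
fn crc32(table: &[u32; 256], data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc = table[((crc ^ byte as u32) & 0xFF) as usize] ^ (crc >> 8);
    }
    !crc
}

/// Validate one record: recompute the checksum and compare it with the stored one.
fn record_is_valid(table: &[u32; 256], payload: &[u8], stored_crc: u32) -> bool {
    crc32(table, payload) == stored_crc
}
```

The standard check value for this variant is `0xCBF43926` for the ASCII input `123456789`, which is a quick way to confirm the table is built correctly.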
Random Reads¶
The WAL does not support efficient random reads. To reach a given record you must scan forward from a known position:

    // To read record at position 1000:
    // 1. Scan from beginning (or last checkpoint)
    // 2. Skip first 999 records
    // 3. Read record 1000
    // This is O(n), not O(1)
For random access, build an index on top of WAL (see Key-Value Store recipe).
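A minimal sketch of such an index: scan the log once, remember where each record starts, then jump straight to any record. The framing used here (a 4-byte little-endian length followed by the payload) is an assumption made for the example and is not nori-wal's on-disk format; all function names are hypothetical.

```rust
/// Read the 4-byte little-endian length prefix at `pos`.
fn read_len(log: &[u8], pos: usize) -> usize {
    u32::from_le_bytes([log[pos], log[pos + 1], log[pos + 2], log[pos + 3]]) as usize
}

/// Scan the log once (O(n)) and collect the byte offset of every complete record.
fn build_index(log: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos + 4 <= log.len() {
        let len = read_len(log, pos);
        if pos + 4 + len > log.len() {
            break; // truncated tail: stop at the last complete record
        }
        offsets.push(pos);
        pos += 4 + len;
    }
    offsets
}

/// Fetch record `n` (0-based) with a single index lookup instead of a scan.
fn read_record<'a>(log: &'a [u8], index: &[usize], n: usize) -> Option<&'a [u8]> {
    let start = *index.get(n)?;
    let len = read_len(log, start);
    Some(&log[start + 4..start + 4 + len])
}

/// Helper for the example: frame one payload with its length prefix.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}
```

The index costs one `usize` per record and has to be rebuilt (or persisted separately) after a restart, which is the price of O(1) lookups on top of an append-only log.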
Recovery Performance¶
Cold Start Recovery¶
Time to recover and validate WAL on startup.
Configuration:
- WAL with N records pre-written
- Record size: 1 KB
- Includes full CRC validation
Results:
| Record Count | WAL Size | Recovery Time | Throughput |
|---|---|---|---|
| 1,000 | 1 MB | 458µs | 2.1 GiB/s |
| 10,000 | 10 MB | 2.9ms | 3.3 GiB/s |
| 50,000 | 50 MB | 14.6ms | 3.3 GiB/s |
| 100,000 | 100 MB | 30ms | 3.2 GiB/s |
Analysis:
Recovery is very fast (~3.3 GiB/s) because:
1. Sequential read from disk (fast on NVMe)
2. Efficient CRC validation (hardware accelerated)
3. Simple format (varint + bytes)
A 1 GB WAL recovers in ~300ms.
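The estimate is simply size divided by throughput. A small helper (hypothetical name) makes it easy to plug in your own WAL size and measured rate:

```rust
/// Estimate recovery time in milliseconds for a WAL of `wal_bytes`,
/// given a measured recovery throughput in GiB/s.
fn estimated_recovery_ms(wal_bytes: u64, gib_per_sec: f64) -> f64 {
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;
    (wal_bytes as f64 / GIB) / gib_per_sec * 1000.0
}
```

At 3.3 GiB/s, 1 GiB works out to about 303ms, and 100 MiB at 3.2 GiB/s to about 30.5ms, in line with the last row of the table.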
Multi-Segment Recovery¶
Recovery with multiple segments.
Configuration:
- Segments: 2, 5, or 10 of 1 MB each
- Records: 1,000 per segment (for example, 5,000 total for 5 segments)
- Fsync policy: Batch(5ms)
Results:
| Segments | Total Size | Recovery Time | Throughput |
|---|---|---|---|
| 2 | 2 MB | 1.3ms | 3.6 GiB/s |
| 5 | 5 MB | 1.5ms | 3.2 GiB/s |
| 10 | 10 MB | 2.8ms | 3.4 GiB/s |
Analysis:
Multiple segments don't significantly slow recovery: the cost of opening additional files is small compared with the time spent reading and validating records.
Compression Performance¶
LZ4 Compression¶
Configuration:
- Record: 1 KB of compressible data (repeated text)
- Compression: LZ4
Results:
| Operation | Uncompressed | LZ4 Compressed | Overhead |
|---|---|---|---|
| Encode | 100ns | 2.1µs | 21x slower |
| Decode | 100ns | 1.3µs | 13x slower |
| Size | 1024 bytes | 89 bytes | 11.5x smaller |
Throughput Impact:
| Fsync Policy | Uncompressed | LZ4 | Slowdown |
|---|---|---|---|
| Os | 110K writes/sec | 95K writes/sec | 1.16x |
| Batch(5ms) | 86K writes/sec | 78K writes/sec | 1.10x |
Analysis:
LZ4 adds ~2µs per write but can cut storage by 10x or more for compressible data. It is a good trade-off for text, JSON, or repetitive data.
Zstd Compression¶
Configuration:
- Same as LZ4 test
- Compression: Zstd (level 3)
Results:
| Operation | Uncompressed | Zstd | Overhead |
|---|---|---|---|
| Encode | 100ns | 8.5µs | 85x slower |
| Decode | 100ns | 3.2µs | 32x slower |
| Size | 1024 bytes | 62 bytes | 16.5x smaller |
Throughput Impact:
| Fsync Policy | Uncompressed | Zstd | Slowdown |
|---|---|---|---|
| Os | 110K writes/sec | 72K writes/sec | 1.53x |
| Batch(5ms) | 86K writes/sec | 61K writes/sec | 1.41x |
Analysis:
Zstd compresses this data about 1.4x smaller than LZ4 (62 vs 89 bytes) but encodes about 4x slower (8.5µs vs 2.1µs). Best for cold storage or when storage is expensive.
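To weigh the slowdown against the space savings, it helps to look at how many bytes actually reach disk per second. The helper below is a back-of-the-envelope sketch with hypothetical names; it ignores per-record framing overhead and only applies to data as compressible as the test payload.

```rust
/// Bytes reaching disk per second at a given write rate and stored record size.
fn disk_bytes_per_sec(writes_per_sec: f64, stored_bytes_per_record: f64) -> f64 {
    writes_per_sec * stored_bytes_per_record
}

/// How many times less disk traffic a compressed configuration produces.
/// Each argument is (writes_per_sec, stored_bytes_per_record).
fn disk_traffic_reduction(base: (f64, f64), compressed: (f64, f64)) -> f64 {
    disk_bytes_per_sec(base.0, base.1) / disk_bytes_per_sec(compressed.0, compressed.1)
}
```

With the Os-policy numbers from the tables, `disk_traffic_reduction((110_000.0, 1024.0), (95_000.0, 89.0))` is about 13x for LZ4, and the Zstd equivalent with `(72_000.0, 62.0)` is about 25x.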
Fsync Latency¶
Direct measurement of fsync() system call.
Configuration:
- 128 MB file
- macOS APFS
- Varying amounts of dirty data
Results:
| Dirty Data | Fsync Time |
|---|---|
| 0 KB (clean) | 12µs |
| 4 KB | 245µs |
| 64 KB | 1.2ms |
| 1 MB | 2.1ms |
| 10 MB | 8.7ms |
Analysis:
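Fsync time grows with the amount of dirty data, though less than linearly: going from 4 KB to 10 MB (2,560x more data) raises fsync time only about 35x (245µs to 8.7ms). This is why batching helps: one 2-8ms fsync is amortized across many writes.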
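To get comparable numbers on your own disk, a rough harness along these lines can help. It is a simplified sketch, not the benchmark used above: it writes into a freshly truncated file rather than a pre-existing 128 MB one, takes a single sample, and `time_fsync` is a hypothetical name.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;
use std::time::{Duration, Instant};

/// Write `dirty_bytes` of new data to `path`, then time a single fsync.
fn time_fsync(path: &Path, dirty_bytes: usize) -> io::Result<Duration> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(path)?;
    file.write_all(&vec![0xAB_u8; dirty_bytes])?; // dirty the page cache
    let start = Instant::now();
    file.sync_all()?; // the call being measured
    Ok(start.elapsed())
}
```

Run it several times per size and compare medians; a single sample is noisy, and the first call also pays for file creation.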
Segment Rotation Overhead¶
Time to rotate to a new segment.
Configuration:
- Segment size: 128 MB
- Pre-allocation: enabled
- Fsync before rotation: yes
Results:
| Operation | Time |
|---|---|
| Close old segment | 2.1ms |
| Truncate to actual size | 180µs |
| Create new segment file | 85µs |
| Pre-allocate 128 MB | 12ms |
| Total rotation time | ~15ms |
Analysis:
Rotation takes ~15ms every 30-60 seconds (depending on write rate), which is under 0.1% of wall-clock time. Pre-allocation accounts for most of it (12ms).
Without pre-allocation:
| Operation | Time |
|---|---|
| Close old segment | 2.1ms |
| Create new segment | 85µs |
| Total | ~2.2ms |
Disabling pre-allocation makes rotation 7x faster but loses early disk-full detection.
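How often rotation happens depends on how quickly a segment fills. The two helpers below (hypothetical names) sketch the arithmetic so you can plug in your own segment size and sustained write rate.

```rust
/// Seconds needed to fill one segment at a steady write rate.
fn seconds_per_segment(segment_bytes: f64, bytes_per_sec: f64) -> f64 {
    segment_bytes / bytes_per_sec
}

/// Share of wall-clock time spent rotating, as a percentage.
fn rotation_overhead_pct(rotation_ms: f64, interval_secs: f64) -> f64 {
    rotation_ms / (interval_secs * 1000.0) * 100.0
}
```

A 128 MiB segment fills in 32 seconds at 4 MiB/s, so a 30-60 second rotation interval corresponds to roughly 2-4 MiB/s of sustained writes. At that interval a 15ms rotation costs about 0.025-0.05% of wall-clock time; at higher write rates rotation is more frequent and the share rises accordingly.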
Comparison with Other WALs¶
Methodology: Same hardware, same durability guarantees (Always fsync).
| System | Throughput | Latency p99 | Notes |
|---|---|---|---|
| nori-wal | 420 writes/sec | 3.1ms | This benchmark |
| SQLite WAL | 380 writes/sec | 3.4ms | Same fsync policy |
| RocksDB WAL | 450 writes/sec | 2.9ms | Slightly faster |
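| Raw fsync | 400 writes/sec | 2.5ms | Baseline (fsync only) |
Analysis:
All four results fall between 380 and 450 writes/sec, so under Always the fsync itself dominates. nori-wal is within about 5% of the raw fsync baseline and competitive with mature systems.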
With batching (Batch 5ms):
| System | Throughput | Latency p99 |
|---|---|---|
| nori-wal | 86K writes/sec | 5.5ms |
| RocksDB WAL | 91K writes/sec | 5.3ms |
Performance is similar across well-designed WAL implementations.
Performance Summary¶
Best Performance:
- Use FsyncPolicy::Batch(5ms) for a good balance of throughput and durability
- Batch writes before syncing
- Use LZ4 compression for text/JSON
- NVMe SSD for storage
Expect:
- 86K writes/sec with 5ms durability guarantee
- 102 MiB/s with 1000-record batches
- 52 MiB/s sequential read
- ~300ms recovery per GB of WAL (~3.3 GiB/s)
Limits:
- Always fsync: ~400 writes/sec (disk bound)
- Concurrent writes: Limited by single writer lock
- Random reads: Not supported (by design)
See Tuning Guide for optimization strategies.