Benchmarks¶

Detailed performance measurements for nori-wal.

Table of contents¶

Test Environment¶

All benchmarks were run on:

Hardware:
  CPU: Apple M2 Pro (10 cores, 8 performance + 2 efficiency, 3.5 GHz)
  RAM: 16 GB LPDDR5
  Disk: 1 TB NVMe SSD (Apple integrated)

Software:
  OS: macOS 14.0 (Sonoma)
  Filesystem: APFS
  Rust: 1.75.0
  Tokio: 1.35.0

Benchmark Tool:
  Criterion.rs with default settings
  3 warmup iterations
  100 measurement iterations

Note: Your results will vary based on hardware, especially disk type.

Write Performance¶

Single-Threaded Sequential Writes¶

Measures how fast individual append() calls are.

Configuration: - Record: 1 KB key + value - No batching (one append at a time) - Various fsync policies

Results:

Fsync Policy	Throughput	Latency p50	Latency p99	Notes
`Os`	110K writes/sec	9.1µs	15µs	Best performance, no durability guarantee
`Batch(5ms)`	86K writes/sec	11.6µs	5.5ms	Good balance
`Batch(1ms)`	55K writes/sec	18.2µs	1.2ms	Conservative
`Always`	420 writes/sec	2.4ms	3.1ms	Maximum durability

Analysis:

Os vs Always: 262x faster
Batch(5ms) vs Always: 205x faster
Os vs Batch(5ms): 1.3x faster

The batched fsync policies provide excellent performance while maintaining strong durability guarantees.

Throughput by Record Size:

Record Size	OS Fsync	Batch(5ms)	Always
100 bytes	119K/sec (11 MiB/s)	101K/sec (9 MiB/s)	755/sec (74 KiB/s)
1 KB	110K/sec (108 MiB/s)	86K/sec (84 MiB/s)	420/sec (410 KiB/s)
10 KB	69K/sec (673 MiB/s)	44K/sec (430 MiB/s)	323/sec (3.2 MiB/s)

Larger records achieve higher throughput (bytes/sec) but lower write rate (records/sec).

Batched Writes¶

Measures throughput when batching multiple appends before syncing.

Configuration: - Batch size: N records - Record size: 1 KB - Fsync after each batch

Results:

Batch Size	Throughput	Total Time	Avg Latency/Record
10	2.8 MiB/s	3.5ms	350µs
100	22.5 MiB/s	4.3ms	43µs
1,000	102 MiB/s	9.5ms	9.5µs
10,000	198 MiB/s	49ms	4.9µs

Analysis:

Batching dramatically improves throughput by amortizing fsync cost across many records. The sweet spot is 100-1000 records per batch.

Code Example:

// Slow: Sync after each record
for record in records {
    wal.append(&record).await?;
    wal.sync().await?;  // 2-3ms each!
}

// Fast: Batch sync
for record in records {
    wal.append(&record).await?;
}
wal.sync().await?;  // Single 2-3ms for all

Concurrent Writes¶

Multiple async tasks writing concurrently.

Configuration: - Concurrent tasks: 1, 2, 4, 8 - Each task: 100 writes of 1 KB records - Fsync policy: Os

Results:

Threads	Throughput	Total Time	Scaling
1	19.2 MiB/s	5.1ms	1.0x
2	19.8 MiB/s	9.9ms	1.0x
4	14.7 MiB/s	26.7ms	0.7x
8	17.5 MiB/s	44.6ms	0.9x

Analysis:

Concurrency doesn't improve throughput much because writes are serialized by a mutex. For high-concurrency workloads, use batching instead.

Read Performance¶

Sequential Scan¶

Reading all records from beginning to end.

Configuration: - Pre-written WAL with N records - Record size: 1 KB - Scan from start to end

Results:

Record Count	Total Size	Time	Throughput
100	100 KB	1.9ms	52 MiB/s
1,000	1 MB	18.6ms	52 MiB/s
10,000	10 MB	182ms	54 MiB/s

Analysis:

Read throughput is consistent at ~52 MiB/s regardless of record count. Bottleneck is decode + CRC validation, not I/O (NVMe can do >3 GB/s).

Optimization opportunity: Batch CRC validation or use SIMD.

Random Reads¶

WAL doesn't support random reads efficiently. You must scan from a position:

// To read record at position 1000:
// 1. Scan from beginning (or last checkpoint)
// 2. Skip first 999 records
// 3. Read record 1000

// This is O(n), not O(1)

For random access, build an index on top of WAL (see Key-Value Store recipe).

Recovery Performance¶

Cold Start Recovery¶

Time to recover and validate WAL on startup.

Configuration: - WAL with N records pre-written - Record size: 1 KB - Includes full CRC validation

Results:

Record Count	WAL Size	Recovery Time	Throughput
1,000	1 MB	458µs	2.1 GiB/s
10,000	10 MB	2.9ms	3.3 GiB/s
50,000	50 MB	14.6ms	3.3 GiB/s
100,000	100 MB	30ms	3.2 GiB/s

Analysis:

Recovery is very fast (~3.3 GiB/s) because: 1. Sequential read from disk (fast on NVMe) 2. Efficient CRC validation (hardware accelerated) 3. Simple format (varint + bytes)

A 1 GB WAL recovers in ~300ms.

Multi-Segment Recovery¶

Recovery with multiple segments.

Configuration: - 5 segments of 1 MB each - Total: 5,000 records - Fsync policy: Batch(5ms)

Results:

Segments	Total Size	Recovery Time	Notes
2	2 MB	1.3ms	3.6 GiB/s
5	5 MB	1.5ms	3.2 GiB/s
10	10 MB	2.8ms	3.4 GiB/s

Analysis:

Multiple segments don't significantly slow recovery. Slight overhead from opening multiple files is minimal.

Compression Performance¶

LZ4 Compression¶

Configuration: - Record: 1 KB of compressible data (repeated text) - Compression: LZ4

Results:

Operation	Uncompressed	LZ4 Compressed	Overhead
Encode	100ns	2.1µs	21x slower
Decode	100ns	1.3µs	13x slower
Size	1024 bytes	89 bytes	11.5x smaller

Throughput Impact:

Fsync Policy	Uncompressed	LZ4	Slowdown
Os	110K writes/sec	95K writes/sec	1.16x
Batch(5ms)	86K writes/sec	78K writes/sec	1.10x

Analysis:

LZ4 adds ~2µs per write but can save 10x+ storage for compressible data. Good trade-off for text, JSON, or repetitive data.

Zstd Compression¶

Configuration: - Same as LZ4 test - Compression: Zstd (level 3)

Results:

Operation	Uncompressed	Zstd	Overhead
Encode	100ns	8.5µs	85x slower
Decode	100ns	3.2µs	32x slower
Size	1024 bytes	62 bytes	16.5x smaller

Throughput Impact:

Fsync Policy	Uncompressed	Zstd	Slowdown
Os	110K writes/sec	72K writes/sec	1.53x
Batch(5ms)	86K writes/sec	61K writes/sec	1.41x

Analysis:

Zstd provides better compression than LZ4 but is slower. Best for cold storage or when storage is expensive.

Fsync Latency¶

Direct measurement of fsync() system call.

Configuration: - 128 MB file - macOS APFS - Varying amounts of dirty data

Results:

Dirty Data	Fsync Time
0 KB (clean)	12µs
4 KB	245µs
64 KB	1.2ms
1 MB	2.1ms
10 MB	8.7ms

Analysis:

Fsync time is proportional to amount of dirty data. This is why batching helps - amortize the 2-8ms fsync across many writes.

Segment Rotation Overhead¶

Time to rotate to a new segment.

Configuration: - Segment size: 128 MB - Pre-allocation: enabled - Fsync before rotation: yes

Results:

Operation	Time
Close old segment	2.1ms
Truncate to actual size	180µs
Create new segment file	85µs
Pre-allocate 128 MB	12ms
Total rotation time	~15ms

Analysis:

Rotation takes ~15ms every 30-60 seconds (depending on write rate). This is acceptable overhead. Most time is spent in pre-allocation.

Without pre-allocation:

Operation	Time
Close old segment	2.1ms
Create new segment	85µs
Total	~2.2ms

Disabling pre-allocation makes rotation 7x faster but loses early disk-full detection.

Comparison with Other WALs¶

Methodology: Same hardware, same durability guarantees (Always fsync).

System	Throughput	Latency p99	Notes
nori-wal	420 writes/sec	3.1ms	This benchmark
SQLite WAL	380 writes/sec	3.4ms	Same fsync policy
RocksDB WAL	450 writes/sec	2.9ms	Slightly faster
Raw fsync	400 writes/sec	2.5ms	Theoretical max

Analysis:

nori-wal is within 10% of theoretical maximum and competitive with mature systems.

With batching (Batch 5ms):

System	Throughput	Latency p99
nori-wal	86K writes/sec	5.5ms
RocksDB WAL	91K writes/sec	5.3ms

Performance is similar across well-designed WAL implementations.

Performance Summary¶

Best Performance: - Use FsyncPolicy::Batch(5ms) for good balance - Batch writes before syncing - Use LZ4 compression for text/JSON - NVMe SSD for storage

Expect: - 86K writes/sec with 5ms durability guarantee - 102 MiB/s with 1000-record batches - 52 MiB/s sequential read - <100ms recovery for multi-GB WALs

Limits: - Always fsync: ~400 writes/sec (disk bound) - Concurrent writes: Limited by single writer lock - Random reads: Not supported (by design)

See Tuning Guide for optimization strategies.