Performance Profiling¶
How to measure and analyze nori-wal performance in your application.
Table of contents¶
Quick Start¶
Add basic timing to your code:
use std::time::Instant;
// Measure write latency
let start = Instant::now();
wal.append(&record).await?;
let latency = start.elapsed();
println!("Write latency: {:?}", latency);
// Measure throughput
let start = Instant::now();
for i in 0..10_000 {
wal.append(&Record::put(&i.to_le_bytes(), b"value")).await?;
}
let elapsed = start.elapsed();
println!("Throughput: {} writes/sec", 10_000.0 / elapsed.as_secs_f64());
Metrics to Track¶
Write Performance¶
Append latency (p50, p95, p99):
use hdrhistogram::Histogram;
let mut histogram = Histogram::<u64>::new(3)?;
for _ in 0..100_000 {
let start = Instant::now();
wal.append(&record).await?;
histogram.record(start.elapsed().as_micros() as u64)?;
}
println!("p50: {}µs", histogram.value_at_quantile(0.50));
println!("p95: {}µs", histogram.value_at_quantile(0.95));
println!("p99: {}µs", histogram.value_at_quantile(0.99));
Typical values: - p50: 8-15µs (in-memory write) - p95: 20-50µs - p99: 100µs - 5ms (depends on fsync policy)
Fsync latency:
let start = Instant::now();
wal.sync().await?;
let fsync_latency = start.elapsed();
println!("Fsync latency: {:?}", fsync_latency);
Typical values: - Clean file: 10-50µs - 1 MB dirty: 1-2ms - 10 MB dirty: 5-10ms
Throughput (writes/sec):
let start = Instant::now();
let count = 10_000;
for i in 0..count {
wal.append(&Record::put(&i.to_le_bytes(), b"value")).await?;
}
let throughput = count as f64 / start.elapsed().as_secs_f64();
println!("Throughput: {:.0} writes/sec", throughput);
Typical values:
- With FsyncPolicy::Always: 400-500 writes/sec
- With FsyncPolicy::Batch(5ms): 80-100K writes/sec
- With FsyncPolicy::Os: 100-120K writes/sec
Read Performance¶
Sequential scan throughput:
let start = Instant::now();
let mut count = 0u64;
let mut bytes = 0u64;
let mut reader = wal.read_from(Position { segment_id: 0, offset: 0 }).await?;
while let Some((record, _)) = reader.next_record().await? {
count += 1;
bytes += record.key.len() as u64 + record.value.len() as u64;
}
let elapsed = start.elapsed();
let throughput_mbs = (bytes as f64 / 1_000_000.0) / elapsed.as_secs_f64();
println!("Read {} records in {:?}", count, elapsed);
println!("Throughput: {:.1} MB/s", throughput_mbs);
Typical values: - Sequential read: 50-100 MB/s (decode + CRC validation bottleneck) - Raw disk read: 2-7 GB/s (NVMe)
Recovery Performance¶
Recovery time:
let start = Instant::now();
let (wal, recovery_info) = Wal::open(config).await?;
let elapsed = start.elapsed();
println!("Recovery complete:");
println!(" Time: {:?}", elapsed);
println!(" Records: {}", recovery_info.valid_records);
println!(" Segments: {}", recovery_info.segments_scanned);
println!(" Throughput: {:.1} GiB/s",
(recovery_info.bytes_scanned as f64 / 1_073_741_824.0) / elapsed.as_secs_f64()
);
Typical values: - 10 MB: 2-5ms (3.3 GiB/s) - 100 MB: 30ms - 1 GB: 300ms
Segment Operations¶
Rotation overhead:
// Track when rotation happens
let old_segment_id = wal.current_segment_id();
let start = Instant::now();
wal.append(&large_record).await?; // Triggers rotation
let elapsed = start.elapsed();
if wal.current_segment_id() != old_segment_id {
println!("Segment rotation took: {:?}", elapsed);
}
Typical values: - With pre-allocation: 10-15ms - Without pre-allocation: 2-3ms
Segment deletion:
let start = Instant::now();
wal.delete_segments_before(segment_id).await?;
let elapsed = start.elapsed();
println!("Deleted segments in {:?}", elapsed);
Profiling Tools¶
CPU Profiling¶
Use cargo flamegraph to identify hot paths:
cargo install flamegraph
# Profile your application
cargo flamegraph --bin your_app
# Open flamegraph.svg in browser
What to look for:
- High time in write_all() → Disk bottleneck
- High time in crc32() → Consider larger records (amortize CRC cost)
- High time in compress() → Consider disabling compression
Memory Profiling¶
Use heaptrack or valgrind to track allocations:
# Install heaptrack
sudo apt install heaptrack
# Profile
heaptrack ./target/release/your_app
# Analyze
heaptrack_gui heaptrack.your_app.*.gz
What to look for: - Large allocations in hot path → Use object pools - Memory growth over time → Memory leak
Disk I/O Profiling¶
Linux (iostat)¶
iostat -x 1
# Look at:
# - %util: Disk utilization (>80% = bottleneck)
# - await: Average I/O latency (>10ms = slow)
# - w/s: Writes per second
# - wMB/s: Write throughput
macOS (fs_usage)¶
Windows (Performance Monitor)¶
perfmon /res
# Add counters:
# - PhysicalDisk: Disk Writes/sec
# - PhysicalDisk: Avg. Disk Write Queue Length
# - PhysicalDisk: % Disk Time
Network I/O (Replication)¶
Linux (iftop)¶
Application-level¶
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
let bytes_sent = Arc::new(AtomicU64::new(0));
// In replication code:
bytes_sent.fetch_add(record.len() as u64, Ordering::Relaxed);
// Periodically report:
let sent = bytes_sent.swap(0, Ordering::Relaxed);
println!("Replication throughput: {} MB/s", sent / 1_000_000);
Benchmarking Framework¶
Create a reusable benchmarking harness:
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
use nori_wal::*;
fn bench_append(c: &mut Criterion) {
let rt = tokio::runtime::Runtime::new().unwrap();
let mut group = c.benchmark_group("append");
for size in [100, 1024, 10_240] {
group.bench_with_input(
BenchmarkId::from_parameter(size),
&size,
|b, &size| {
b.to_async(&rt).iter(|| async {
let config = WalConfig {
dir: PathBuf::from("/tmp/bench_wal"),
fsync_policy: FsyncPolicy::Os,
..Default::default()
};
let (wal, _) = Wal::open(config).await.unwrap();
let record = Record::put(b"key", &vec![0u8; size]);
black_box(wal.append(&record).await.unwrap());
});
},
);
}
group.finish();
}
criterion_group!(benches, bench_append);
criterion_main!(benches);
Run benchmarks:
Production Monitoring¶
Prometheus Metrics¶
Export metrics for monitoring:
use prometheus::{Counter, Histogram, Gauge, register_counter, register_histogram, register_gauge};
lazy_static! {
static ref APPEND_DURATION: Histogram = register_histogram!(
"wal_append_duration_seconds",
"Time to append a record"
).unwrap();
static ref APPEND_TOTAL: Counter = register_counter!(
"wal_append_total",
"Total number of appends"
).unwrap();
static ref FSYNC_DURATION: Histogram = register_histogram!(
"wal_fsync_duration_seconds",
"Time to fsync"
).unwrap();
static ref ACTIVE_SEGMENT_SIZE: Gauge = register_gauge!(
"wal_active_segment_bytes",
"Size of active segment in bytes"
).unwrap();
static ref SEGMENT_ROTATIONS: Counter = register_counter!(
"wal_segment_rotations_total",
"Total number of segment rotations"
).unwrap();
}
// In append:
let timer = APPEND_DURATION.start_timer();
wal.append(&record).await?;
timer.observe_duration();
APPEND_TOTAL.inc();
// In sync:
let timer = FSYNC_DURATION.start_timer();
wal.sync().await?;
timer.observe_duration();
// Periodically update:
ACTIVE_SEGMENT_SIZE.set(wal.current_segment_size() as f64);
OpenTelemetry Tracing¶
Add distributed tracing:
use tracing::{info_span, instrument};
#[instrument]
pub async fn append(&mut self, record: &Record) -> Result<()> {
let span = info_span!("wal.append", record_size = record.len());
let _enter = span.enter();
// ... append logic ...
Ok(())
}
View traces in Jaeger/Zipkin:
Request [==================] 45ms
├─ wal.append [==== ] 8ms
│ ├─ serialize [== ] 2ms
│ ├─ write [== ] 3ms
│ └─ crc [= ] 1ms
└─ wal.sync [ ====] 35ms
└─ fsync [ ====] 35ms
Custom Dashboards¶
Track key metrics:
Grafana Dashboard:
{
"panels": [
{
"title": "Write Throughput",
"targets": [
{
"expr": "rate(wal_append_total[1m])"
}
]
},
{
"title": "Append Latency (p99)",
"targets": [
{
"expr": "histogram_quantile(0.99, wal_append_duration_seconds)"
}
]
},
{
"title": "Fsync Latency",
"targets": [
{
"expr": "rate(wal_fsync_duration_seconds_sum[1m]) / rate(wal_fsync_duration_seconds_count[1m])"
}
]
},
{
"title": "Segment Size",
"targets": [
{
"expr": "wal_active_segment_bytes"
}
]
}
]
}
Troubleshooting Performance Issues¶
Symptom: Low Throughput¶
Check:
-
Fsync policy:
-
Batching:
-
Disk speed:
Fix:
- Use FsyncPolicy::Batch(5ms)
- Batch multiple appends before sync
- Upgrade to faster disk
Symptom: High p99 Latency¶
Check:
-
Segment rotation:
-
Fsync spikes:
-
CPU throttling:
Fix: - Increase segment size - Monitor for GC pauses (if using JVM languages) - Check for thermal throttling
Symptom: Slow Recovery¶
Check:
-
Segment count:
-
CRC validation time:
Fix: - Reduce number of segments (increase segment size) - Run compaction to merge segments - Use parallel recovery (if available)
Performance Testing Checklist¶
Before deploying:
- Benchmarked on target hardware
- Measured p99 latency under load
- Tested recovery time with realistic data size
- Profiled CPU usage (no hot spots >20%)
- Profiled memory usage (no leaks)
- Monitored disk I/O (utilization <80%)
- Tested segment rotation overhead (<10ms)
- Set up production monitoring (Prometheus/Grafana)
- Created alerts for slow operations
- Load tested for 24+ hours
Conclusion¶
Key metrics to monitor:
- Append latency (p99): Should be <10ms
- Throughput: Should match workload requirements
- Fsync latency: Should be <5ms
- Recovery time: Should be <1 second per GB
Use profiling tools to identify bottlenecks: - CPU: flamegraph - Memory: heaptrack - Disk: iostat - Application: Prometheus + Grafana
For more details, see: - Benchmarks - Performance measurements - Tuning Guide - Optimization strategies - Hardware - Hardware selection