Performance Profiling¶

How to measure and analyze nori-wal performance in your application.

Table of contents¶

Quick Start¶

Add basic timing to your code:

use std::time::Instant;

// Measure write latency
let start = Instant::now();
wal.append(&record).await?;
let latency = start.elapsed();
println!("Write latency: {:?}", latency);

// Measure throughput
let start = Instant::now();
for i in 0..10_000 {
    wal.append(&Record::put(&i.to_le_bytes(), b"value")).await?;
}
let elapsed = start.elapsed();
println!("Throughput: {} writes/sec", 10_000.0 / elapsed.as_secs_f64());

Metrics to Track¶

Write Performance¶

Append latency (p50, p95, p99):

use hdrhistogram::Histogram;

let mut histogram = Histogram::<u64>::new(3)?;

for _ in 0..100_000 {
    let start = Instant::now();
    wal.append(&record).await?;
    histogram.record(start.elapsed().as_micros() as u64)?;
}

println!("p50: {}µs", histogram.value_at_quantile(0.50));
println!("p95: {}µs", histogram.value_at_quantile(0.95));
println!("p99: {}µs", histogram.value_at_quantile(0.99));

Typical values: - p50: 8-15µs (in-memory write) - p95: 20-50µs - p99: 100µs - 5ms (depends on fsync policy)

Fsync latency:

let start = Instant::now();
wal.sync().await?;
let fsync_latency = start.elapsed();
println!("Fsync latency: {:?}", fsync_latency);

Typical values: - Clean file: 10-50µs - 1 MB dirty: 1-2ms - 10 MB dirty: 5-10ms

Throughput (writes/sec):

let start = Instant::now();
let count = 10_000;

for i in 0..count {
    wal.append(&Record::put(&i.to_le_bytes(), b"value")).await?;
}

let throughput = count as f64 / start.elapsed().as_secs_f64();
println!("Throughput: {:.0} writes/sec", throughput);

Typical values: - With FsyncPolicy::Always: 400-500 writes/sec - With FsyncPolicy::Batch(5ms): 80-100K writes/sec - With FsyncPolicy::Os: 100-120K writes/sec

Read Performance¶

Sequential scan throughput:

let start = Instant::now();
let mut count = 0u64;
let mut bytes = 0u64;

let mut reader = wal.read_from(Position { segment_id: 0, offset: 0 }).await?;
while let Some((record, _)) = reader.next_record().await? {
    count += 1;
    bytes += record.key.len() as u64 + record.value.len() as u64;
}

let elapsed = start.elapsed();
let throughput_mbs = (bytes as f64 / 1_000_000.0) / elapsed.as_secs_f64();

println!("Read {} records in {:?}", count, elapsed);
println!("Throughput: {:.1} MB/s", throughput_mbs);

Typical values: - Sequential read: 50-100 MB/s (decode + CRC validation bottleneck) - Raw disk read: 2-7 GB/s (NVMe)

Recovery Performance¶

Recovery time:

let start = Instant::now();
let (wal, recovery_info) = Wal::open(config).await?;
let elapsed = start.elapsed();

println!("Recovery complete:");
println!("  Time: {:?}", elapsed);
println!("  Records: {}", recovery_info.valid_records);
println!("  Segments: {}", recovery_info.segments_scanned);
println!("  Throughput: {:.1} GiB/s",
    (recovery_info.bytes_scanned as f64 / 1_073_741_824.0) / elapsed.as_secs_f64()
);

Typical values: - 10 MB: 2-5ms (3.3 GiB/s) - 100 MB: 30ms - 1 GB: 300ms

Segment Operations¶

Rotation overhead:

// Track when rotation happens
let old_segment_id = wal.current_segment_id();

let start = Instant::now();
wal.append(&large_record).await?;  // Triggers rotation
let elapsed = start.elapsed();

if wal.current_segment_id() != old_segment_id {
    println!("Segment rotation took: {:?}", elapsed);
}

Typical values: - With pre-allocation: 10-15ms - Without pre-allocation: 2-3ms

Segment deletion:

let start = Instant::now();
wal.delete_segments_before(segment_id).await?;
let elapsed = start.elapsed();
println!("Deleted segments in {:?}", elapsed);

Profiling Tools¶

CPU Profiling¶

Use cargo flamegraph to identify hot paths:

cargo install flamegraph

# Profile your application
cargo flamegraph --bin your_app

# Open flamegraph.svg in browser

What to look for: - High time in write_all() → Disk bottleneck - High time in crc32() → Consider larger records (amortize CRC cost) - High time in compress() → Consider disabling compression

Memory Profiling¶

Use heaptrack or valgrind to track allocations:

# Install heaptrack
sudo apt install heaptrack

# Profile
heaptrack ./target/release/your_app

# Analyze
heaptrack_gui heaptrack.your_app.*.gz

What to look for: - Large allocations in hot path → Use object pools - Memory growth over time → Memory leak

Disk I/O Profiling¶

Linux (iostat)¶

iostat -x 1

# Look at:
# - %util: Disk utilization (>80% = bottleneck)
# - await: Average I/O latency (>10ms = slow)
# - w/s: Writes per second
# - wMB/s: Write throughput

macOS (fs_usage)¶

sudo fs_usage -f filesys -w

# Filter for your process:
sudo fs_usage -f filesys -w | grep your_app

Windows (Performance Monitor)¶

perfmon /res

# Add counters:
# - PhysicalDisk: Disk Writes/sec
# - PhysicalDisk: Avg. Disk Write Queue Length
# - PhysicalDisk: % Disk Time

Network I/O (Replication)¶

Linux (iftop)¶

sudo iftop -i eth0

# Shows:
# - Bandwidth usage per connection
# - Peak/avg throughput

Application-level¶

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

let bytes_sent = Arc::new(AtomicU64::new(0));

// In replication code:
bytes_sent.fetch_add(record.len() as u64, Ordering::Relaxed);

// Periodically report:
let sent = bytes_sent.swap(0, Ordering::Relaxed);
println!("Replication throughput: {} MB/s", sent / 1_000_000);

Benchmarking Framework¶

Create a reusable benchmarking harness:

use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
use nori_wal::*;

fn bench_append(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();

    let mut group = c.benchmark_group("append");

    for size in [100, 1024, 10_240] {
        group.bench_with_input(
            BenchmarkId::from_parameter(size),
            &size,
            |b, &size| {
                b.to_async(&rt).iter(|| async {
                    let config = WalConfig {
                        dir: PathBuf::from("/tmp/bench_wal"),
                        fsync_policy: FsyncPolicy::Os,
                        ..Default::default()
                    };

                    let (wal, _) = Wal::open(config).await.unwrap();
                    let record = Record::put(b"key", &vec![0u8; size]);

                    black_box(wal.append(&record).await.unwrap());
                });
            },
        );
    }

    group.finish();
}

criterion_group!(benches, bench_append);
criterion_main!(benches);

Run benchmarks:

cargo bench

# Results:
# append/100    9.8 µs
# append/1024   11.2 µs
# append/10240  45.3 µs

Production Monitoring¶

Prometheus Metrics¶

Export metrics for monitoring:

use prometheus::{Counter, Histogram, Gauge, register_counter, register_histogram, register_gauge};

lazy_static! {
    static ref APPEND_DURATION: Histogram = register_histogram!(
        "wal_append_duration_seconds",
        "Time to append a record"
    ).unwrap();

    static ref APPEND_TOTAL: Counter = register_counter!(
        "wal_append_total",
        "Total number of appends"
    ).unwrap();

    static ref FSYNC_DURATION: Histogram = register_histogram!(
        "wal_fsync_duration_seconds",
        "Time to fsync"
    ).unwrap();

    static ref ACTIVE_SEGMENT_SIZE: Gauge = register_gauge!(
        "wal_active_segment_bytes",
        "Size of active segment in bytes"
    ).unwrap();

    static ref SEGMENT_ROTATIONS: Counter = register_counter!(
        "wal_segment_rotations_total",
        "Total number of segment rotations"
    ).unwrap();
}

// In append:
let timer = APPEND_DURATION.start_timer();
wal.append(&record).await?;
timer.observe_duration();
APPEND_TOTAL.inc();

// In sync:
let timer = FSYNC_DURATION.start_timer();
wal.sync().await?;
timer.observe_duration();

// Periodically update:
ACTIVE_SEGMENT_SIZE.set(wal.current_segment_size() as f64);

OpenTelemetry Tracing¶

Add distributed tracing:

use tracing::{info_span, instrument};

#[instrument]
pub async fn append(&mut self, record: &Record) -> Result<()> {
    let span = info_span!("wal.append", record_size = record.len());
    let _enter = span.enter();

    // ... append logic ...

    Ok(())
}

View traces in Jaeger/Zipkin:

Request                             [==================] 45ms
  ├─ wal.append                    [====              ]  8ms
  │   ├─ serialize                 [==                ]  2ms
  │   ├─ write                     [==                ]  3ms
  │   └─ crc                       [=                 ]  1ms
  └─ wal.sync                      [              ====] 35ms
      └─ fsync                     [              ====] 35ms

Custom Dashboards¶

Track key metrics:

Grafana Dashboard:

{
  "panels": [
    {
      "title": "Write Throughput",
      "targets": [
        {
          "expr": "rate(wal_append_total[1m])"
        }
      ]
    },
    {
      "title": "Append Latency (p99)",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, wal_append_duration_seconds)"
        }
      ]
    },
    {
      "title": "Fsync Latency",
      "targets": [
        {
          "expr": "rate(wal_fsync_duration_seconds_sum[1m]) / rate(wal_fsync_duration_seconds_count[1m])"
        }
      ]
    },
    {
      "title": "Segment Size",
      "targets": [
        {
          "expr": "wal_active_segment_bytes"
        }
      ]
    }
  ]
}

Troubleshooting Performance Issues¶

Symptom: Low Throughput¶

Check:

Fsync policy:

println!("Fsync policy: {:?}", wal.config().fsync_policy);

Batching:

// Are you syncing after every write?
for record in records {
    wal.append(&record).await?;
    wal.sync().await?;  // ← This kills throughput
}

Disk speed:
```
iostat -x 1
# Look at %util and wMB/s
```

Fix: - Use FsyncPolicy::Batch(5ms) - Batch multiple appends before sync - Upgrade to faster disk

Symptom: High p99 Latency¶

Check:

Segment rotation:

// Add logging
println!("Segment rotation: {:?}", elapsed);

Fsync spikes:
```
# Linux
sudo iotop -ao
```
CPU throttling:
```
# Linux
cat /proc/cpuinfo | grep MHz
```

Fix: - Increase segment size - Monitor for GC pauses (if using JVM languages) - Check for thermal throttling

Symptom: Slow Recovery¶

Check:

Segment count:
```
ls -l /path/to/wal/ | wc -l
```

CRC validation time:

let start = Instant::now();
let (wal, info) = Wal::open(config).await?;
println!("Validated {} records in {:?}", info.valid_records, start.elapsed());

Fix: - Reduce number of segments (increase segment size) - Run compaction to merge segments - Use parallel recovery (if available)

Performance Testing Checklist¶

Before deploying:

Conclusion¶

Key metrics to monitor:

Append latency (p99): Should be <10ms
Throughput: Should match workload requirements
Fsync latency: Should be <5ms
Recovery time: Should be <1 second per GB

Use profiling tools to identify bottlenecks: - CPU: flamegraph - Memory: heaptrack - Disk: iostat - Application: Prometheus + Grafana

For more details, see: - Benchmarks - Performance measurements - Tuning Guide - Optimization strategies - Hardware - Hardware selection