Metrics Reference¶
Complete reference for all Prometheus metrics exposed by NoriKV.
Overview¶
NoriKV exposes metrics in Prometheus text format via the /metrics HTTP endpoint. Metrics are collected using the prometheus-client crate and implement the nori-observe::Meter trait for vendor-neutral instrumentation.
Metrics Categories: - KV Metrics - Client request counters and latencies - LSM Metrics - Storage engine performance - Raft Metrics - Consensus and replication - SWIM Metrics - Membership and failure detection - System Metrics - Resource usage and health
Metrics Architecture¶
Design Decisions¶
1. Vendor-Neutral Instrumentation
All components emit metrics via the nori_observe::Meter trait:
pub trait Meter: Send + Sync {
fn counter(&self, name: &'static str, labels: &'static [(&'static str, &'static str)])
-> Box<dyn Counter>;
fn gauge(&self, name: &'static str, labels: &'static [(&'static str, &'static str)])
-> Box<dyn Gauge>;
fn histo(&self, name: &'static str, buckets: &'static [f64], labels: &'static [(&'static str, &'static str)])
-> Box<dyn Histogram>;
}
Benefits: - Core crates (nori-lsm, nori-raft, nori-swim) have zero dependencies on Prometheus - Can swap backends (OTLP, StatsD, custom) without changing core code - Enables testing with mock meters
2. Prometheus Implementation
The server implements Meter using prometheus-client:
// apps/norikv-server/src/metrics.rs
pub struct PrometheusMeter {
registry: Arc<Mutex<Registry>>,
counters: Arc<Mutex<HashMap<String, Family<Vec<(String, String)>, PromCounter>>>>,
gauges: Arc<Mutex<HashMap<String, Family<Vec<(String, String)>, PromGauge>>>>,
histograms: Arc<Mutex<HashMap<String, Family<Vec<(String, String)>, PromHistogram>>>>,
}
Key Features:
- Thread-safe with Arc<Mutex<...>> for concurrent updates
- Metric families for automatic label handling
- Exponential histogram buckets (1.0, 2.0, 4.0, ..., 512.0) for latency
- Registry for text format export
3. Wiring
Metrics are wired at server startup:
// Create meter
let meter = Arc::new(PrometheusMeter::new());
// Wire to gRPC server
let grpc_server = GrpcServer::with_backend(addr, backend)
.with_meter(meter.clone() as Arc<dyn Meter>);
// Wire to HTTP server
let http_server = HttpServer::new(addr, health_checker, meter.clone());
4. Histogram Buckets
Exponential buckets optimized for latency distribution:
exponential_buckets(1.0, 2.0, 10)
// → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]
Rationale: - Covers range from 1ms (fast cache hit) to 512ms (slow compaction/network) - Good precision for typical latencies (1-20ms) - Standard Prometheus practice for latency metrics
KV Metrics¶
kv_requests_total¶
Type: Counter
Description: Total number of KV requests by operation type.
Labels:
- operation - Request type: put, get, delete, scan
Example:
kv_requests_total{operation="put"} 12345
kv_requests_total{operation="get"} 67890
kv_requests_total{operation="delete"} 123
kv_requests_total{operation="scan"} 456
Use Cases:
- Calculate request rate: rate(kv_requests_total[5m])
- Track operation mix: sum by (operation) (rate(kv_requests_total[1m]))
- Alert on zero traffic: rate(kv_requests_total[5m]) == 0
PromQL Examples:
# Total requests per second (all operations)
sum(rate(kv_requests_total[5m]))
# Requests per second by operation
rate(kv_requests_total{operation="put"}[5m])
# Operation distribution (%)
100 * rate(kv_requests_total[5m])
/ ignoring(operation) group_left
sum(rate(kv_requests_total[5m]))
kv_request_duration_ms¶
Type: Histogram
Description: Request latency in milliseconds, bucketed by operation, status, and result.
Labels:
- operation - Request type: put, get, delete, scan
- status - Request outcome: success, not_leader, error
- result - For GET: found, not_found (only for successful GETs)
Buckets: [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0] (ms)
Example:
kv_request_duration_ms_bucket{operation="put",status="success",le="1.0"} 5000
kv_request_duration_ms_bucket{operation="put",status="success",le="2.0"} 8000
kv_request_duration_ms_bucket{operation="put",status="success",le="4.0"} 10000
kv_request_duration_ms_bucket{operation="put",status="success",le="8.0"} 11500
kv_request_duration_ms_bucket{operation="put",status="success",le="16.0"} 12000
kv_request_duration_ms_bucket{operation="put",status="success",le="+Inf"} 12345
kv_request_duration_ms_sum{operation="put",status="success"} 123450.5
kv_request_duration_ms_count{operation="put",status="success"} 12345
Use Cases:
- Calculate p95 latency: histogram_quantile(0.95, rate(kv_request_duration_ms_bucket[5m]))
- Alert on high latency: histogram_quantile(0.95, ...) > 50
- Track cache hit latency: filter by result="found"
PromQL Examples:
# P95 PUT latency
histogram_quantile(0.95,
rate(kv_request_duration_ms_bucket{operation="put"}[5m])
)
# P99 GET latency (found vs not found)
histogram_quantile(0.99,
rate(kv_request_duration_ms_bucket{operation="get",result="found"}[5m])
)
# Average latency across all operations
rate(kv_request_duration_ms_sum[5m])
/ rate(kv_request_duration_ms_count[5m])
# Slow requests (>100ms)
sum(rate(kv_request_duration_ms_bucket{le="128.0"}[5m]))
- sum(rate(kv_request_duration_ms_bucket{le="64.0"}[5m]))
Grafana Dashboard:
{
"targets": [
{
"expr": "histogram_quantile(0.50, rate(kv_request_duration_ms_bucket{operation=\"put\"}[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(kv_request_duration_ms_bucket{operation=\"put\"}[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(kv_request_duration_ms_bucket{operation=\"put\"}[5m]))",
"legendFormat": "p99"
}
]
}
LSM Metrics¶
lsm_compaction_duration_ms¶
Type: Histogram
Description: Time spent in compaction, in milliseconds.
Labels:
- level - Source level: L0, L1, L2, etc.
- result - Compaction outcome: success, error
Buckets: Exponential from 1ms to 10 seconds
Use Cases:
- Detect compaction storms: rate(lsm_compaction_duration_ms_count[1m]) > 10
- Alert on slow compaction: histogram_quantile(0.95, ...) > 5000
- Track compaction efficiency by level
PromQL Examples:
# Compactions per minute
rate(lsm_compaction_duration_ms_count[1m]) * 60
# Average compaction time
rate(lsm_compaction_duration_ms_sum[5m])
/ rate(lsm_compaction_duration_ms_count[5m])
# P99 L0 compaction time
histogram_quantile(0.99,
rate(lsm_compaction_duration_ms_bucket{level="L0"}[5m])
)
lsm_memtable_size_bytes¶
Type: Gauge
Description: Current memtable size in bytes.
Labels:
- shard_id - Shard ID
Use Cases: - Monitor memory usage - Detect memtable bloat - Tune flush threshold
PromQL Examples:
# Total memtable memory across all shards
sum(lsm_memtable_size_bytes)
# Largest memtable
max(lsm_memtable_size_bytes)
# Alert on memtable >64MB (should flush at 4MB)
lsm_memtable_size_bytes > 67108864
lsm_sstable_count¶
Type: Gauge
Description: Number of SSTables per level.
Labels:
- level - Level: L0, L1, L2, etc.
- shard_id - Shard ID
Use Cases: - Monitor compaction health - Detect L0 storms (too many L0 SSTables) - Capacity planning
PromQL Examples:
# L0 SSTable count
sum(lsm_sstable_count{level="L0"})
# Alert on L0 storm (>10 files triggers compaction throttling)
lsm_sstable_count{level="L0"} > 10
# Total SSTables across all levels
sum(lsm_sstable_count)
Raft Metrics¶
raft_leader_changes_total¶
Type: Counter
Description: Total number of leader changes (elections).
Labels:
- shard_id - Shard ID
Use Cases: - Detect cluster instability - Alert on frequent elections - Track split-brain scenarios
PromQL Examples:
# Leader changes per hour
rate(raft_leader_changes_total[1h]) * 3600
# Alert on frequent elections (>2 per hour)
rate(raft_leader_changes_total[1h]) * 3600 > 2
raft_commit_index¶
Type: Gauge
Description: Current committed log index.
Labels:
- shard_id - Shard ID
- node_id - Node ID
Use Cases: - Monitor replication lag - Detect stalled followers - Track write throughput
PromQL Examples:
# Replication lag (leader vs follower)
max(raft_commit_index{shard_id="0"})
- raft_commit_index{shard_id="0"}
# Commit rate (writes per second)
rate(raft_commit_index[1m])
raft_term¶
Type: Gauge
Description: Current Raft term.
Labels:
- shard_id - Shard ID
Use Cases: - Track election history - Detect term drift
PromQL Examples:
SWIM Metrics¶
swim_cluster_size¶
Type: Gauge
Description: Number of alive members in the cluster.
Labels: None
Use Cases: - Monitor cluster membership - Alert on node failures - Capacity tracking
PromQL Examples:
swim_failed_members_total¶
Type: Counter
Description: Total number of members marked as failed.
Labels: None
Use Cases: - Track failure rate - Alert on frequent failures - SLA reporting
PromQL Examples:
# Failures per hour
rate(swim_failed_members_total[1h]) * 3600
# Alert on >1 failure per hour
rate(swim_failed_members_total[1h]) * 3600 > 1
swim_gossip_latency_ms¶
Type: Histogram
Description: Gossip message round-trip time in milliseconds.
Labels: None
Buckets: Exponential from 1ms to 1s
Use Cases: - Monitor network health - Detect network degradation - Tune gossip interval
PromQL Examples:
# P95 gossip latency
histogram_quantile(0.95,
rate(swim_gossip_latency_ms_bucket[5m])
)
# Alert on high gossip latency (>100ms indicates network issues)
histogram_quantile(0.95, rate(swim_gossip_latency_ms_bucket[5m])) > 100
System Metrics¶
process_cpu_seconds_total¶
Type: Counter
Description: Total CPU time consumed by the process (from Prometheus client library).
Labels: None
PromQL Examples:
process_resident_memory_bytes¶
Type: Gauge
Description: Resident memory size in bytes (from Prometheus client library).
Labels: None
PromQL Examples:
Alerting Rules¶
Critical Alerts¶
groups:
- name: norikv_critical
rules:
# Server down
- alert: NoriKVDown
expr: up{job="norikv"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "NoriKV instance {{ $labels.instance }} is down"
# High error rate
- alert: NoriKVHighErrorRate
expr: |
rate(kv_requests_total{status="error"}[5m])
/ rate(kv_requests_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "NoriKV error rate >5% on {{ $labels.instance }}"
# Cluster size drop
- alert: NoriKVClusterShrunk
expr: swim_cluster_size < 3
for: 5m
labels:
severity: critical
annotations:
summary: "NoriKV cluster size dropped to {{ $value }}"
Warning Alerts¶
groups:
- name: norikv_warning
rules:
# High latency
- alert: NoriKVHighLatency
expr: |
histogram_quantile(0.95,
rate(kv_request_duration_ms_bucket{operation="put"}[5m])
) > 50
for: 5m
labels:
severity: warning
annotations:
summary: "NoriKV p95 PUT latency >50ms on {{ $labels.instance }}"
# Frequent leader elections
- alert: NoriKVFrequentElections
expr: rate(raft_leader_changes_total[1h]) * 3600 > 2
for: 10m
labels:
severity: warning
annotations:
summary: "NoriKV shard {{ $labels.shard_id }} having frequent elections"
# L0 storm
- alert: NoriKVL0Storm
expr: lsm_sstable_count{level="L0"} > 10
for: 5m
labels:
severity: warning
annotations:
summary: "NoriKV shard {{ $labels.shard_id }} has {{ $value }} L0 SSTables"
Grafana Dashboards¶
KV Request Dashboard¶
{
"dashboard": {
"title": "NoriKV - KV Requests",
"rows": [
{
"title": "Request Rate",
"panels": [
{
"type": "graph",
"targets": [
{
"expr": "sum(rate(kv_requests_total[5m])) by (operation)",
"legendFormat": "{{ operation }}"
}
]
}
]
},
{
"title": "Latency Percentiles",
"panels": [
{
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(kv_request_duration_ms_bucket[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(kv_request_duration_ms_bucket[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(kv_request_duration_ms_bucket[5m]))",
"legendFormat": "p99"
}
]
}
]
},
{
"title": "Error Rate",
"panels": [
{
"type": "graph",
"targets": [
{
"expr": "rate(kv_requests_total{status=\"error\"}[5m]) / rate(kv_requests_total[5m])",
"legendFormat": "error_rate"
}
]
}
]
}
]
}
}
Performance Considerations¶
Metric Collection Overhead¶
Counter increment: ~80ns - Lock-free atomic operations - No allocations - Negligible impact on hot path
Histogram observation: ~200ns - Binary search for bucket - Atomic increment - Acceptable for request handlers
Registry export (GET /metrics): ~1-5ms - Iterates all metrics - Formats as text - Called infrequently (15s scrape interval)
Best Practices¶
- Use histograms for latencies - Not averages (lose distribution)
- Scrape every 15s - Good balance of resolution and overhead
- Limit cardinality - Avoid high-cardinality labels (user IDs, keys)
- Pre-allocate labels - Use
&'static [(&'static str, &'static str)] - Export at /metrics - Standard Prometheus convention
Troubleshooting¶
Missing Metrics¶
Problem: Expected metrics not appearing
Debug:
# Check if server started with meter
journalctl -u norikv-server | grep "Enabling KV metrics"
# Verify meter is wired
curl http://localhost:8080/metrics | grep kv_requests_total
Solution: Ensure meter is passed to both GrpcServer and HttpServer
Stale Metrics¶
Problem: Metrics not updating
Debug:
# Send a request
grpcurl -plaintext localhost:6000 norikv.Kv/Put \
-d '{"key":"dGVzdA==","value":"dmFsdWU="}'
# Check metric immediately
curl -s http://localhost:8080/metrics | grep kv_requests_total
Solution: Check if meter is being used in KvService handlers
Next Steps¶
- REST API - HTTP endpoints for metrics export
- Configuration - Tune metric collection
- Deployment - Prometheus/Grafana setup