Operations¶
Production deployment, monitoring, and maintenance of NoriKV clusters.
Overview¶
This section covers operational aspects of running NoriKV in production:
- REST API - HTTP endpoints for health checks and metrics
- Metrics - Prometheus metrics reference and monitoring
- Configuration - Server configuration options
- Deployment - Deployment topologies and best practices
- Troubleshooting - Common issues and solutions
Quick Start¶
Running a Single Node¶
# Start server with default configuration
norikv-server --node-id node0 \
--rpc-addr 127.0.0.1:6000 \
--http-addr 127.0.0.1:8080 \
--data-dir /var/lib/norikv
Health Check¶
# Quick health check (for load balancers)
curl http://localhost:8080/health/quick
# Comprehensive health status
curl http://localhost:8080/health
Metrics Collection¶
Production Checklist¶
Before Deployment¶
- Configure data directory with sufficient disk space
- Set up Prometheus/Grafana for metrics collection
- Configure health check endpoints in load balancer
- Review and tune LSM compaction settings
- Configure Raft timeouts for your network latency
- Set up log aggregation (structured JSON logs)
Monitoring¶
- Track
kv_requests_total- Request volume per operation - Monitor
kv_request_duration_ms- Latency percentiles - Watch
lsm_compaction_duration_ms- Compaction performance - Alert on
raft_leader_changes_total- Cluster stability - Check disk space and WAL/SSTable growth
High Availability¶
- Deploy at least 3 nodes for fault tolerance
- Configure replication factor ≥3
- Distribute nodes across availability zones
- Set up automated backups (snapshots)
- Test failure scenarios (node crash, network partition)
Architecture Integration¶
Server Components¶
┌─────────────────────────────────────────┐
│ HTTP Server (Axum) │
│ - GET /health/quick │
│ - GET /health │
│ - GET /metrics (Prometheus) │
└──────────────────┬──────────────────────┘
│
┌──────────────────▼──────────────────────┐
│ gRPC Server (Tonic) │
│ - KV Service (Put/Get/Delete) │
│ - Meta Service (WatchCluster) │
│ - Admin Service (SnapshotShard) │
│ - Raft Service (peer-to-peer RPC) │
└──────────────────┬──────────────────────┘
│
┌──────────────────▼──────────────────────┐
│ Multi-Shard Backend │
│ - Routes requests to correct shard │
│ - 1024 virtual shards │
└──────────────────┬──────────────────────┘
│
┌──────────┴──────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Shard 0 │ │ Shard N │
│ - Raft │ │ - Raft │
│ - LSM │ │ - LSM │
└───────────────┘ └───────────────┘
Key Design Decisions¶
- Separate HTTP and gRPC servers
- HTTP for ops/monitoring (health, metrics)
-
gRPC for client API (high performance, streaming)
-
Lazy shard initialization
- Shards created on first access
-
Reduces startup time and memory usage
-
Prometheus pull model
- Standard /metrics endpoint
- No agent required
-
Works with Kubernetes service discovery
-
SWIM-based topology tracking
- Gossip protocol for scalability
- Automatic failure detection
- Integrates with ClusterView for client routing
Next Steps¶
- REST API Reference - HTTP endpoints and usage
- Metrics Guide - All metrics explained
- Configuration - Tuning parameters
- Deployment - Production deployment patterns