REST API Reference¶

HTTP endpoints for health checks, metrics, and cluster management.

Overview¶

NoriKV exposes a REST API via HTTP for operational tasks:

- Health checks - Monitor server and shard health
- Metrics - Prometheus-format metrics
- Admin operations - Cluster management (future)

The HTTP server runs on a separate port from the gRPC API, allowing you to:

- Isolate operational traffic from client traffic
- Configure different firewall rules
- Use load balancer health checks without authentication

Base Configuration¶

# Start server with HTTP API on port 8080
norikv-server \
  --http-addr 0.0.0.0:8080 \
  --rpc-addr 0.0.0.0:6000

Default ports:

- HTTP REST API: 8080
- gRPC client API: 6000

Endpoints¶

Health Checks¶

Quick Health Check¶

Endpoint: GET /health/quick

Purpose: Fast health check for load balancers and uptime monitors.

Response:

- 200 OK - Server is healthy (shard 0 is responsive)
- 503 Service Unavailable - Server is unhealthy

Response Body: Plain text OK or UNAVAILABLE

Performance:

- Latency: <1ms (checks only shard 0)
- No disk I/O
- Minimal CPU overhead

Use Cases:

- Kubernetes liveness/readiness probes
- Load balancer health checks
- Uptime monitoring services

Example:

# Check if server is healthy
curl -i http://localhost:8080/health/quick
# HTTP/1.1 200 OK
# OK

# Use in scripts
if curl -sf http://localhost:8080/health/quick > /dev/null; then
  echo "Server is healthy"
else
  echo "Server is down"
  exit 1
fi
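
For startup scripts that need to wait for the server rather than fail on the first probe, the same check can be wrapped in a bounded retry loop. The following is a minimal sketch, not part of NoriKV: the helper names `quick_health_ok` and `wait_until_healthy`, the retry count, and the delay are all assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def quick_health_ok(base_url, timeout=1.0):
    """Return True only when GET /health/quick answers 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health/quick", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers 503 responses (raised as HTTPError), refused connections and timeouts
        return False


def wait_until_healthy(check, attempts=30, delay=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times; return True at the first success."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final failed attempt
    return False
```

A typical call would be `wait_until_healthy(lambda: quick_health_ok("http://localhost:8080"))`. Passing the check as a callable keeps the retry logic independent of the HTTP client.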

Kubernetes Example:

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: norikv
    livenessProbe:
      httpGet:
        path: /health/quick
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health/quick
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 5

Comprehensive Health Status¶

Endpoint: GET /health

Purpose: Detailed health status for all shards and components.

Response: JSON object with server and shard-level health

Response Format:

{
  "status": "healthy",
  "node_id": "node0",
  "uptime_seconds": 3600,
  "total_shards": 1024,
  "active_shards": 42,
  "shards": [
    {
      "id": 0,
      "status": "healthy",
      "is_leader": true,
      "raft_term": 5,
      "commit_index": 12345,
      "last_applied": 12345
    },
    {
      "id": 1,
      "status": "healthy",
      "is_leader": false,
      "raft_term": 5,
      "commit_index": 8901,
      "last_applied": 8901
    }
  ]
}

Fields:

| Field | Type | Description |
| --- | --- | --- |
| `status` | string | Overall status: `healthy`, `degraded`, `unhealthy` |
| `node_id` | string | Server node ID |
| `uptime_seconds` | number | Seconds since server start |
| `total_shards` | number | Total virtual shards (1024) |
| `active_shards` | number | Number of shards currently active on this node |
| `shards` | array | Health status for each active shard |

Shard Health Fields:

| Field | Type | Description |
| --- | --- | --- |
| `id` | number | Shard ID (0-1023) |
| `status` | string | Shard status: `healthy`, `degraded`, `unavailable` |
| `is_leader` | boolean | True if this node is the Raft leader for this shard |
| `raft_term` | number | Current Raft term |
| `commit_index` | number | Last committed log index |
| `last_applied` | number | Last applied log index to LSM |
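
Because `commit_index` and `last_applied` are both reported per shard, their difference gives a rough measure of how far the LSM apply step trails the Raft log. The sketch below works on an already-parsed `/health` payload. The function names and the sample values are hypothetical, and it assumes the payload has the shape documented above.

```python
def shards_behind(health, max_lag=0):
    """Return {shard_id: lag} for shards whose apply lag exceeds max_lag."""
    lagging = {}
    for shard in health.get("shards", []):
        # Entries committed by Raft but not yet applied to the LSM
        lag = shard["commit_index"] - shard["last_applied"]
        if lag > max_lag:
            lagging[shard["id"]] = lag
    return lagging


def leader_count(health):
    """Count the shards for which this node is the Raft leader."""
    return sum(1 for shard in health.get("shards", []) if shard["is_leader"])


# Hypothetical payload for illustration only
sample = {
    "shards": [
        {"id": 0, "is_leader": True, "commit_index": 12345, "last_applied": 12345},
        {"id": 1, "is_leader": False, "commit_index": 8901, "last_applied": 8890},
    ]
}
print(shards_behind(sample), leader_count(sample))
```

A small transient lag is normal under write load, so a threshold (`max_lag`) is usually more useful than alerting on any non-zero value.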

Status Values:

- healthy - All shards operational, leader elected
- degraded - Some shards not responding or no leader
- unhealthy - Critical failure, server not operational
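
Monitoring systems that run check scripts often expect the Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). One possible mapping from the three overall status values is sketched below. The mapping itself is an assumption made for illustration, not something NoriKV defines.

```python
import json

# Assumed mapping from the overall status to Nagios-style exit codes
EXIT_CODES = {"healthy": 0, "degraded": 1, "unhealthy": 2}
UNKNOWN = 3


def exit_code_for(payload_text):
    """Translate the body of GET /health into a check exit code."""
    try:
        status = json.loads(payload_text)["status"]
        return EXIT_CODES.get(status, UNKNOWN)
    except (ValueError, KeyError, TypeError):
        # Not JSON, no "status" field, or an unexpected payload shape
        return UNKNOWN
```

A wrapper script could pass the response body to `exit_code_for` and hand the result to `sys.exit`.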

Performance:

- Latency: ~5-10ms (queries all active shards)
- May involve I/O if checking LSM state
- Use sparingly in high-frequency monitoring

Use Cases:

- Debugging cluster issues
- Capacity planning (active shard count)
- Raft state inspection
- Detailed dashboards

Example:

# Get detailed health status
curl http://localhost:8080/health | jq

# Check if server is leader for shard 42
curl -s http://localhost:8080/health | \
  jq '.shards[] | select(.id == 42) | .is_leader'

# Count healthy shards
curl -s http://localhost:8080/health | \
  jq '[.shards[] | select(.status == "healthy")] | length'

Monitoring Alert Example:

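import requests


def alert(message):
    # Placeholder: replace with your paging or chat integration
    print(f"ALERT: {message}")


try:
    response = requests.get('http://localhost:8080/health', timeout=5)
    response.raise_for_status()
    health = response.json()
except (requests.RequestException, ValueError) as exc:
    # Unreachable server, non-2xx status, or a non-JSON body
    alert(f"NoriKV health check failed: {exc}")
else:
    if health['status'] != 'healthy':
        alert(f"NoriKV unhealthy: {health['status']}")

    unhealthy_shards = [
        s for s in health['shards']
        if s['status'] != 'healthy'
    ]
    if unhealthy_shards:
        alert(f"Unhealthy shards: {[s['id'] for s in unhealthy_shards]}")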

Metrics¶

Prometheus Metrics Export¶

Endpoint: GET /metrics

Purpose: Export metrics in Prometheus text format for scraping.

Response: Plain text, Prometheus exposition format version 0.0.4

Content-Type: text/plain; version=0.0.4

Example Response:

# HELP kv_requests_total Total number of KV requests
# TYPE kv_requests_total counter
kv_requests_total{operation="put"} 12345
kv_requests_total{operation="get"} 67890
kv_requests_total{operation="delete"} 123

# HELP kv_request_duration_ms Request duration in milliseconds
# TYPE kv_request_duration_ms histogram
kv_request_duration_ms_bucket{operation="put",status="success",le="1.0"} 5000
kv_request_duration_ms_bucket{operation="put",status="success",le="2.0"} 8000
kv_request_duration_ms_bucket{operation="put",status="success",le="4.0"} 10000
kv_request_duration_ms_bucket{operation="put",status="success",le="+Inf"} 12345
kv_request_duration_ms_sum{operation="put",status="success"} 123450.5
kv_request_duration_ms_count{operation="put",status="success"} 12345
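
For ad-hoc scripts, the sample lines in this format can be pulled apart with a couple of regular expressions. The following is a minimal sketch rather than a complete parser: it skips comment lines, ignores optional timestamps, and does not unescape label values. `parse_samples` is a hypothetical helper name. For anything beyond quick inspection, the `prometheus_client` package's `text_string_to_metric_families` parser is the more robust choice.

```python
import re

# metric_name{labels} value   (the labels part is optional)
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, one per sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each histogram series (`_bucket`, `_sum`, `_count`) comes back as its own sample, with the bucket boundary available as the `le` label.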

Metrics Exposed:

See Metrics Reference for complete list.

Key Metrics:

- kv_requests_total - Request counters by operation
- kv_request_duration_ms - Latency histograms
- lsm_* - LSM engine metrics (compaction, memtable)
- raft_* - Raft consensus metrics (elections, commits)
- swim_* - Membership metrics (cluster size, failures)

Prometheus Scrape Configuration:

scrape_configs:
  - job_name: 'norikv'
    static_configs:
      - targets: ['localhost:8080']
    scrape_interval: 15s
    metrics_path: /metrics

Kubernetes ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: norikv
spec:
  selector:
    matchLabels:
      app: norikv
  endpoints:
  - port: http
    path: /metrics
    interval: 15s

Use Cases:

- Prometheus/Grafana dashboards
- Alerting on latency/error rates
- Capacity planning
- Performance debugging

Example Queries:

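# Request rate (per second)
rate(kv_requests_total[5m])

# P95 latency per operation (aggregate buckets by le before taking the quantile)
histogram_quantile(0.95,
  sum by (le, operation) (rate(kv_request_duration_ms_bucket[5m]))
)

# Error ratio (requires a status label on kv_requests_total).
# Both sides are summed so the division is not matched series-by-series,
# which would only pair error series with themselves and always yield 1.
sum(rate(kv_requests_total{status="error"}[5m]))
  / sum(rate(kv_requests_total[5m]))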
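
To see what `histogram_quantile` does with the cumulative buckets, the estimate can be reproduced by hand: find the bucket that contains the target rank, then interpolate linearly inside it. The sketch below mirrors Prometheus's documented behaviour on raw counts (Prometheus applies the same arithmetic to per-second rates). The Python function is an illustration, not NoriKV or Prometheus code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the first bucket starts at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies in the overflow bucket: report the highest finite bound
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - lower_count
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count


# Bucket counts taken from the example /metrics response
put_buckets = [(1.0, 5000), (2.0, 8000), (4.0, 10000), (math.inf, 12345)]
print(histogram_quantile(0.50, put_buckets))
print(histogram_quantile(0.95, put_buckets))
```

With the example counts, the median lands inside the 1.0 to 2.0 ms bucket (about 1.39 ms), while P95 is reported as 4.0 because 2345 of the 12345 observations fall in the `+Inf` bucket. A quantile pinned to the largest finite bound is a sign that the bucket layout needs higher boundaries.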

HTTP Server Architecture¶

Technology Stack¶

Framework: Axum - Ergonomic async web framework
Runtime: Tokio async runtime
Serialization: Serde JSON for structured responses
Metrics: prometheus-client for Prometheus export

Code Organization¶

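// apps/norikv-server/src/http.rs

use std::net::SocketAddr;
use std::sync::Arc;
use tokio::task::JoinHandle;

pub struct HttpServer {
    // Address the HTTP listener binds to (--http-addr)
    addr: SocketAddr,
    state: HttpServerState,
    // Sender half of the graceful-shutdown signal
    shutdown_tx: Option<tokio::sync::oneshot::Sender<()>>,
    // Handle to the spawned server task, awaited during shutdown
    server_handle: Option<JoinHandle<Result<(), std::io::Error>>>,
}

// Cheap to clone: both fields are Arc pointers shared with the rest of the node
#[derive(Clone)]
pub struct HttpServerState {
    health_checker: Arc<HealthChecker>,
    meter: Arc<PrometheusMeter>,
}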

Key Design Decisions:

- Separate server from gRPC
  - Allows different ports/firewall rules
  - No authentication required for ops endpoints
  - Can disable in production if not needed
- Axum router
  - Type-safe route handlers
  - Extractors for state/headers
  - Graceful shutdown support
- Shared state via Arc
  - Health checker shared with Node
  - Prometheus meter shared with gRPC server
  - No data duplication
- Async handlers
  - Non-blocking health checks
  - Concurrent metric collection
  - Efficient resource usage

Lifecycle¶

Node::start()
  ↓
Create HttpServer { state: { health_checker, meter } }
  ↓
Build Axum router with routes
  ↓
Bind TcpListener on http_addr
  ↓
Spawn server task with graceful shutdown
  ↓
Server running (handle requests)
  ↓
Node::shutdown()
  ↓
Send shutdown signal
  ↓
Await server task completion
  ↓
Server stopped
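
The same bind, spawn, signal, await sequence can be imitated in a few lines of Python, which is also handy as a local stand-in when testing probes and monitoring scripts without a running NoriKV node. This is an analogy built on the standard library's `http.server` and a thread, not the Axum/Tokio implementation. `OpsServer`, `QuickHealthHandler`, and `demo` are hypothetical names.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class QuickHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        found = self.path == "/health/quick"
        body = b"OK" if found else b"not found"
        self.send_response(200 if found else 404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


class OpsServer:
    def __init__(self, host="127.0.0.1", port=0):
        # Bind the listener (port 0 lets the OS pick a free port)
        self._httpd = ThreadingHTTPServer((host, port), QuickHealthHandler)
        self._thread = threading.Thread(target=self._httpd.serve_forever, daemon=True)

    @property
    def port(self):
        return self._httpd.server_address[1]

    def start(self):
        self._thread.start()  # spawn the server task

    def shutdown(self):
        self._httpd.shutdown()      # send the shutdown signal
        self._thread.join()         # await server task completion
        self._httpd.server_close()  # release the socket


def demo():
    server = OpsServer()
    server.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.port, timeout=2)
        conn.request("GET", "/health/quick")
        resp = conn.getresponse()
        result = (resp.status, resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
```

Calling `demo()` starts the stand-in, performs one probe, and shuts it down again in the same order as the lifecycle above.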

Error Handling¶

HTTP Status Codes¶

| Status | When | Example |
| --- | --- | --- |
| 200 OK | Request succeeded | Health check passed, metrics returned |
| 500 Internal Server Error | Handler error | Health check failed to query shard |
| 503 Service Unavailable | Server unhealthy | Quick health check: server not ready |

Error Response Format¶

{
  "error": "Internal error: failed to query shard 0"
}
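
On the client side, the three status codes and the error body can be folded into a single result. The sketch below is one way a monitoring script might do that. `describe_response` is a hypothetical helper, and it falls back to the raw body in case the server answers a 500 with plain text rather than JSON.

```python
import json


def describe_response(status_code, body):
    """Map a status code and text body from the ops API to (ok, message)."""
    if status_code == 200:
        return True, "ok"
    if status_code == 503:
        return False, "server unhealthy"
    if status_code == 500:
        try:
            detail = json.loads(body)["error"]
        except (ValueError, KeyError, TypeError):
            # Body was not the documented JSON object; use it verbatim
            detail = body.strip() or "no detail"
        return False, f"handler error: {detail}"
    return False, f"unexpected status {status_code}"
```

Treating 500 and 503 separately is useful in alerts: a 503 means the node itself reports being unhealthy, while a 500 means the health handler could not complete its check.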

Handler Error Conversion:

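use axum::http::StatusCode;
use axum::response::{IntoResponse, Response};
use axum::Json;
use serde_json::json;

// Wraps any handler error so it can be turned into an HTTP response
struct AppError(anyhow::Error);

impl IntoResponse for AppError {
    fn into_response(self) -> Response {
        tracing::error!("Handler error: {:?}", self.0);
        (
            StatusCode::INTERNAL_SERVER_ERROR,
            // JSON body matching the error response format above
            Json(json!({ "error": format!("Internal error: {}", self.0) })),
        ).into_response()
    }
}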

Best Practices¶

Load Balancer Integration¶

Use /health/quick for load balancer health checks:

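# Nginx upstream. Open-source Nginx only does passive health checks:
# max_fails/fail_timeout take a server out of rotation after failed proxied
# requests. The location below exposes /health/quick through the proxy.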
upstream norikv {
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    location /health/quick {
        proxy_pass http://norikv;
        proxy_connect_timeout 1s;
        proxy_read_timeout 1s;
    }
}

HAProxy:

backend norikv
    balance roundrobin
    option httpchk GET /health/quick
    http-check expect status 200
    server node0 10.0.1.10:8080 check inter 5s
    server node1 10.0.1.11:8080 check inter 5s
    server node2 10.0.1.12:8080 check inter 5s

Monitoring Setup¶

Scrape metrics every 15 seconds:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'norikv'
    static_configs:
      - targets:
        - '10.0.1.10:8080'
        - '10.0.1.11:8080'
        - '10.0.1.12:8080'
    scrape_interval: 15s
    scrape_timeout: 10s

Alert on unhealthy status:

# alerts.yml
groups:
- name: norikv
  rules:
  - alert: NoriKVUnhealthy
    expr: up{job="norikv"} == 0
    for: 1m
    annotations:
      summary: "NoriKV instance {{ $labels.instance }} is down"

Security Considerations¶

Firewall rules:

# Allow HTTP from Prometheus/monitoring only
iptables -A INPUT -p tcp --dport 8080 -s 10.0.2.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j DROP

# Allow gRPC from clients
iptables -A INPUT -p tcp --dport 6000 -j ACCEPT

Kubernetes Network Policy:

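apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: norikv-http
spec:
  podSelector:
    matchLabels:
      app: norikv
  policyTypes:
  - Ingress
  ingress:
  # HTTP API: monitoring namespace only
  # (the namespace must carry the label name=monitoring)
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080
  # gRPC: allow clients from anywhere. Without this rule the policy
  # would also block all traffic to port 6000 on the selected pods.
  - ports:
    - protocol: TCP
      port: 6000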

Troubleshooting¶

Server won't start¶

Error: Failed to bind: Address already in use

Solution: Check if another process is using port 8080:

lsof -i :8080
# Or change port
norikv-server --http-addr 0.0.0.0:9090

Health check always returns 503¶

Symptoms: /health/quick returns UNAVAILABLE

Debugging:

# Check detailed health status
curl http://localhost:8080/health | jq

# Check if shard 0 is healthy
curl -s http://localhost:8080/health | \
  jq '.shards[] | select(.id == 0)'

Common causes:

- Shard 0 not yet created (server starting up)
- Raft leader not elected (multi-node: wait for quorum)
- LSM engine not initialized
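
The manual checks above can be scripted against a parsed `/health` payload. The following sketch only inspects fields documented for that endpoint. The function name and the wording of the hints are assumptions, and it cannot tell whether a leader exists on another node, since the payload only reports whether this node is the leader.

```python
def explain_quick_failure(health):
    """Return a short hint about why /health/quick might be failing."""
    shard0 = next((s for s in health.get("shards", []) if s["id"] == 0), None)
    if shard0 is None:
        return "shard 0 is not in the shards list (possibly still starting up)"
    if shard0["status"] != "healthy":
        return f"shard 0 reports status {shard0['status']!r}"
    if shard0["last_applied"] < shard0["commit_index"]:
        return "shard 0 is healthy but still applying committed entries"
    return "shard 0 looks healthy; retry /health/quick"
```

Running it in a loop during startup shows which stage the node is stuck in, which narrows the search before reading server logs.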

Metrics endpoint empty¶

Symptoms: /metrics returns no metrics

Debugging:

# Check if meter is wired to gRPC server
# Look for log line: "Enabling KV metrics collection"
journalctl -u norikv-server | grep metrics

# Send a request to generate metrics
# (grpcurl flags such as -d must come before the address)
grpcurl -plaintext -d '{"key":"dGVzdA=="}' \
  localhost:6000 norikv.Kv/Get

# Check metrics again
curl http://localhost:8080/metrics

Solution: Ensure the same meter instance is passed to both GrpcServer and HttpServer in Node::start().

Next Steps¶

Metrics Reference - All metrics explained
Configuration - Server configuration options
Deployment - Production deployment patterns