Troubleshooting Guide¶

Common issues and their solutions for NoriKV operations.

Overview¶

This guide covers common problems encountered when operating NoriKV and provides step-by-step solutions.

Quick diagnosis:

# Check if server is running
curl -f http://localhost:8080/health/quick || echo "Server down or unhealthy"

# Check cluster status
curl -s http://localhost:8080/health | jq '.status'

# Check recent errors
sudo journalctl -u norikv --since "5 minutes ago" | grep ERROR

# Check metrics for issues
curl -s http://localhost:8080/metrics | grep -E "(error|failed)"

Server Won't Start¶

Symptoms¶

systemctl status norikv shows "failed"
No process listening on port 7447
Logs show startup errors

Port Already in Use¶

Error:

Error: Failed to bind: Address already in use (os error 98)

Diagnosis:

# Check what's using the port
sudo lsof -i :7447
sudo lsof -i :8080

Solution 1: Kill conflicting process

# Find and kill process
sudo kill $(sudo lsof -t -i:7447)

Solution 2: Change port

# config.yaml
rpc_addr: "0.0.0.0:7448"  # Use different port
http_addr: "0.0.0.0:8081"

Permission Denied (Data Directory)¶

Error:

Error: Cannot create data_dir: Permission denied

Diagnosis:

# Check permissions
ls -ld /var/lib/norikv
ls -l /var/lib/norikv

Solution:

# Fix ownership
sudo chown -R norikv:norikv /var/lib/norikv

# Fix permissions
sudo chmod 750 /var/lib/norikv

Invalid Configuration¶

Error:

Error: Invalid field: node_id cannot be empty
Error: Invalid rpc_addr: invalid socket address syntax

Diagnosis:

# Validate config manually
cat /etc/norikv/config.yaml

# Check for YAML syntax errors
yamllint /etc/norikv/config.yaml

Common mistakes:

# WRONG: Empty node_id
node_id: ""

# RIGHT:
node_id: "node0"

# WRONG: Missing port
rpc_addr: "10.0.1.10"

# RIGHT:
rpc_addr: "10.0.1.10:7447"

# WRONG: Invalid total_shards
cluster:
  total_shards: 0

# RIGHT:
cluster:
  total_shards: 1024
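The mistakes above can be caught before the server is started. A minimal pre-flight sketch in shell, assuming the only rules are the ones shown here (non-empty node_id, a host:port address with a numeric port, a positive total_shards). The function names are hypothetical and NoriKV's own validation may be stricter.

```shell
# Hypothetical pre-flight checks mirroring the common mistakes above.
# They approximate, and do not replace, the server's own validation.

# Succeeds if $1 looks like host:port with a port in 1..65535.
# The body runs in a subshell, so "exit" only ends this function.
valid_addr() (
  addr=$1
  case $addr in
    *:*) ;;              # has a colon, keep checking
    *) exit 1 ;;         # no colon: the port is missing
  esac
  host=${addr%:*}        # text before the last colon
  port=${addr##*:}       # text after the last colon
  [ -n "$host" ] || exit 1
  case $port in
    ''|*[!0-9]*) exit 1 ;;   # empty or non-numeric port
  esac
  [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
)

# Succeeds if $1 is a non-empty node_id.
valid_node_id() {
  [ -n "${1:-}" ]
}

# Succeeds if $1 is a positive integer shard count.
valid_total_shards() {
  case ${1:-} in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}
```

For example, `valid_addr "10.0.1.10"` fails because the port is missing, while `valid_addr "10.0.1.10:7447"` succeeds.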

Out of Memory¶

Error:

Error: Cannot allocate memory
FATAL: Out of memory

Diagnosis:

# Check available memory
free -h

# Check server memory usage
ps aux | grep norikv-server

Solution 1: Increase system memory

# Add swap (temporary)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Solution 2: Reduce memtable size (requires code change)

Currently memtable size is hardcoded at 64 MiB. Future: expose in config.

Solution 3: Reduce active shards

# Reduce total shards for low-memory systems
cluster:
  total_shards: 128  # Instead of 1024
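The saving can be estimated from the 64 MiB memtable size mentioned above. A rough sketch of the arithmetic, assuming the worst case where every shard is active and holds one full memtable. The function name is hypothetical, and real usage also includes caches, filters, and other overhead.

```shell
# Worst-case memtable memory in GiB (integer division).
# Usage: memtable_gib <total_shards> [memtable_mib]
memtable_gib() {
  echo $(( $1 * ${2:-64} / 1024 ))
}

memtable_gib 1024   # prints 64
memtable_gib 128    # prints 8
```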

Cluster Formation Issues¶

Nodes Can't Join Cluster¶

Symptoms:

- Node starts but doesn't join cluster
- swim_cluster_size metric shows 1 instead of 3
- Logs show "Failed to join cluster" errors

Network Connectivity¶

Error:

Failed to join cluster: Connection refused (os error 111)

Diagnosis:

# Test connectivity from node1 to node0
telnet 10.0.1.10 7447

# Check firewall
sudo iptables -L -n | grep 7447

# Check if seed node is up
curl http://10.0.1.10:8080/health/quick

Solution:

# Open firewall (Ubuntu/Debian)
sudo ufw allow 7447/tcp
sudo ufw allow 8080/tcp

# Open firewall (RHEL/CentOS)
sudo firewall-cmd --permanent --add-port=7447/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --reload

Wrong Seed Nodes¶

Error:

No seed nodes reachable

Diagnosis:

# Check seed nodes in config
grep -A5 "seed_nodes:" /etc/norikv/config.yaml

# Test each seed
for seed in 10.0.1.10:7447 10.0.1.11:7447 10.0.1.12:7447; do
  echo "Testing $seed..."
  nc -zv "${seed%:*}" "${seed#*:}"
done

Solution:

Fix seed_nodes addresses:

# WRONG: Using hostnames that don't resolve
cluster:
  seed_nodes:
    - "node0:7447"  # DNS lookup fails

# RIGHT: Use IP addresses or resolvable hostnames
cluster:
  seed_nodes:
    - "10.0.1.10:7447"
    - "10.0.1.11:7447"
    - "10.0.1.12:7447"

Node ID Conflicts¶

Error:

Node with ID 'node0' already exists in cluster

Diagnosis:

# Check if multiple nodes have same node_id
for node in node0 node1 node2; do
  ssh $node "grep node_id /etc/norikv/config.yaml"
done

Solution:

Ensure each node has unique node_id:

# Node 0:
node_id: "node0"

# Node 1:
node_id: "node1"  # NOT "node0"

# Node 2:
node_id: "node2"  # NOT "node0"
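Duplicates are easier to spot mechanically than by eye. A small sketch, assuming node_id is a top-level, double-quoted key as in the examples above; both function names are hypothetical.

```shell
# Extract the value from lines such as: node_id: "node0"
extract_node_id() {
  sed -n 's/^node_id:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Print every node_id that appears more than once.
# Input: one node_id per line on stdin.
duplicate_node_ids() {
  sort | uniq -d
}
```

Feeding every node's config through `extract_node_id | duplicate_node_ids` prints nothing when all IDs are unique, and prints the conflicting ID otherwise.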

Write Failures¶

"Not Leader" Errors¶

Error:

ERROR: not_leader: This node is not the Raft leader for shard 42

Diagnosis:

# Check shard leadership
curl -s http://localhost:8080/health | jq '.shards[] | select(.id == 42)'

# Check Raft metrics
curl -s http://localhost:8080/metrics | grep 'raft_leader{shard_id="42"}'

Root causes:

Election in progress (wait 1-2 seconds)
Cluster split (network partition)
Majority offline (can't elect leader)

Solution 1: Retry (client-side)

SDK automatically retries on "not_leader" errors.
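Scripts that call the server without the SDK need their own retry. A minimal sketch of retry with exponential backoff in shell; it is not the SDK's algorithm, the function name is hypothetical, and it retries on any failure rather than only on "not_leader". Delays must be whole seconds.

```shell
# Run a command until it succeeds, doubling the delay after each failure.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  rb_max=$1
  rb_delay=$2
  shift 2
  rb_attempt=1
  while :; do
    "$@" && return 0                              # command succeeded
    [ "$rb_attempt" -ge "$rb_max" ] && return 1   # out of attempts
    sleep "$rb_delay"
    rb_delay=$(( rb_delay * 2 ))
    rb_attempt=$(( rb_attempt + 1 ))
  done
}
```

With an initial delay of 1 and 4 attempts, the waits between attempts are 1, 2, and 4 seconds, which comfortably covers the 1-2 seconds an election is expected to take.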

Solution 2: Check cluster health

# Ensure majority (≥2 of 3) nodes are alive
curl http://10.0.1.10:8080/health/quick
curl http://10.0.1.11:8080/health/quick
curl http://10.0.1.12:8080/health/quick
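The "≥2 of 3" figure is the Raft majority for a three-node cluster. A short sketch of the general arithmetic for other cluster sizes; the function names are hypothetical.

```shell
# Smallest number of nodes that forms a majority in a cluster of $1 nodes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds if <alive> nodes out of <total> can still elect a leader.
# Usage: has_quorum <alive> <total>
has_quorum() {
  [ "$1" -ge "$(quorum "$2")" ]
}

quorum 3   # prints 2
quorum 5   # prints 3
```

Counting how many of the three health checks above succeed and passing that count to `has_quorum <count> 3` tells you whether a leader can be elected at all.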

Solution 3: Wait for election

# Watch leader election
watch 'curl -s http://localhost:8080/metrics | grep raft_leader | grep shard_id=\"42\"'

L0 Write Stalls¶

Error:

ERROR: L0 stall: Too many L0 files (12), writes blocked

Diagnosis:

# Check L0 file count
curl -s http://localhost:8080/metrics | grep 'lsm_sstable_count{level="L0"}'

# Check compaction rate
curl -s http://localhost:8080/metrics | grep 'lsm_compaction_duration_ms_count'

Root causes:

Write amplification (high write rate, slow compaction)
Slow disk (HDD instead of SSD)
CPU bottleneck (compaction can't keep up)

Solution 1: Wait for compaction

L0 stall is self-resolving (compaction will reduce L0 count).

Solution 2: Increase compaction concurrency (requires code change)

Currently hardcoded at 4 concurrent compactions. Future: expose in config.

Solution 3: Reduce write rate

# Throttle writes from client
# (Implement rate limiting in your application)

Prevention:

Use SSD/NVMe (not HDD)
Monitor lsm_sstable_count{level="L0"} and alert at >8
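Until a proper alert rule is in place, the same threshold can be checked from a cron job. A sketch that reads the /metrics output on stdin, assuming the Prometheus text format without timestamps, so that the value is the last field on the line. If the metric has one series per shard, the largest value is used. The function names are hypothetical.

```shell
# Print the largest lsm_sstable_count value among series labelled
# level="L0" (prints 0 if no such series is present).
max_l0_files() {
  awk '
    /^lsm_sstable_count[{]/ && /level="L0"/ {
      if ($NF + 0 > max) max = $NF + 0
    }
    END { printf "%d\n", max }
  '
}

# Succeeds (exit 0) when the L0 count on stdin exceeds the threshold.
# Usage: l0_alert [threshold]   (default 8)
l0_alert() {
  [ "$(max_l0_files)" -gt "${1:-8}" ]
}
```

Typical use: `curl -s http://localhost:8080/metrics | l0_alert 8 && echo "L0 backlog high"`.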

Disk Full¶

Error:

ERROR: No space left on device (os error 28)

Diagnosis:

# Check disk space
df -h /var/lib/norikv

# Check largest directories
du -sh /var/lib/norikv/* | sort -rh | head -10

Solution 1: Free up space

# Delete old snapshots (if any)
find /var/lib/norikv/raft/*/snapshot -name "snapshot-*.dat" -mtime +7 -delete

# Remove archived journal logs older than 7 days
sudo journalctl --vacuum-time=7d

Solution 2: Add more storage

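# Stop the server so the data does not change during the copy
sudo systemctl stop norikv

# Mount larger disk
sudo mkdir -p /mnt/norikv-data
sudo mount /dev/sdb1 /mnt/norikv-data

# Migrate data (-a preserves ownership and permissions)
sudo rsync -a /var/lib/norikv/ /mnt/norikv-data/

# Update config, then start the server
# data_dir: "/mnt/norikv-data"
sudo systemctl start norikv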

Prevention:

Monitor disk usage: disk_free_bytes < 10GB → alert
Plan for 3× dataset size (replication + compaction)
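Both rules can be turned into quick checks. A sketch that applies the 3× planning rule and evaluates the 10 GB threshold from `df -Pk` output on stdin. It treats the threshold as GiB, assumes the filesystem name contains no spaces, and uses hypothetical function names.

```shell
# Disk to provision, in GiB, for a dataset of $1 GiB (3x rule).
planned_disk_gib() {
  echo $(( $1 * 3 ))
}

# Read `df -Pk <path>` output on stdin and print the available space in
# whole GiB. Column 4 of the second line is the available space in KiB.
free_gib_from_df() {
  awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# Succeeds (exit 0) when free space on stdin is below <min_gib>.
# Usage: disk_alert [min_gib]   (default 10)
disk_alert() {
  [ "$(free_gib_from_df)" -lt "${1:-10}" ]
}

planned_disk_gib 200   # prints 600
```

Typical use: `df -Pk /var/lib/norikv | disk_alert 10 && echo "Low disk space"`.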

Read Failures¶

Slow Reads¶

Symptoms:

- p95 GET latency > 50ms
- Timeouts on read requests
- Metrics show kv_request_duration_ms high

Diagnosis:

# Check latency percentiles
curl -s http://localhost:8080/metrics | grep kv_request_duration_ms

# Check L0 file count (many L0 files mean more SSTables to search per read)
curl -s http://localhost:8080/metrics | grep 'lsm_sstable_count{level="L0"}'

# Check compaction backlog
curl -s http://localhost:8080/metrics | grep lsm_compaction_duration_ms_count

Root causes:

L0 storm (many L0 files, high read amplification)
No bloom filters (checking all SSTables)
Hot key (single key getting all traffic)
Disk I/O bottleneck

Solution 1: Wait for compaction

Compaction automatically merges L0 files into deeper levels (L1 and below), which reduces the number of files each read has to check.

Solution 2: Increase filter budget (requires code change)

Currently filter budget is 256 MiB. Increasing improves read performance.

Solution 3: Add read caching

Use a caching layer (Redis, Memcached) in front of NoriKV for hot keys.

Solution 4: Scale reads

Use ConsistencyLevel::Stale for non-critical reads (reads from followers)
Add more replicas to distribute read load

Key Not Found (Expected)¶

Error:

GET("nonexistent_key") → None

Diagnosis:

# Verify key actually exists
grpcurl -plaintext -d '{"key":"base64_encoded_key"}' \
  localhost:7447 norikv.Kv/Get

# Check if key was deleted
# (Look for tombstone in logs)
sudo journalctl -u norikv | grep "delete.*nonexistent_key"
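The request body carries the key base64-encoded, so a typo in the encoding step looks exactly like a missing key. A small sketch for building the payload from a raw key, assuming the key field shown above; the function names are hypothetical.

```shell
# Base64-encode a string without a trailing newline or line wrapping.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Build the JSON body for a Get request on the given raw key.
# No JSON escaping is needed because base64 output only contains
# letters, digits, "+", "/" and "=".
get_payload() {
  printf '{"key":"%s"}\n' "$(b64 "$1")"
}

b64 test; echo   # prints dGVzdA==
```

Using `printf '%s'` instead of `echo` matters here: `echo test | base64` also encodes the trailing newline and yields a different key.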

Root causes:

Key never written (typo in key name)
Key deleted (tombstone exists)
Wrong shard (client routing to wrong shard - SDK bug)

Solution:

Verify key spelling and ensure PUT succeeded before GET.

Performance Degradation¶

High CPU Usage¶

Symptoms:

- Server using 100% CPU
- Requests timing out
- Metrics show cpu_usage > 90%

Diagnosis:

# Check CPU usage
top -p $(pgrep norikv-server)

# Profile CPU usage
sudo perf record -p $(pgrep norikv-server) -g -- sleep 10
sudo perf report

Root causes:

Compaction storm (too many concurrent compactions)
Hot shard (one shard getting all traffic)
Inefficient queries (large scans)

Solution 1: Limit compaction concurrency

Currently hardcoded at 4. Future: expose io.max_background_compactions in config.

Solution 2: Distribute load

Use better key distribution (avoid hot keys)
Add more nodes to cluster

Solution 3: Vertical scaling

Increase CPU cores (scale up instance)

High Memory Usage¶

Symptoms:

- Memory usage growing unbounded
- OOM kills
- Metrics show memory_usage > 80%

Diagnosis:

# Check memory usage
free -h
ps aux | grep norikv-server

# Check active shards
curl -s http://localhost:8080/health | jq '.active_shards'

# Check memtable sizes
curl -s http://localhost:8080/metrics | grep lsm_memtable_size_bytes

Root causes:

Too many active shards (1024 shards × 64 MiB memtable = 64 GiB)
Memtable not flushing (slow disk writes)
Memory leak (bug - report it!)

Solution 1: Reduce active shards

Shards are created lazily. Workload determines active count.

Solution 2: Force memtable flush (manual intervention)

Currently no admin API. Future: Admin.FlushShard(shard_id)

Solution 3: Add more memory

Vertical scaling (increase RAM)

Frequent Leader Elections¶

Symptoms:

- Writes timing out sporadically
- Metrics show raft_leader_changes_total increasing rapidly
- Logs show "Starting election for term X"

Diagnosis:

# Check election rate
curl -s http://localhost:8080/metrics | grep raft_leader_changes_total

# Check heartbeat failures
sudo journalctl -u norikv | grep "Heartbeat timeout"

# Check network latency
ping -c 10 10.0.1.11

Root causes:

Network instability (packet loss, high latency)
Node overloaded (can't send heartbeats in time)
Split brain (network partition)

Solution 1: Tune Raft timeouts for network

Currently timeouts are hardcoded. Future: expose in config.

For high-latency networks:

- heartbeat_interval: 500ms (default: 150ms)
- election_timeout_min: 1000ms (default: 300ms)
- election_timeout_max: 2000ms (default: 600ms)
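Both the defaults and the high-latency values keep the same proportions: election_timeout_min is twice the heartbeat interval, and election_timeout_max is twice the minimum. A sketch for sanity-checking a candidate set of values once they are configurable. The function name is hypothetical, and the 2× floor is inferred from these numbers rather than documented as a requirement.

```shell
# Succeeds when the election minimum is at least twice the heartbeat
# interval and the maximum is above the minimum, so that the randomized
# election timeout has a non-empty range. All values are in milliseconds.
# Usage: raft_timeouts_ok <heartbeat> <election_min> <election_max>
raft_timeouts_ok() {
  [ "$1" -gt 0 ] &&
  [ "$2" -ge $(( $1 * 2 )) ] &&
  [ "$3" -gt "$2" ]
}
```

`raft_timeouts_ok 500 1000 2000` succeeds, while raising only the heartbeat (`raft_timeouts_ok 500 300 600`) fails because followers would time out before the next heartbeat arrives.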

Solution 2: Fix network issues

# Check for packet loss
ping -c 100 10.0.1.11 | grep loss

# Check MTU mismatches
tracepath 10.0.1.11

Solution 3: Scale up nodes

Ensure nodes have sufficient CPU to handle Raft heartbeats.

Health Check Failures¶

`/health/quick` Returns 503¶

Diagnosis:

# Check shard 0 specifically
curl -s http://localhost:8080/health | jq '.shards[] | select(.id == 0)'

# Check if shard 0 has leader
curl -s http://localhost:8080/metrics | grep 'raft_leader{shard_id="0"}'

Root causes:

Shard 0 not created (server still starting up)
No Raft leader elected (cluster forming)
Shard 0 unresponsive (disk I/O blocked)

Solution:

Wait 30-60 seconds for server initialization. If the problem persists, check the logs:

sudo journalctl -u norikv -f | grep "shard-0"

`/health` Shows `degraded` Status¶

Diagnosis:

# Check which shards are unhealthy
curl -s http://localhost:8080/health | \
  jq '.shards[] | select(.status != "healthy")'

# List shards this node does not lead (expected on followers; compare across nodes)
curl -s http://localhost:8080/health | \
  jq '.shards[] | select(.is_leader == false)'

Root causes:

Some shards have no leader (elections in progress)
Raft replication lag (follower behind leader)

Solution:

If transient (during elections), wait. If persistent, investigate specific shards.

Monitoring & Alerting Issues¶

Prometheus Not Scraping¶

Symptoms:

- Grafana dashboards empty
- Prometheus targets showing "DOWN"

Diagnosis:

# Check if /metrics endpoint works
curl http://localhost:8080/metrics

# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq

Root causes:

Firewall blocking port 8080
Wrong Prometheus scrape config
Metrics disabled in config

Solution 1: Fix firewall

# Allow Prometheus server to access port 8080
sudo iptables -A INPUT -p tcp --dport 8080 -s <prometheus-ip> -j ACCEPT

Solution 2: Fix Prometheus config

# prometheus.yml
scrape_configs:
  - job_name: 'norikv'
    static_configs:
      - targets:
        - '10.0.1.10:8080'  # Correct IP and port
        - '10.0.1.11:8080'
        - '10.0.1.12:8080'
    scrape_interval: 15s

Solution 3: Enable metrics

# config.yaml
telemetry:
  prometheus:
    enabled: true  # Must be true

Metrics Missing¶

Symptoms:

- Some metrics not showing up in Prometheus
- kv_requests_total is 0 despite traffic

Diagnosis:

# Check all metrics
curl -s http://localhost:8080/metrics | sort

# Check if meter is wired
sudo journalctl -u norikv | grep "Enabling KV metrics"

Root causes:

No traffic yet (no requests = no metrics)
Meter not wired (code bug - should see log "Enabling KV metrics")

Solution:

Send test traffic to generate metrics:

grpcurl -plaintext -d '{"key":"dGVzdA==","value":"dmFsdWU="}' \
  localhost:7447 norikv.Kv/Put

# Check metrics again
curl -s http://localhost:8080/metrics | grep kv_requests_total

Data Loss / Corruption¶

Missing Data After Restart¶

Symptoms:

- Keys that existed before restart are now gone
- data_dir is empty after restart

Diagnosis:

# Check if data dir exists
ls -la /var/lib/norikv/

# Check for WAL/SST files
find /var/lib/norikv -name "*.wal" -o -name "*.sst"

Root causes:

Wrong data_dir (config changed)
Data deleted (manual deletion or script)
Ephemeral storage (Docker volume not mounted)

Solution:

If data exists on disk:

# Fix config to point to correct data_dir
data_dir: "/var/lib/norikv"  # Not /tmp/norikv

If data truly lost:

Restore from backup (if available). If no backup, data is unrecoverable.

Prevention:

Use persistent storage (not tmpfs, not ephemeral Docker volumes)
Set up daily backups (snapshot Raft + LSM dirs)
Monitor disk usage and set up alerts
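A daily backup can start as a simple archive job. A minimal sketch, assuming the raft and lsm directories sit directly under data_dir, as the paths elsewhere in this guide suggest. The function name is hypothetical. Archiving while the server is writing may capture an inconsistent state, so run it against a stopped node or a filesystem snapshot.

```shell
# Archive the raft and lsm directories under <data_dir> into <dest_dir>.
# Prints the path of the archive it created.
# Usage: backup_norikv <data_dir> <dest_dir>
backup_norikv() {
  bk_src=$1
  bk_dest=$2
  bk_file="$bk_dest/norikv-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$bk_dest" || return 1
  # -C makes the archived paths relative to the data directory.
  tar -czf "$bk_file" -C "$bk_src" raft lsm || return 1
  echo "$bk_file"
}
```

Typical use from cron: `backup_norikv /var/lib/norikv /mnt/backups`. Restoring is the reverse: extract the archive into an empty data_dir before starting the server.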

Corrupted SSTable¶

Error:

ERROR: SSTable checksum mismatch: file corrupted

Diagnosis:

# Find corrupted file
sudo journalctl -u norikv | grep "checksum mismatch"

# Check file
ls -lh /var/lib/norikv/lsm/shard-0/sst/L1-000042.sst

Root causes:

Disk corruption (bad sectors)
Incomplete write (power loss during compaction)
Bit rot (silent data corruption over time)

Solution:

Currently no recovery mechanism. Future: Use Raft snapshots to rebuild lost data.

Prevention:

Use ECC RAM
Use enterprise SSDs with power-loss protection
Monitor disk SMART metrics (smartctl -a /dev/sda)

Diagnostic Commands¶

Quick Health Check¶

#!/bin/bash
# healthcheck.sh

echo "=== NoriKV Health Check ==="
warnings=0

# 1. Check if server is running
if ! pgrep norikv-server > /dev/null; then
  echo "[FAIL] Server not running"
  exit 1
fi
echo "[OK] Server running (PID: $(pgrep norikv-server))"

# 2. Check HTTP endpoint
if ! curl -sf http://localhost:8080/health/quick > /dev/null; then
  echo "[FAIL] Health check failed"
  exit 1
fi
echo "[OK] Health check OK"

# 3. Check cluster size (falls back to 0 if the metric is missing)
cluster_size=$(curl -s http://localhost:8080/metrics | awk '/^swim_cluster_size/ {print int($2); exit}')
cluster_size=${cluster_size:-0}
if [ "$cluster_size" -lt 3 ]; then
  echo "[WARN] Cluster size is $cluster_size (expected 3)"
  warnings=$((warnings + 1))
else
  echo "[OK] Cluster size: $cluster_size"
fi

# 4. Check for errors in last 5 minutes
# grep -c already prints 0 when nothing matches; "|| true" only ignores its exit status
error_count=$(sudo journalctl -u norikv --since "5 minutes ago" | grep -c ERROR || true)
if [ "$error_count" -gt 0 ]; then
  echo "[WARN] $error_count errors in last 5 minutes"
  warnings=$((warnings + 1))
else
  echo "[OK] No recent errors"
fi

# Only report success when no check raised a warning
if [ "$warnings" -eq 0 ]; then
  echo "[OK] All checks passed"
else
  echo "[WARN] Completed with $warnings warning(s)"
fi

Performance Profiling¶

# CPU profiling (requires perf)
sudo perf record -F 99 -p $(pgrep norikv-server) -g -- sleep 30
sudo perf report

# Memory profiling (requires valgrind)
valgrind --tool=massif --massif-out-file=massif.out norikv-server --config config.yaml

# Trace syscalls
sudo strace -p $(pgrep norikv-server) -c -f

# Network profiling
sudo tcpdump -i any port 7447 -w norikv.pcap

Getting Help¶

Information to Include in Bug Reports¶

When reporting issues, please include:

Version: norikv-server --version
OS: uname -a, cat /etc/os-release
Config: cat /etc/norikv/config.yaml (redact secrets)
Logs: Last 1000 lines of logs (sudo journalctl -u norikv -n 1000)
Metrics: curl http://localhost:8080/metrics
Health: curl http://localhost:8080/health | jq
Steps to reproduce: Exact commands that trigger the issue

Where to Ask¶

GitHub Issues: https://github.com/norikv/norikv/issues
Discord: https://discord.gg/norikv (future)
Email: support@norikv.io (enterprise support)

Next Steps¶

Configuration Reference - Fine-tune settings
Metrics Reference - Understand metrics for debugging
Deployment Guide - Proper deployment prevents issues
