Troubleshooting Guide¶
Common issues and their solutions for NoriKV operations.
Overview¶
This guide covers common problems encountered when operating NoriKV and provides step-by-step solutions.
Quick diagnosis:
# Check if server is running
curl -f http://localhost:8080/health/quick || echo "Server down"
# Check cluster status
curl -s http://localhost:8080/health | jq '.status'
# Check recent errors
sudo journalctl -u norikv --since "5 minutes ago" | grep ERROR
# Check metrics for issues
curl -s http://localhost:8080/metrics | grep -E "(error|failed)"
Server Won't Start¶
Symptoms¶
- systemctl status norikv shows "failed"
- No process listening on port 7447
- Logs show startup errors
Port Already in Use¶
Error:
Diagnosis:
Solution 1: Kill conflicting process
Solution 2: Change port
Permission Denied (Data Directory)¶
Error:
Diagnosis:
Solution:
# Fix ownership
sudo chown -R norikv:norikv /var/lib/norikv
# Fix permissions
sudo chmod 750 /var/lib/norikv
Invalid Configuration¶
Error:
Error: Invalid field: node_id cannot be empty
Error: Invalid rpc_addr: invalid socket address syntax
Diagnosis:
# Validate config manually
cat /etc/norikv/config.yaml
# Check for YAML syntax errors
yamllint /etc/norikv/config.yaml
Common mistakes:
# WRONG: Empty node_id
node_id: ""
# RIGHT:
node_id: "node0"
# WRONG: Missing port
rpc_addr: "10.0.1.10"
# RIGHT:
rpc_addr: "10.0.1.10:7447"
# WRONG: Invalid total_shards
cluster:
  total_shards: 0
# RIGHT:
cluster:
  total_shards: 1024
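The missing-port mistake above can be caught before the server is started. The following is a minimal sketch, assuming rpc_addr must be host:port with a numeric port; is_host_port is a hypothetical helper, not part of NoriKV, and it only checks the shape of the string without contacting the host.

```shell
# Hypothetical helper: succeed only if the argument looks like host:port
# with a numeric port between 1 and 65535. Nothing is resolved or contacted.
is_host_port() {
    hp_addr=$1
    case $hp_addr in
        *:*) ;;
        *) return 1 ;;          # no colon at all, so the port is missing
    esac
    hp_host=${hp_addr%:*}       # text before the last colon
    hp_port=${hp_addr##*:}      # text after the last colon
    [ -n "$hp_host" ] || return 1
    case $hp_port in
        ''|*[!0-9]*) return 1 ;;  # empty or non-numeric port
    esac
    [ "$hp_port" -ge 1 ] && [ "$hp_port" -le 65535 ]
}
```

For example, `is_host_port "10.0.1.10:7447" || echo "bad rpc_addr"` prints nothing, while the same check on "10.0.1.10" reports the problem.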
Out of Memory¶
Error:
Diagnosis:
Solution 1: Increase system memory
# Add swap (temporary)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Solution 2: Reduce memtable size (requires code change)
Currently memtable size is hardcoded at 64 MiB. Future: expose in config.
Solution 3: Reduce active shards
Cluster Formation Issues¶
Nodes Can't Join Cluster¶
Symptoms:
- Node starts but doesn't join cluster
- swim_cluster_size metric shows 1 instead of 3
- Logs show "Failed to join cluster" errors
Network Connectivity¶
Error:
Diagnosis:
# Test connectivity from node1 to node0
telnet 10.0.1.10 7447
# Check firewall
sudo iptables -L -n | grep 7447
# Check if seed node is up
curl http://10.0.1.10:8080/health/quick
Solution:
# Open firewall (Ubuntu/Debian)
sudo ufw allow 7447/tcp
sudo ufw allow 8080/tcp
# Open firewall (RHEL/CentOS)
sudo firewall-cmd --permanent --add-port=7447/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --reload
Wrong Seed Nodes¶
Error:
Diagnosis:
# Check seed nodes in config
grep -A5 "seed_nodes:" /etc/norikv/config.yaml
# Test each seed
for seed in 10.0.1.10:7447 10.0.1.11:7447 10.0.1.12:7447; do
  echo "Testing $seed..."
  nc -zv "${seed%:*}" "${seed#*:}"
done
Solution:
Fix seed_nodes addresses:
# WRONG: Using hostnames that don't resolve
cluster:
  seed_nodes:
    - "node0:7447" # DNS lookup fails
# RIGHT: Use IP addresses or resolvable hostnames
cluster:
  seed_nodes:
    - "10.0.1.10:7447"
    - "10.0.1.11:7447"
    - "10.0.1.12:7447"
Node ID Conflicts¶
Error:
Diagnosis:
# Check if multiple nodes have same node_id
for node in node0 node1 node2; do
  ssh "$node" "grep node_id /etc/norikv/config.yaml"
done
Solution:
Ensure each node has unique node_id:
# Node 0:
node_id: "node0"
# Node 1:
node_id: "node1" # NOT "node0"
# Node 2:
node_id: "node2" # NOT "node0"
Write Failures¶
"Not Leader" Errors¶
Error:
Diagnosis:
# Check shard leadership
curl -s http://localhost:8080/health | jq '.shards[] | select(.id == 42)'
# Check Raft metrics
curl -s http://localhost:8080/metrics | grep 'raft_leader{shard_id="42"}'
Root causes:
- Election in progress (wait 1-2 seconds)
- Cluster split (network partition)
- Majority offline (can't elect leader)
Solution 1: Retry (client-side)
SDK automatically retries on "not_leader" errors.
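Ad-hoc scripts that call the server directly (for example with grpcurl) do not get the SDK's retry behaviour. The sketch below is a generic shell retry wrapper for that case. The name retry and its arguments are hypothetical, and unlike the SDK it retries on any non-zero exit status, not only on "not_leader".

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_attempt=1
    while ! "$@"; do
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            return 1
        fi
        retry_attempt=$((retry_attempt + 1))
        sleep "$retry_delay"
    done
}
```

A 1-2 second delay matches the election window mentioned above, e.g. `retry 5 1 grpcurl -plaintext localhost:7447 norikv.Kv/Put -d '{"key":"dGVzdA==","value":"dmFsdWU="}'`.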
Solution 2: Check cluster health
# Ensure majority (≥2 of 3) nodes are alive
curl http://10.0.1.10:8080/health/quick
curl http://10.0.1.11:8080/health/quick
curl http://10.0.1.12:8080/health/quick
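The "≥2 of 3" rule is the usual Raft majority, floor(n/2) + 1. A small sketch of that arithmetic, with hypothetical helper names:

```shell
# Number of nodes that must be alive for a group of $1 voters to elect a leader.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when $1 alive nodes out of $2 total still form a majority.
has_quorum() {
    [ "$1" -ge "$(quorum "$2")" ]
}
```

With three nodes, `has_quorum 2 3` succeeds and `has_quorum 1 3` fails, so a single surviving node cannot accept writes.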
Solution 3: Wait for election
# Watch leader election
watch 'curl -s http://localhost:8080/metrics | grep raft_leader | grep shard_id=\"42\"'
L0 Write Stalls¶
Error:
Diagnosis:
# Check L0 file count
curl -s http://localhost:8080/metrics | grep 'lsm_sstable_count{level="L0"}'
# Check compaction rate
curl -s http://localhost:8080/metrics | grep 'lsm_compaction_duration_ms_count'
Root causes:
- Write amplification (high write rate, slow compaction)
- Slow disk (HDD instead of SSD)
- CPU bottleneck (compaction can't keep up)
Solution 1: Wait for compaction
L0 stall is self-resolving (compaction will reduce L0 count).
Solution 2: Increase compaction concurrency (requires code change)
Currently hardcoded at 4 concurrent compactions. Future: expose in config.
Solution 3: Reduce write rate
Prevention:
- Use SSD/NVMe (not HDD)
- Monitor lsm_sstable_count{level="L0"} and alert at >8
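Until a Prometheus alert is in place, the same threshold can be checked from a script. This is a sketch under two assumptions: the metric is exposed in the Prometheus text format without timestamps (value in the last field), and the largest L0 sample is the one that matters. The function names are hypothetical.

```shell
# Read Prometheus text from stdin and print the largest
# lsm_sstable_count sample labelled level="L0" (0 if none is present).
l0_max() {
    awk 'index($0, "lsm_sstable_count{") == 1 && index($0, "level=\"L0\"") > 0 {
             if ($NF + 0 > max) max = $NF + 0
         }
         END { print max + 0 }'
}

# Succeeds when the largest L0 count on stdin is above threshold $1.
l0_over_threshold() {
    [ "$(l0_max)" -gt "$1" ]
}
```

Usage: `curl -s http://localhost:8080/metrics | l0_over_threshold 8 && echo "L0 backlog building up"`.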
Disk Full¶
Error:
Diagnosis:
# Check disk space
df -h /var/lib/norikv
# Check largest directories
du -sh /var/lib/norikv/* | sort -rh | head -10
Solution 1: Free up space
# Delete old snapshots (if any)
find /var/lib/norikv/raft/*/snapshot -name "snapshot-*.dat" -mtime +7 -delete
# Remove archived journal files older than 7 days
sudo journalctl --vacuum-time=7d
Solution 2: Add more storage
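# Mount larger disk
sudo mkdir /mnt/norikv-data
sudo mount /dev/sdb1 /mnt/norikv-data
# Stop the server so files do not change during the copy
sudo systemctl stop norikv
# Migrate data (-a preserves ownership and permissions)
sudo rsync -av /var/lib/norikv/ /mnt/norikv-data/
# Update config, then start the server again
# data_dir: "/mnt/norikv-data"
sudo systemctl start norikv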
Prevention:
- Monitor disk usage: disk_free_bytes < 10GB → alert
- Plan for 3× dataset size (replication + compaction)
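Both prevention rules can be approximated from the shell when no metrics pipeline is available yet. The sketch below uses df in POSIX output mode; the helper names are hypothetical, and the threshold is expressed in GiB, which is close to but not exactly the 10GB figure above.

```shell
# Free space, in KiB, on the filesystem that holds path $1.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when free space under path $1 is below $2 GiB.
disk_low() {
    [ "$(free_kib "$1")" -lt $(( $2 * 1024 * 1024 )) ]
}

# Disk to provision, in GB, for a dataset of $1 GB using the 3x rule.
planned_disk_gb() {
    echo $(( $1 * 3 ))
}
```

Usage: `disk_low /var/lib/norikv 10 && echo "less than 10 GiB free"`; `planned_disk_gb 200` prints 600.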
Read Failures¶
Slow Reads¶
Symptoms:
- p95 GET latency > 50ms
- Timeouts on read requests
- Metrics show kv_request_duration_ms high
Diagnosis:
# Check latency percentiles
curl -s http://localhost:8080/metrics | grep kv_request_duration_ms
# Check L0 file count (the more L0 files, the more SSTables each read must check)
curl -s http://localhost:8080/metrics | grep 'lsm_sstable_count{level="L0"}'
# Check compaction backlog
curl -s http://localhost:8080/metrics | grep lsm_compaction_duration_ms_count
Root causes:
- L0 storm (many L0 files, high read amplification)
- No bloom filters (checking all SSTables)
- Hot key (single key getting all traffic)
- Disk I/O bottleneck
Solution 1: Wait for compaction
L0 files will be compacted to lower levels automatically.
Solution 2: Increase filter budget (requires code change)
Currently filter budget is 256 MiB. Increasing improves read performance.
Solution 3: Add read caching
Use a caching layer (Redis, Memcached) in front of NoriKV for hot keys.
Solution 4: Scale reads
- Use ConsistencyLevel::Stale for non-critical reads (reads from followers)
- Add more replicas to distribute read load
Key Not Found (Expected)¶
Error:
Diagnosis:
# Verify key actually exists
grpcurl -plaintext localhost:7447 norikv.Kv/Get \
-d '{"key":"base64_encoded_key"}'
# Check if key was deleted
# (Look for tombstone in logs)
sudo journalctl -u norikv | grep "delete.*nonexistent_key"
Root causes:
- Key never written (typo in key name)
- Key deleted (tombstone exists)
- Wrong shard (client routing to wrong shard - SDK bug)
Solution:
Verify key spelling and ensure PUT succeeded before GET.
Performance Degradation¶
High CPU Usage¶
Symptoms:
- Server using 100% CPU
- Requests timing out
- Metrics show cpu_usage > 90%
Diagnosis:
# Check CPU usage
top -p $(pgrep norikv-server)
# Profile CPU usage
sudo perf record -p $(pgrep norikv-server) -g -- sleep 10
sudo perf report
Root causes:
- Compaction storm (too many concurrent compactions)
- Hot shard (one shard getting all traffic)
- Inefficient queries (large scans)
Solution 1: Limit compaction concurrency
Currently hardcoded at 4. Future: expose io.max_background_compactions in config.
Solution 2: Distribute load
- Use better key distribution (avoid hot keys)
- Add more nodes to cluster
Solution 3: Vertical scaling
- Increase CPU cores (scale up instance)
High Memory Usage¶
Symptoms:
- Memory usage growing unbounded
- OOM kills
- Metrics show memory_usage > 80%
Diagnosis:
# Check memory usage
free -h
ps aux | grep norikv-server
# Check active shards
curl -s http://localhost:8080/health | jq '.active_shards'
# Check memtable sizes
curl -s http://localhost:8080/metrics | grep lsm_memtable_size_bytes
Root causes:
- Too many active shards (1024 shards × 64 MiB memtable = 64 GiB)
- Memtable not flushing (slow disk writes)
- Memory leak (bug - report it!)
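The first root cause is plain arithmetic: one memtable per active shard. A sketch of the worst-case estimate, using the shard count reported by /health and the 64 MiB memtable size stated in this guide (the helper name is hypothetical, and only memtables are counted, not other memory the server uses):

```shell
# Worst-case memtable memory in GiB for $1 active shards
# with a memtable of $2 MiB each (integer division, rounds down).
memtable_worst_case_gib() {
    echo $(( $1 * $2 / 1024 ))
}
```

`memtable_worst_case_gib 1024 64` prints 64, while a workload that touches only 128 shards stays at 8.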
Solution 1: Reduce active shards
Shards are created lazily. Workload determines active count.
Solution 2: Force memtable flush (manual intervention)
Currently no admin API. Future: Admin.FlushShard(shard_id)
Solution 3: Add more memory
- Vertical scaling (increase RAM)
Frequent Leader Elections¶
Symptoms:
- Writes timing out sporadically
- Metrics show raft_leader_changes_total increasing rapidly
- Logs show "Starting election for term X"
Diagnosis:
# Check election rate
curl -s http://localhost:8080/metrics | grep raft_leader_changes_total
# Check heartbeat failures
sudo journalctl -u norikv | grep "Heartbeat timeout"
# Check network latency
ping -c 10 10.0.1.11
Root causes:
- Network instability (packet loss, high latency)
- Node overloaded (can't send heartbeats in time)
- Network partition (nodes cannot reach the current leader)
Solution 1: Tune Raft timeouts for network
Currently timeouts are hardcoded. Future: expose in config.
For high-latency networks:
- heartbeat_interval: 500ms (default: 150ms)
- election_timeout_min: 1000ms (default: 300ms)
- election_timeout_max: 2000ms (default: 600ms)
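Both the default and the high-latency values keep the same ratio: the election timeout range is 2× to 4× the heartbeat interval. The sketch below derives a consistent set from a chosen heartbeat. The multipliers are inferred from the numbers in this guide, not read from NoriKV source, and raft_timeouts is a hypothetical helper.

```shell
# Print a heartbeat/election timeout set (ms) for heartbeat interval $1 (ms),
# using the 2x / 4x ratio implied by the documented defaults.
raft_timeouts() {
    rt_hb=$1
    echo "heartbeat_interval=${rt_hb}ms election_timeout_min=$((rt_hb * 2))ms election_timeout_max=$((rt_hb * 4))ms"
}
```

`raft_timeouts 500` reproduces the high-latency values listed above.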
Solution 2: Fix network issues
# Check for packet loss
ping -c 100 10.0.1.11 | grep loss
# Check MTU mismatches
tracepath 10.0.1.11
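To use the packet-loss check in a script, the percentage can be pulled out of the ping summary line. A sketch assuming the common "N% packet loss" wording printed by Linux and BSD ping; packet_loss_pct is a hypothetical helper.

```shell
# Read ping output on stdin and print the packet loss percentage
# (digits only, without the % sign). Prints nothing if no summary line is found.
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

Usage: `ping -c 100 10.0.1.11 | packet_loss_pct`. Any value consistently above 0 between cluster nodes is worth investigating before tuning timeouts.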
Solution 3: Scale up nodes
Ensure nodes have sufficient CPU to handle Raft heartbeats.
Health Check Failures¶
/health/quick Returns 503¶
Diagnosis:
# Check shard 0 specifically
curl -s http://localhost:8080/health | jq '.shards[] | select(.id == 0)'
# Check if shard 0 has leader
curl -s http://localhost:8080/metrics | grep 'raft_leader{shard_id="0"}'
Root causes:
- Shard 0 not created (server still starting up)
- No Raft leader elected (cluster forming)
- Shard 0 unresponsive (disk I/O blocked)
Solution:
Wait 30-60 seconds for server initialization. If the 503 persists, check the logs (sudo journalctl -u norikv):
/health Shows degraded Status¶
Diagnosis:
# Check which shards are unhealthy
curl -s http://localhost:8080/health | \
jq '.shards[] | select(.status != "healthy")'
# Check leadership
curl -s http://localhost:8080/health | \
jq '.shards[] | select(.is_leader == false)'
Root causes:
- Some shards have no leader (elections in progress)
- Raft replication lag (follower behind leader)
Solution:
If transient (during elections), wait. If persistent, investigate specific shards.
Monitoring & Alerting Issues¶
Prometheus Not Scraping¶
Symptoms:
- Grafana dashboards empty
- Prometheus targets showing "DOWN"
Diagnosis:
# Check if /metrics endpoint works
curl http://localhost:8080/metrics
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq
Root causes:
- Firewall blocking port 8080
- Wrong Prometheus scrape config
- Metrics disabled in config
Solution 1: Fix firewall
# Allow Prometheus server to access port 8080
sudo iptables -A INPUT -p tcp --dport 8080 -s <prometheus-ip> -j ACCEPT
Solution 2: Fix Prometheus config
# prometheus.yml
scrape_configs:
  - job_name: 'norikv'
    static_configs:
      - targets:
          - '10.0.1.10:8080' # Correct IP and port
          - '10.0.1.11:8080'
          - '10.0.1.12:8080'
    scrape_interval: 15s
Solution 3: Enable metrics
Metrics Missing¶
Symptoms:
- Some metrics not showing up in Prometheus
- kv_requests_total is 0 despite traffic
Diagnosis:
# Check all metrics
curl -s http://localhost:8080/metrics | sort
# Check if meter is wired
sudo journalctl -u norikv | grep "Enabling KV metrics"
Root causes:
- No traffic yet (no requests = no metrics)
- Meter not wired (code bug - should see log "Enabling KV metrics")
Solution:
Send test traffic to generate metrics:
grpcurl -plaintext localhost:7447 norikv.Kv/Put \
-d '{"key":"dGVzdA==","value":"dmFsdWU="}'
# Check metrics again
curl -s http://localhost:8080/metrics | grep kv_requests_total
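The key and value in the grpcurl payload are base64 strings ("dGVzdA==" is "test", "dmFsdWU=" is "value"). A sketch for producing and reading them from the shell, assuming the base64 tool from GNU coreutils (the -d flag may differ on other platforms); the function names are hypothetical.

```shell
# Encode a string as base64 on a single line (no trailing newline in the input).
b64() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode a base64 string back to its raw form.
b64d() {
    printf '%s' "$1" | base64 -d
}
```

Usage: `grpcurl -plaintext localhost:7447 norikv.Kv/Get -d "{\"key\":\"$(b64 test)\"}"`. Using printf instead of echo avoids encoding a stray newline into the key, which would otherwise look like a missing key on read.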
Data Loss / Corruption¶
Missing Data After Restart¶
Symptoms:
- Keys that existed before restart are now gone
- data_dir is empty after restart
Diagnosis:
# Check if data dir exists
ls -la /var/lib/norikv/
# Check for WAL/SST files
find /var/lib/norikv -name "*.wal" -o -name "*.sst"
Root causes:
- Wrong data_dir (config changed)
- Data deleted (manual deletion or script)
- Ephemeral storage (Docker volume not mounted)
Solution:
If data exists on disk:
If the data is truly lost:
Restore from backup (if available). Without a backup, the data is unrecoverable.
Prevention:
- Use persistent storage (not tmpfs, not ephemeral Docker volumes)
- Set up daily backups (snapshot Raft + LSM dirs)
- Monitor disk usage and set up alerts
Corrupted SSTable¶
Error:
Diagnosis:
# Find corrupted file
sudo journalctl -u norikv | grep "checksum mismatch"
# Check file
ls -lh /var/lib/norikv/lsm/shard-0/sst/L1-000042.sst
Root causes:
- Disk corruption (bad sectors)
- Incomplete write (power loss during compaction)
- Bit rot (silent data corruption over time)
Solution:
Currently no recovery mechanism. Future: Use Raft snapshots to rebuild lost data.
Prevention:
- Use ECC RAM
- Use enterprise SSDs with power-loss protection
- Monitor disk SMART metrics (smartctl -a /dev/sda)
Diagnostic Commands¶
Quick Health Check¶
#!/bin/bash
# healthcheck.sh
echo "=== NoriKV Health Check ==="
# 1. Check if server is running
if ! pgrep norikv-server > /dev/null; then
  echo "Server not running"
  exit 1
fi
echo "Server running (PID: $(pgrep norikv-server))"
# 2. Check HTTP endpoint
if ! curl -sf http://localhost:8080/health/quick > /dev/null; then
  echo "Health check failed"
  exit 1
fi
echo "Health check OK"
# 3. Check cluster size (treat a missing metric as 0 so the comparison is valid)
cluster_size=$(curl -s http://localhost:8080/metrics | grep '^swim_cluster_size' | awk '{print $2}')
cluster_size=${cluster_size:-0}
if [ "$cluster_size" -lt 3 ]; then
  echo "Warning: Cluster size is $cluster_size (expected 3)"
else
  echo "Cluster size: $cluster_size"
fi
# 4. Check for errors in last 5 minutes
# grep -c already prints 0 when nothing matches; "|| true" only masks its exit status
error_count=$(sudo journalctl -u norikv --since "5 minutes ago" | grep -c ERROR || true)
if [ "$error_count" -gt 0 ]; then
  echo "Warning: $error_count errors in last 5 minutes"
else
  echo "No recent errors"
fi
echo "Health check complete"
Performance Profiling¶
# CPU profiling (requires perf)
sudo perf record -F 99 -p $(pgrep norikv-server) -g -- sleep 30
sudo perf report
# Memory profiling (requires valgrind)
valgrind --tool=massif --massif-out-file=massif.out norikv-server --config config.yaml
# Trace syscalls
sudo strace -p $(pgrep norikv-server) -c -f
# Network profiling
sudo tcpdump -i any port 7447 -w norikv.pcap
Getting Help¶
Information to Include in Bug Reports¶
When reporting issues, please include:
- Version: norikv-server --version
- OS: uname -a, cat /etc/os-release
- Config: cat /etc/norikv/config.yaml (redact secrets)
- Logs: Last 1000 lines of logs (sudo journalctl -u norikv -n 1000)
- Metrics: curl http://localhost:8080/metrics
- Health: curl http://localhost:8080/health | jq
- Steps to reproduce: Exact commands that trigger the issue
Where to Ask¶
- GitHub Issues: https://github.com/norikv/norikv/issues
- Discord: https://discord.gg/norikv (future)
- Email: support@norikv.io (enterprise support)
Next Steps¶
- Configuration Reference - Fine-tune settings
- Metrics Reference - Understand metrics for debugging
- Deployment Guide - Proper deployment prevents issues