NoriKV Java Client Architecture¶
Understanding the internal design and components of the Java client SDK.
Table of Contents¶
- Overview
- Component Architecture
- Request Flow
- Threading Model
- Connection Management
- Routing & Sharding
- Retry Logic
- Error Handling
Overview¶
The NoriKV Java client is designed as a smart client that: - Routes requests directly to the appropriate shard leader - Maintains connection pools for efficient communication - Implements retry logic with exponential backoff - Tracks cluster topology changes - Provides thread-safe operations
Design Principles¶
- Zero-hop routing: Client routes directly to shard leader (no proxy)
- Thread safety: Single client instance shared across threads
- Fail-fast: Detect and handle failures quickly
- Observable: Expose metrics and statistics
- Resource efficient: Connection pooling and reuse
Component Architecture¶
┌─────────────────────────────────────────────────────────┐
│ NoriKVClient │
│ (Main API: put, get, delete, topology, stats) │
└──────────────────┬──────────────────────────────────────┘
│
┌──────────┼──────────┬──────────┐
│ │ │ │
┌───────▼────┐ ┌──▼────┐ ┌───▼──────┐ ┌─▼─────────┐
│ Router │ │ Retry │ │ Pool │ │ Topology │
│ │ │Policy │ │ │ │ Manager │
└────────────┘ └───────┘ └──────────┘ └───────────┘
│ │ │
│ │ │
└─────────hash()──────────┤ │
│ │
┌─────────getChannel()────┤ │
│ │ │
│ ┌────▼────┐ │
│ │ gRPC │ │
│ │Channels │ │
│ └────┬────┘ │
│ │ │
│ │ │
└─────────updateView()────┴──────────────┘
Components¶
1. NoriKVClient¶
Responsibility: Main public API and component coordination
Key Methods:
- put(), get(), delete() - Core operations
- getClusterView() - Topology information
- onTopologyChange() - Subscribe to topology updates
- getStats() - Client statistics
- close() - Resource cleanup
Location: src/main/java/com/norikv/client/NoriKVClient.java
2. Router¶
Responsibility: Determine which node to send requests to
Key Functions:
- Hash key to shard: xxhash64(key) → jumpConsistentHash(hash, totalShards) → shardId
- Map shard to leader node
- Cache leader information
- Handle leader hints from NOT_LEADER errors
Location: src/main/java/com/norikv/client/internal/router/Router.java
Algorithm:
1. Hash key using XXHash64 (seed=0)
2. Map hash to shard using Jump Consistent Hash
3. Look up shard leader in topology cache
4. Return leader's address
3. ConnectionPool¶
Responsibility: Manage gRPC channels to cluster nodes
Key Functions: - Create and cache gRPC channels per node - Thread-safe concurrent access - Graceful shutdown - Health monitoring
Location: src/main/java/com/norikv/client/internal/conn/ConnectionPool.java
Design:
- One ManagedChannel per node address
- Lazy initialization (created on first use)
- Channels reused across requests
- Automatic shutdown on client close
4. RetryPolicy¶
Responsibility: Handle transient failures with backoff
Key Functions:
- Exponential backoff: delay = min(initialDelay * 2^attempt, maxDelay)
- Jitter: Add randomness to avoid thundering herd
- Selective retry: Only retry transient errors
- Attempt tracking
Location: src/main/java/com/norikv/client/internal/retry/RetryPolicy.java
Retryable Errors:
- UNAVAILABLE - Server temporarily unavailable
- ABORTED - Operation aborted, safe to retry
- DEADLINE_EXCEEDED - Timeout, may succeed on retry
- RESOURCE_EXHAUSTED - Rate limited, backoff helps
Non-Retryable Errors:
- INVALID_ARGUMENT - Client error, won't succeed
- NOT_FOUND - Key doesn't exist
- FAILED_PRECONDITION - CAS conflict, application must retry
- PERMISSION_DENIED - Auth error
5. TopologyManager¶
Responsibility: Track cluster membership and shard assignments
Key Functions:
- Store current ClusterView
- Cache shard → leader mappings
- Detect topology changes
- Notify listeners of changes
- Update leader hints
Location: src/main/java/com/norikv/client/internal/topology/TopologyManager.java
Data Structures:
- ClusterView: Current cluster state (epoch, nodes, shards)
- shardLeaderCache: ConcurrentHashMaplisteners: CopyOnWriteArrayList of change callbacks
Request Flow¶
PUT Request Flow¶
Client.put(key, value, options)
│
├─> 1. Validate inputs (key, value not null/empty)
│
├─> 2. Router.getNodeForKey(key)
│ ├─> hash = xxhash64(key)
│ ├─> shardId = jumpConsistentHash(hash, totalShards)
│ └─> leaderAddr = topologyManager.getShardLeader(shardId)
│
├─> 3. ConnectionPool.getChannel(leaderAddr)
│ └─> Return cached or create new gRPC channel
│
├─> 4. RetryPolicy.execute(() -> {
│ ├─> Build gRPC PutRequest (via ProtoConverters)
│ ├─> KvGrpc.newBlockingStub(channel).put(request)
│ └─> Convert response to Version
│ })
│ ├─> On SUCCESS: return Version
│ ├─> On RETRYABLE_ERROR: backoff and retry
│ └─> On NON_RETRYABLE: throw exception
│
└─> 5. Return Version to caller
GET Request Flow¶
Similar to PUT, but:
- Uses GetRequest with consistency level
- Returns GetResult (value + version)
- Throws KeyNotFoundException on NOT_FOUND
Error Handling in Flow¶
gRPC StatusRuntimeException
│
├─> convertGrpcException()
│ ├─> NOT_FOUND → KeyNotFoundException
│ ├─> FAILED_PRECONDITION + "version" → VersionMismatchException
│ ├─> UNAVAILABLE → ConnectionException
│ └─> OTHER → NoriKVException
│
└─> RetryPolicy decides:
├─> Retryable → backoff and retry
└─> Non-retryable → throw to caller
Threading Model¶
Thread Safety Guarantees¶
All client components are thread-safe:
- NoriKVClient: Thread-safe, shareable across threads
- Router: Uses immutable routing tables, atomic leader cache updates
- ConnectionPool: ConcurrentHashMap for channel storage
- TopologyManager: ReadWriteLock for view updates, ConcurrentHashMap for caches
- RetryPolicy: Stateless, safe for concurrent use
Concurrency Design¶
// Safe: Single client, multiple threads
NoriKVClient client = new NoriKVClient(config);
ExecutorService executor = Executors.newFixedThreadPool(10);
for (int i = 0; i < 100; i++) {
executor.submit(() -> {
client.put(key, value, null); // Thread-safe
});
}
Synchronization Points¶
- TopologyManager.updateView(): Write lock during view update
- ConnectionPool.getOrCreateChannel(): Double-checked locking for channel creation
- Router leader cache: ConcurrentHashMap for atomic updates
Connection Management¶
Channel Lifecycle¶
Node Address
│
├─> First request → Create ManagedChannel
│ ├─> Configure: plaintext, timeouts, keepalive
│ └─> Store in pool
│
├─> Subsequent requests → Reuse channel
│
└─> Client.close() → Shutdown all channels
├─> Graceful shutdown (5s timeout)
└─> Force shutdown if needed
Channel Configuration¶
ManagedChannel channel = ManagedChannelBuilder
.forTarget(address)
.usePlaintext() // No TLS (for now)
.build();
Health Checks¶
- Channels automatically reconnect on failure
- gRPC handles connection health internally
- Failed requests trigger retries (via RetryPolicy)
Routing & Sharding¶
Hash Function: XXHash64¶
Properties: - Fast: ~10GB/s throughput - Consistent: Same key → same hash - Cross-SDK compatible: Identical to Go/Python/TypeScript implementations
Consistent Hashing: Jump Consistent Hash¶
Properties: - Minimal key movement on shard count changes - O(log n) time complexity - Uniform distribution
Shard → Leader Mapping¶
Leader Cache: - Populated from ClusterView - Updated on topology changes - Updated from NOT_LEADER error hints
NOT_LEADER Handling¶
1. Request sent to node A for shard X
2. Node A returns NOT_LEADER error with hint: "leader is node B"
3. Client updates leader cache: shard X → node B
4. Retry policy re-sends request to node B
Retry Logic¶
Exponential Backoff¶
Example (initialDelay=100ms, maxDelay=5000ms, jitter=100ms):
Attempt 1: delay = 100ms + random(0-100ms)
Attempt 2: delay = 200ms + random(0-100ms)
Attempt 3: delay = 400ms + random(0-100ms)
Attempt 4: delay = 800ms + random(0-100ms)
Attempt 5: delay = 1600ms + random(0-100ms)
Attempt 6: delay = 3200ms + random(0-100ms)
Attempt 7: delay = 5000ms + random(0-100ms) (capped)
Jitter Benefits¶
- Avoids thundering herd (all clients retry at same time)
- Spreads load during recovery
- Reduces collision probability
Retry Decision Tree¶
Error Received
│
├─> Is retryable? (UNAVAILABLE, ABORTED, etc.)
│ │
│ ├─> YES: attempt < maxAttempts?
│ │ ├─> YES: backoff and retry
│ │ └─> NO: throw RETRY_EXHAUSTED
│ │
│ └─> NO: throw original exception
│
└─> Success: return result
Error Handling¶
Exception Hierarchy¶
NoriKVException (checked)
├── KeyNotFoundException
├── VersionMismatchException
├── AlreadyExistsException
└── ConnectionException
Error Code Mapping¶
| gRPC Status | NoriKV Exception | Retry? |
|---|---|---|
| NOT_FOUND | KeyNotFoundException | No |
| FAILED_PRECONDITION (version) | VersionMismatchException | No |
| FAILED_PRECONDITION (other) | NoriKVException | No |
| ALREADY_EXISTS | AlreadyExistsException | No |
| UNAVAILABLE | ConnectionException | Yes |
| DEADLINE_EXCEEDED | ConnectionException | Yes |
| ABORTED | NoriKVException | Yes |
| RESOURCE_EXHAUSTED | NoriKVException | Yes |
| INVALID_ARGUMENT | NoriKVException | No |
| PERMISSION_DENIED | NoriKVException | No |
| OTHER | NoriKVException | No |
Error Context¶
Exceptions include: - Error code (string) - Descriptive message - Cause (original exception) - Context (key, version for VersionMismatchException)
Performance Considerations¶
Hot Paths¶
- Hash calculation: Optimized XXHash64 implementation
- Channel lookup: O(1) ConcurrentHashMap lookup
- Leader cache: O(1) lookup, populated eagerly
- Protobuf serialization: Minimal overhead
Memory Usage¶
- Per client: ~1-10 MB (depends on number of nodes)
- Per channel: ~100 KB (gRPC overhead)
- Per request: Minimal (request/response objects garbage collected)
Connection Pooling¶
- Channels reused across requests
- No connection per request overhead
- gRPC multiplexes requests over HTTP/2
Benchmarks¶
See the SDK documentation for detailed performance metrics.
Observability¶
Client Statistics¶
ClientStats stats = client.getStats();
// Router stats
stats.getRouterStats().getTotalNodes();
stats.getRouterStats().getTotalShards();
// Connection pool stats
stats.getPoolStats().getActiveChannels();
// Topology stats
stats.getTopologyStats().getCurrentEpoch();
stats.getTopologyStats().getCachedLeaders();
Topology Change Events¶
client.onTopologyChange(event -> {
logger.info("Topology changed to epoch {}", event.getCurrentEpoch());
logger.info("Added nodes: {}", event.getAddedNodes());
logger.info("Removed nodes: {}", event.getRemovedNodes());
logger.info("Leader changes: {}", event.getLeaderChanges());
});
Logging¶
Client uses System.err for internal errors (listener failures, etc.). Consider adding proper logging framework integration in production.
Extensibility¶
Custom Retry Policies¶
Implement custom backoff strategies:
RetryConfig custom = RetryConfig.builder()
.maxAttempts(20)
.initialDelayMs(50)
.maxDelayMs(10000)
.jitterMs(200)
.build();
Future Extensions¶
Potential areas for extension: - Custom hash functions (requires cross-SDK coordination) - TLS/SSL support - Authentication integration - Metrics export (Prometheus, OTLP) - Circuit breaker patterns - Request tracing
Comparison with Other SDKs¶
Similarities¶
- Same hash functions (XXHash64 + Jump Consistent Hash)
- Same routing algorithm
- Same proto definitions
- Same error handling patterns
Java-Specific Features¶
- Builder pattern for configuration
- Try-with-resources (AutoCloseable)
- Java streams and functional interfaces
- Strong typing with generics
Performance¶
Java SDK performance is comparable to other SDKs: - Go: Fastest (native concurrency, minimal GC) - Java: Fast (JIT optimization, mature runtime) - TypeScript: Good (V8 optimization) - Python: Slower (GIL limitations, interpreted)
References¶
- API Guide - Public API documentation
- Troubleshooting Guide - Common issues
- Advanced Patterns - Complex use cases
- Source Code - Implementation