Design Decisions¶

Rationale behind ATLL's key architectural choices and trade-offs.

Overview¶

This section explains why nori-lsm's ATLL (Adaptive Tiered-Leveled LSM) is built the way it is. Each design decision represents a conscious trade-off, balancing performance, adaptability, and operational simplicity.

Unlike traditional LSMs that force a global choice between leveled (high WA) or tiered (high RA), ATLL makes per-range decisions, adapting to heterogeneous workloads.

Key Decisions¶

Guard-Based Partitioning ¶

Why we use fixed guard keys for range partitioning instead of dynamic splitting: predictable behavior, simple recovery, and stable boundaries for learning.

Adaptive K-Way Fanout ¶

Why each slot chooses its own K (max runs) based on heat: RUM optimization per range, not global compromise.

Heat Tracking (EWMA)¶

Why we use exponential weighted moving average (α=0.1) for access pattern detection: recency bias, convergence properties, low overhead.

Bandit Scheduler ¶

Why epsilon-greedy multi-armed bandits with UCB for compaction scheduling: balances exploration vs exploitation with proven regret bounds.

Memory Pressure System ¶

Why 4-zone adaptive backpressure (Green/Yellow/Orange/Red) with composite scoring: progressive degradation, not cliff-edge failures.

Design Philosophy¶

ATLL follows these core principles:

1. Adapt, Don't Compromise¶

Traditional LSMs force a global trade-off: - Leveled: Fast reads, slow writes - Tiered: Fast writes, slow reads

ATLL: Adapt per key range based on actual access patterns.

2. Learn Online, Not Offline¶

Some LSM tuning systems (e.g., Monkey) require offline learning phases.

ATLL: Continuous online learning via EWMA heat tracking and bandit scheduler.

3. Simplicity Through Automation¶

Manual LSM tuning requires expertise: - RocksDB: bloom bits, compaction threads, L0 thresholds, compaction style - Cassandra: window sizes, bucket counts, SSTable sizes

ATLL: Self-tuning via reinforcement learning.

4. Theory-Guided, Measurement-Validated¶

Decisions backed by: - Theory: RUM conjecture, UCB regret bounds, EWMA convergence - Measurement: Benchmarks with Zipfian workloads, real metrics

5. Fail Gracefully, Not Catastrophically¶

Hard stalls (reject writes) are a last resort.

ATLL: Progressive backpressure (soft throttling → heavy throttling → hard stall).

Decision-Making Framework¶

When designing ATLL, we prioritize:

1. Correctness First - Never sacrifice data integrity for performance - Durability via WAL, crash-safe compaction - Last-write-wins semantics (no lost updates)

2. Adaptive Over Static - Access patterns change over time - System learns and adapts (not manual re-tuning) - Resilient to workload shifts

3. Measurable Trade-offs - Document what we gained and what we gave up - Benchmark claims (not just theory) - Quantify costs (WA, RA, SA)

4. Operational Simplicity - Minimal knobs (defaults work for 80% of use cases) - Observable behavior (heat scores, bandit metrics, pressure zones) - Predictable failure modes

Trade-offs¶

Every design decision involves trade-offs. Here's what we gained vs what we gave up:

Decision	Gained	Cost
Fixed guard keys	Predictable routing, simple recovery	Can't rebalance without data migration
Per-slot K-way fanout	Adaptive WA/RA per range	More complex than global strategy
EWMA heat tracking	Recency bias, low overhead	Slower adaptation than counters
Bandit scheduler	Automatic optimization	10% exploration overhead (ε=0.1)
4-zone backpressure	Graceful degradation	More complex than hard stall
Bloom filters	99% faster negative lookups	125 KB per 100K keys (memory)
Slot-local compaction	Parallel compaction	Must coordinate cross-slot queries

Overall: ATLL sacrifices simplicity for adaptability, achieving near-Pareto-optimal performance for heterogeneous workloads.

Comparison to Alternatives¶

ATLL vs RocksDB (Pure Leveled)¶

RocksDB approach: - Global leveled compaction (K=1 everywhere) - Manual tuning (L0 triggers, compaction threads, bloom bits) - Optimized for read-heavy workloads

ATLL improvements: - 2-5x lower WA (via adaptive tiering for cold ranges) - Automatic tuning (no manual intervention) - Better for heterogeneous workloads

ATLL vs Cassandra (Pure Tiered)¶

Cassandra approach: - Global size-tiered compaction (K>1 everywhere) - Optimized for write-heavy workloads - High RA for point queries

ATLL improvements: - 2x lower RA for hot ranges (K=1 via adaptation) - Maintains low WA for cold ranges (K>1) - Better for mixed workloads

ATLL vs ScyllaDB ICS (Time-Windowed)¶

ScyllaDB approach: - Time-window bucketing (recent=leveled, old=tiered) - Assumes time-series workload - Requires manual window configuration

ATLL improvements: - Access-pattern-based (not time-based, works for any workload) - Automatic adaptation (no window tuning) - General-purpose key-value store

ATLL vs Monkey (ML-Based Tuning)¶

Monkey approach: - Offline learning of optimal T (fanout) and bloom bits - Theoretically optimal for static workloads - Academic research project

ATLL improvements: - Per-slot adaptation (finer granularity than global T) - Online learning (no offline phase) - Production-ready implementation

Non-Goals¶

It's important to document what ATLL is not designed to do:

Not a Generic Compaction Strategy¶

ATLL is optimized for heterogeneous workloads (hot/cold data).

Not ideal for: - Uniform access patterns (all keys equally hot) - Pure sequential scans (no point queries) - Tiny datasets (<100 MB, fits in memory)

Not a Manual Tuning Framework¶

ATLL is designed to self-tune.

We intentionally avoid: - Per-range compaction strategy overrides - Manual K-max configuration - Expert-mode knobs for every parameter

Not a Distributed System¶

ATLL is a single-node storage engine.

Not included: - Replication (use nori-raft on top) - Sharding (use norikv-placement) - Consensus (use nori-raft)

Evolution of Design¶

Some decisions were made early and have proven stable. Others evolved based on experience:

Stable Since v0.1¶

Guard-based range partitioning
EWMA heat tracking (α=0.1)
Epsilon-greedy bandit (ε=0.1)
4-zone memory pressure system

Added Based on Feedback¶

Dynamic K-way fanout (v0.2) - initially K was fixed per slot
UCB exploration bonus (v0.3) - initially pure ε-greedy
Composite pressure scoring (v0.4) - initially L0-only

Future Considerations¶

Dynamic guard key adjustment - Rebalance slots when sizes diverge
Multi-dimensional heat - Separate read/write/scan heat scores
Contextual bandits - Use system state (L0 count, memory) as context
Learned guard keys - ML-based key space partitioning

Theoretical Foundations¶

ATLL's design is grounded in established theory:

RUM Conjecture (Athanassoulis et al., 2016)¶

For any data structure:
  R × U × M ≥ Constant

Where:
  R = Read cost
  U = Update (write) cost
  M = Memory cost

ATLL insight: Optimize R-U-M per slot, not globally. - Hot slots: Minimize R (accept higher U) - Cold slots: Minimize U (accept higher R)

Multi-Armed Bandit Theory¶

UCB Regret Bound:
  Regret ≤ O(K × log(T))

Where:
  K = number of arms (slots)
  T = total selections

ATLL guarantee: After T compaction decisions, regret (suboptimal choices) is logarithmically bounded.

EWMA Convergence¶

For α=0.1 and stable access pattern:
  Convergence to steady state: ~20-30 accesses

Error after t steps:
  error_t = (1-α)^t × initial_error
  error_30 = 0.9^30 ≈ 0.04 (4% error)

ATLL property: Heat scores converge within 30 accesses to stable patterns.

Validation Methodology¶

All design decisions were validated through:

1. Benchmarks¶

Zipfian workloads (80/20 hot/cold distribution)
Sustained writes (20K writes/sec for 1 hour)
Mixed read/write (50/50 split, 10K ops/sec)

2. Property Tests¶

Reward convergence (bandit scheduler)
Heat stability (EWMA tracking)
Pressure correctness (4-zone backpressure)

3. Real Metrics¶

Write amplification: 8-20x (vs 40-100x for pure leveled)
Read amplification: 5-12 (vs 10-15 for pure tiered)
p95 latency: <10ms (hot ranges)

4. Code Review¶

Academic paper citations (O'Neil 1996, Dayan 2017)
Prior art comparison (RocksDB, Cassandra, ScyllaDB)
Peer feedback from storage systems experts