Design Decisions¶
Rationale behind ATLL's key architectural choices and trade-offs.
Overview¶
This section explains why nori-lsm's ATLL (Adaptive Tiered-Leveled LSM) is built the way it is. Each design decision represents a conscious trade-off, balancing performance, adaptability, and operational simplicity.
Unlike traditional LSMs that force a global choice between leveled (high WA) or tiered (high RA), ATLL makes per-range decisions, adapting to heterogeneous workloads.
Key Decisions¶
Guard-Based Partitioning¶
Why we use fixed guard keys for range partitioning instead of dynamic splitting: predictable behavior, simple recovery, and stable boundaries for learning.
Adaptive K-Way Fanout¶
Why each slot chooses its own K (max runs) based on heat: RUM optimization per range, not global compromise.
Heat Tracking (EWMA)¶
Why we use exponential weighted moving average (α=0.1) for access pattern detection: recency bias, convergence properties, low overhead.
Bandit Scheduler¶
Why epsilon-greedy multi-armed bandits with UCB for compaction scheduling: balances exploration vs exploitation with proven regret bounds.
Memory Pressure System¶
Why 4-zone adaptive backpressure (Green/Yellow/Orange/Red) with composite scoring: progressive degradation, not cliff-edge failures.
Design Philosophy¶
ATLL follows these core principles:
1. Adapt, Don't Compromise¶
Traditional LSMs force a global trade-off: - Leveled: Fast reads, slow writes - Tiered: Fast writes, slow reads
ATLL: Adapt per key range based on actual access patterns.
2. Learn Online, Not Offline¶
Some LSM tuning systems (e.g., Monkey) require offline learning phases.
ATLL: Continuous online learning via EWMA heat tracking and bandit scheduler.
3. Simplicity Through Automation¶
Manual LSM tuning requires expertise: - RocksDB: bloom bits, compaction threads, L0 thresholds, compaction style - Cassandra: window sizes, bucket counts, SSTable sizes
ATLL: Self-tuning via reinforcement learning.
4. Theory-Guided, Measurement-Validated¶
Decisions backed by: - Theory: RUM conjecture, UCB regret bounds, EWMA convergence - Measurement: Benchmarks with Zipfian workloads, real metrics
5. Fail Gracefully, Not Catastrophically¶
Hard stalls (reject writes) are a last resort.
ATLL: Progressive backpressure (soft throttling → heavy throttling → hard stall).
Decision-Making Framework¶
When designing ATLL, we prioritize:
1. Correctness First - Never sacrifice data integrity for performance - Durability via WAL, crash-safe compaction - Last-write-wins semantics (no lost updates)
2. Adaptive Over Static - Access patterns change over time - System learns and adapts (not manual re-tuning) - Resilient to workload shifts
3. Measurable Trade-offs - Document what we gained and what we gave up - Benchmark claims (not just theory) - Quantify costs (WA, RA, SA)
4. Operational Simplicity - Minimal knobs (defaults work for 80% of use cases) - Observable behavior (heat scores, bandit metrics, pressure zones) - Predictable failure modes
Trade-offs¶
Every design decision involves trade-offs. Here's what we gained vs what we gave up:
| Decision | Gained | Cost |
|---|---|---|
| Fixed guard keys | Predictable routing, simple recovery | Can't rebalance without data migration |
| Per-slot K-way fanout | Adaptive WA/RA per range | More complex than global strategy |
| EWMA heat tracking | Recency bias, low overhead | Slower adaptation than counters |
| Bandit scheduler | Automatic optimization | 10% exploration overhead (ε=0.1) |
| 4-zone backpressure | Graceful degradation | More complex than hard stall |
| Bloom filters | 99% faster negative lookups | 125 KB per 100K keys (memory) |
| Slot-local compaction | Parallel compaction | Must coordinate cross-slot queries |
Overall: ATLL sacrifices simplicity for adaptability, achieving near-Pareto-optimal performance for heterogeneous workloads.
Comparison to Alternatives¶
ATLL vs RocksDB (Pure Leveled)¶
RocksDB approach: - Global leveled compaction (K=1 everywhere) - Manual tuning (L0 triggers, compaction threads, bloom bits) - Optimized for read-heavy workloads
ATLL improvements: - 2-5x lower WA (via adaptive tiering for cold ranges) - Automatic tuning (no manual intervention) - Better for heterogeneous workloads
ATLL vs Cassandra (Pure Tiered)¶
Cassandra approach: - Global size-tiered compaction (K>1 everywhere) - Optimized for write-heavy workloads - High RA for point queries
ATLL improvements: - 2x lower RA for hot ranges (K=1 via adaptation) - Maintains low WA for cold ranges (K>1) - Better for mixed workloads
ATLL vs ScyllaDB ICS (Time-Windowed)¶
ScyllaDB approach: - Time-window bucketing (recent=leveled, old=tiered) - Assumes time-series workload - Requires manual window configuration
ATLL improvements: - Access-pattern-based (not time-based, works for any workload) - Automatic adaptation (no window tuning) - General-purpose key-value store
ATLL vs Monkey (ML-Based Tuning)¶
Monkey approach: - Offline learning of optimal T (fanout) and bloom bits - Theoretically optimal for static workloads - Academic research project
ATLL improvements: - Per-slot adaptation (finer granularity than global T) - Online learning (no offline phase) - Production-ready implementation
Non-Goals¶
It's important to document what ATLL is not designed to do:
Not a Generic Compaction Strategy¶
ATLL is optimized for heterogeneous workloads (hot/cold data).
Not ideal for: - Uniform access patterns (all keys equally hot) - Pure sequential scans (no point queries) - Tiny datasets (<100 MB, fits in memory)
Not a Manual Tuning Framework¶
ATLL is designed to self-tune.
We intentionally avoid: - Per-range compaction strategy overrides - Manual K-max configuration - Expert-mode knobs for every parameter
Not a Distributed System¶
ATLL is a single-node storage engine.
Not included: - Replication (use nori-raft on top) - Sharding (use norikv-placement) - Consensus (use nori-raft)
Evolution of Design¶
Some decisions were made early and have proven stable. Others evolved based on experience:
Stable Since v0.1¶
- Guard-based range partitioning
- EWMA heat tracking (α=0.1)
- Epsilon-greedy bandit (ε=0.1)
- 4-zone memory pressure system
Added Based on Feedback¶
- Dynamic K-way fanout (v0.2) - initially K was fixed per slot
- UCB exploration bonus (v0.3) - initially pure ε-greedy
- Composite pressure scoring (v0.4) - initially L0-only
Future Considerations¶
- Dynamic guard key adjustment - Rebalance slots when sizes diverge
- Multi-dimensional heat - Separate read/write/scan heat scores
- Contextual bandits - Use system state (L0 count, memory) as context
- Learned guard keys - ML-based key space partitioning
Theoretical Foundations¶
ATLL's design is grounded in established theory:
RUM Conjecture (Athanassoulis et al., 2016)¶
For any data structure:
R × U × M ≥ Constant
Where:
R = Read cost
U = Update (write) cost
M = Memory cost
ATLL insight: Optimize R-U-M per slot, not globally. - Hot slots: Minimize R (accept higher U) - Cold slots: Minimize U (accept higher R)
Multi-Armed Bandit Theory¶
ATLL guarantee: After T compaction decisions, regret (suboptimal choices) is logarithmically bounded.
EWMA Convergence¶
For α=0.1 and stable access pattern:
Convergence to steady state: ~20-30 accesses
Error after t steps:
error_t = (1-α)^t × initial_error
error_30 = 0.9^30 ≈ 0.04 (4% error)
ATLL property: Heat scores converge within 30 accesses to stable patterns.
Validation Methodology¶
All design decisions were validated through:
1. Benchmarks¶
- Zipfian workloads (80/20 hot/cold distribution)
- Sustained writes (20K writes/sec for 1 hour)
- Mixed read/write (50/50 split, 10K ops/sec)
2. Property Tests¶
- Reward convergence (bandit scheduler)
- Heat stability (EWMA tracking)
- Pressure correctness (4-zone backpressure)
3. Real Metrics¶
- Write amplification: 8-20x (vs 40-100x for pure leveled)
- Read amplification: 5-12 (vs 10-15 for pure tiered)
- p95 latency: <10ms (hot ranges)
4. Code Review¶
- Academic paper citations (O'Neil 1996, Dayan 2017)
- Prior art comparison (RocksDB, Cassandra, ScyllaDB)
- Peer feedback from storage systems experts
Further Reading¶
Each subsection dives deep into a specific design decision. Read them in order to build understanding, or jump to specific topics:
- Guard-Based Partitioning - Foundation of ATLL's range-level adaptation
- Adaptive K-Way Fanout - How K-max is computed per slot
- Heat Tracking - EWMA algorithm and convergence analysis
- Bandit Scheduler - Epsilon-greedy UCB and reward function
- Memory Pressure - 4-zone backpressure system
Last Updated: 2025-10-31 References: Athanassoulis et al. (RUM, 2016), Dayan et al. (Monkey, 2017), Auer et al. (UCB, 2002)