How It Works¶
Deep dive into nori-lsm's internals: memtable management, flush process, slot routing, compaction lifecycle, and snapshots.
What You'll Learn¶
This section explains the technical implementation details of nori-lsm's ATLL architecture. If you want to understand:
- How writes flow from memtable → L0 → slots
- How keys are routed to slots using guard-based partitioning
- How compaction works with the bandit scheduler
- How memtable rotation and flushing work
- How consistent snapshots are created for backups
- How the entire system coordinates write path, read path, and background jobs
...then you're in the right place.
Navigation¶
Memtable Management¶
The in-memory skiplist: writes, rotation triggers, concurrent access, and memory accounting.
Flush Process¶
How memtables are flushed to L0 SSTables. Sorting, serialization, bloom filter generation, and atomic swaps.
Slot Routing¶
How keys are mapped to slots using binary search on guard keys. Range queries and slot overlap detection.
Compaction Lifecycle¶
The full compaction process: slot selection via bandit scheduler, K-way merge, heat updates, and metrics.
Snapshot Process¶
Creating consistent point-in-time snapshots for backups, replication, and disaster recovery.
Who Should Read This¶
You should read this section if: - You're implementing your own LSM-tree - You're debugging nori-lsm behavior - You're contributing to nori-lsm - You want deep technical understanding of ATLL - You're evaluating LSM implementations
You can skip this section if: - You just want to use nori-lsm (see Getting Started - coming soon) - You want high-level concepts (see Core Concepts) - You want design rationale (see Design Decisions)
Prerequisites¶
Before diving in, make sure you understand:
- What is LSM - The fundamental concept
- ATLL Architecture - Guard keys, K-way fanout, heat tracking
- Write Path - High-level flow: WAL → memtable → L0 → compaction
- Read Path - High-level flow: memtable → L0 → slot
If you haven't read those yet, start there first!
Architecture Overview¶
Here's how nori-lsm components fit together:
┌──────────────────────────────────────────────────────────────┐
│ Application Layer │
│ put(key, value) / get(key) │
└──────────────────────┬───────────────────────┬────────────────┘
│ │
┌─────────▼──────────┐ ┌────────▼────────┐
│ Write Path │ │ Read Path │
└─────────┬──────────┘ └────────┬────────┘
│ │
┌──────────────▼──────────────┐ │
│ 1. WAL (durability) │ │
│ 2. Memtable (in-memory) │ │
└──────────────┬──────────────┘ │
│ │
Flush (async) │
│ │
┌──────────────▼──────────────────────▼────────────┐
│ L0 (unsorted) │
│ [SST-001] [SST-002] [SST-003] [SST-004] ... │
└──────────────┬───────────────────────────────────┘
│
Slot Routing
(guard keys)
│
┌──────────────▼──────────────────────────────┐
│ Slot 0 [0x00, 0x40) │
│ [Run 0] K-max=1 (leveled, hot) │
├──────────────────────────────────────────────┤
│ Slot 1 [0x40, 0x80) │
│ [Run 0] [Run 1] K-max=2 (hybrid, warm) │
├──────────────────────────────────────────────┤
│ Slot 2 [0x80, 0xC0) │
│ [Run 0] [Run 1] [Run 2] K-max=3 (cold) │
├──────────────────────────────────────────────┤
│ Slot 3 [0xC0, 0xFF] │
│ [Run 0] [Run 1] [Run 2] [Run 3] K-max=4 │
└──────────────┬───────────────────────────────┘
│
Compaction
(bandit scheduler)
│
┌──────────────▼──────────────────────────────┐
│ Merged SSTables (sorted, non-overlapping │
│ per slot) │
└──────────────────────────────────────────────┘
Key components: - Memtable: In-memory skiplist, bounded by size (default 64 MB) - L0: Unsorted SSTables from memtable flushes - Slots: Range-partitioned levels with guard keys - Compaction Scheduler: Bandit-based slot selection - Snapshot Manager: Consistent backups via reference counting
Code Organization¶
nori-lsm is organized into modules:
nori-lsm/
src/
lib.rs - Public API (LSM, Config, Storage trait)
memtable.rs - In-memory skiplist with rotation
flush.rs - Memtable → L0 SSTable conversion
slot.rs - Slot management (guard keys, runs, heat)
compaction.rs - K-way merge algorithm
scheduler.rs - Bandit-based compaction scheduling
snapshot.rs - Snapshot creation and management
manifest.rs - Metadata persistence (slots, runs, versions)
pressure.rs - 4-zone backpressure system
observe.rs - Observability (metrics, VizEvent)
Each "How It Works" page corresponds to one or more of these modules.
Reading Order¶
We recommend reading in this order:
- Memtable Management - Start here to understand the write buffer
- Flush Process - How memtables become SSTables
- Slot Routing - How guard-based partitioning works
- Compaction Lifecycle - The most complex subsystem
- Snapshot Process - Backups and consistency
You can read them independently, but they build on each other.
Operational Flow¶
Here's a typical sequence of operations:
Write Flow¶
1. Client: put("user:12345", "data")
2. WAL: Append to 000042.wal, fsync
3. Memtable: Insert into skiplist (50ns)
4. Check pressure: Green zone, no delay
5. Return Ok(())
[Background thread, triggered when memtable > 64 MB]
6. Rotate memtable: Active → Immutable
7. Create new active memtable
8. Flush immutable memtable:
- Sort 1M entries
- Build SSTable with bloom filter
- Write 000015.sst to L0
9. Update manifest: L0 += 000015.sst
10. Delete WAL segment (no longer needed)
Compaction Flow¶
1. Scheduler: Select slot via epsilon-greedy + UCB
2. Slot 1 selected (heat=0.8, k_max=1, 3 L0 files overlap)
3. Compaction:
- Open 3 L0 SSTables + 1 existing run
- 4-way merge (K=4 → K=1 leveled)
- Write merged SSTable: 000016.sst
- Build bloom filter (10,000 keys, 0.01 FPR)
4. Update slot:
- Remove old run
- Add new run: 000016.sst
- Update heat score (EWMA decay)
5. Update manifest atomically
6. Delete old SSTables (reference counting)
7. Emit VizEvent::Compaction with metrics
Read Flow¶
1. Client: get("user:12345")
2. Memtable: Check active, miss
3. Memtable: Check immutable (if exists), miss
4. L0: Check bloom filters (6 files)
- 5 negative (bloom says "no")
- 1 maybe (bloom says "maybe")
- Read SSTable 000015.sst, found!
5. Return value (skip slot traversal)
[Alternative: key not in L0]
6. Slot routing: Binary search guard keys → Slot 1
7. Slot 1: Check bloom filter → maybe
8. Slot 1: Binary search index, read block
9. Return value or NotFound
Visual Learning¶
Each page includes:
- Mermaid diagrams - Flowcharts and sequence diagrams
- ASCII diagrams - Memory layouts, tree structures
- Code snippets - Actual implementation from nori-lsm
- Examples - Concrete scenarios and edge cases
- Metrics - Performance characteristics and benchmarks
If you're a visual learner, you'll love this section!
Concurrency Model¶
nori-lsm uses a hybrid concurrency approach:
Lock-Free Reads: - Memtable: Lock-free skiplist (crossbeam-skiplist) - SSTables: Immutable files, no locks needed - Manifest: RwLock (many readers, single writer)
Synchronized Writes: - WAL append: Mutex (sequential writes for durability) - Memtable rotation: Mutex (rare, <1/min) - Compaction: Per-slot locks (parallel compaction across slots)
Example concurrency:
Thread 1: put("a", "1") → WAL mutex → memtable lock-free
Thread 2: put("b", "2") → WAL mutex → memtable lock-free
Thread 3: get("c") → memtable lock-free → L0 lock-free
Thread 4: compaction → Slot 0 mutex (doesn't block Slot 1-15)
Performance Characteristics¶
From benchmarks:
Write latency: - Memtable insert: 50ns (in-memory skiplist) - WAL append: 1-2ms (fsync, dominant cost) - p95: 2.1ms (Green zone), 20-40ms (Yellow zone)
Flush latency: - 64 MB memtable → SSTable: ~200ms - Bloom filter generation: ~50ms (100K keys) - Total: ~250ms per flush
Compaction latency: - 10 MB 4-way merge: ~100ms - 100 MB 4-way merge: ~800ms - 1 GB 4-way merge: ~8 seconds
Throughput: - Sustained writes: 20K ops/sec (single memtable) - Peak writes: 50K ops/sec (before backpressure) - Compaction throughput: 100-150 MB/sec
Next Steps¶
Ready to dive in? Start with:
- Memtable Management - Learn how the write buffer works
- Or jump to a specific topic that interests you
If you're looking for something else: - Core Concepts - High-level architecture and theory - Design Decisions - Why ATLL is built this way