Memtable Management¶
The in-memory skiplist: writes, rotation triggers, concurrent access, and memory accounting.
Overview¶
The memtable is nori-lsm's write buffer: an in-memory data structure that accepts all writes before they're flushed to disk as SSTables.
Key properties: - Data structure: Lock-free skiplist (O(log n) writes, O(log n) reads) - Bounded size: Default 64 MB, configurable - Sorted: Keys maintained in lexicographic order - Concurrent: Lock-free reads, synchronized writes - Two-tier: Active (mutable) + Immutable (flushing)
Memtable Lifecycle¶
stateDiagram-v2
[*] --> Active: New memtable created
Active --> Immutable: Size threshold reached
Immutable --> Flushing: Flush task started
Flushing --> [*]: SSTable written, memtable dropped
Active: Accepts writes
Active: Serves reads
Immutable: Read-only
Immutable: Waiting for flush
Flushing: Sorted iteration
Flushing: SSTable serialization
States:
1. Active: Current write target, accepts put() and delete()
2. Immutable: Read-only, waiting to be flushed
3. Flushing: Background task serializing to SSTable
4. Dropped: Flushed successfully, memory reclaimed
Skiplist Data Structure¶
nori-lsm uses crossbeam-skiplist, a lock-free concurrent skiplist.
Why Skiplist?¶
Alternatives considered:
| Data Structure | Write Complexity | Read Complexity | Sorted? | Concurrent? | Chosen? |
|---|---|---|---|---|---|
| Skiplist | O(log n) | O(log n) | Yes | Lock-free | Yes |
| HashMap | O(1) | O(1) | No | Sharded | No |
| BTreeMap | O(log n) | O(log n) | Yes | Mutex | No |
| RBTree | O(log n) | O(log n) | Yes | Mutex | No |
Why skiplist won: - Lock-free: No contention on reads (critical for LSMs) - Sorted: Enables efficient range scans - Fast iteration: O(n) scan for flush (no rebalancing needed) - Predictable: Probabilistic balancing, no worst-case rotations
Skiplist Structure¶
Level 3: 1 -----------------> 20 ------------------------> ∞
↓ ↓ ↓
Level 2: 1 --------> 10 ----> 20 --------> 30 ----------> ∞
↓ ↓ ↓ ↓ ↓
Level 1: 1 -> 5 ---> 10 ----> 20 -> 25 -> 30 -> 35 -----> ∞
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Level 0: 1 -> 5 -> 8 -> 10 -> 20 -> 25 -> 30 -> 35 -> 40 -> ∞
(Pointers shown as arrows, keys shown as numbers)
Search for key 25: 1. Start at Level 3, head node (1) 2. 1 → 20 (20 ≤ 25, continue) 3. 20 → ∞ (∞ > 25, drop to Level 2) 4. 20 → 30 (30 > 25, drop to Level 1) 5. 20 → 25 (found!)
Expected hops: O(log n) = 3 hops for 100 keys, 7 hops for 10K keys
Write Operations¶
Put (Insert or Update)¶
pub fn put(&self, key: Vec<u8>, value: Vec<u8>) -> Result<()> {
// 1. Calculate entry size
let entry_size = key.len() + value.len() + 8; // 8 bytes overhead
// 2. Check if rotation needed (before insert)
if self.active_size.load(Ordering::Relaxed) + entry_size > self.config.memtable_max {
self.rotate_memtable()?;
}
// 3. Insert into skiplist (lock-free)
self.active_memtable.insert(key.clone(), value);
// 4. Update size counter (atomic)
self.active_size.fetch_add(entry_size, Ordering::Relaxed);
// 5. Update write counter for heat tracking
self.total_writes.fetch_add(1, Ordering::Relaxed);
Ok(())
}
Concurrency:
- Multiple threads can put() concurrently
- Skiplist handles lock-free insertion
- Only active_size counter needs atomic ops
Delete (Tombstone)¶
pub fn delete(&self, key: Vec<u8>) -> Result<()> {
// Deletions are just puts with a tombstone marker
let tombstone = vec![0xFF, 0xFF, 0xFF, 0xFF]; // Sentinel value
self.put(key, tombstone)
}
Why tombstones? - LSMs are immutable after flush - Can't delete from already-written SSTables - Tombstones shadow older values during compaction - Eventually garbage collected when no lower levels remain
Rotation Triggers¶
Memtable rotation happens when size exceeds threshold:
Size-Based Rotation (Primary)¶
fn should_rotate(&self) -> bool {
self.active_size.load(Ordering::Relaxed) >= self.config.memtable_max
}
Default threshold: 64 MB
Rationale: - Not too small: Reduce flush overhead (200ms per flush) - Not too large: Avoid long recovery time (scan 64 MB skiplist in ~100ms) - Power-of-2: Cache-friendly, easier to reason about
Rotation Process¶
fn rotate_memtable(&mut self) -> Result<()> {
// 1. Lock rotation mutex (prevents concurrent rotations)
let _guard = self.rotation_lock.lock();
// 2. Swap active memtable
let old_active = std::mem::replace(
&mut self.active_memtable,
SkipList::new(), // New empty memtable
);
// 3. Move to immutable list
self.immutable_memtables.push(old_active);
// 4. Reset size counter
self.active_size.store(0, Ordering::Relaxed);
// 5. Spawn flush task (async)
let immutable = self.immutable_memtables[0].clone();
tokio::spawn(async move {
self.flush_memtable(immutable).await?;
});
Ok(())
}
Key insight: Rotation is fast (<10μs) because: - No data copying (just pointer swap) - New memtable starts empty - Flush happens asynchronously in background
Active vs Immutable Memtables¶
nori-lsm maintains two classes of memtables:
┌──────────────────────────────────────┐
│ Active Memtable (1 instance) │
│ - Accepts writes │
│ - Serves reads │
│ - Bounded by memtable_max │
└──────────────────────────────────────┘
↓ (rotation)
┌──────────────────────────────────────┐
│ Immutable Memtables (0-2 instances) │
│ - Read-only │
│ - Waiting to flush │
│ - Queued for background task │
└──────────────────────────────────────┘
↓ (flush)
┌──────────────────────────────────────┐
│ L0 SSTables (on disk) │
└──────────────────────────────────────┘
Why allow multiple immutable memtables? - Burst tolerance: If flushes are slow (disk I/O), queue up to 2 immutable memtables - Backpressure: If 2 immutable memtables exist, block writes until flush completes - Memory bound: At most 3×memtable_max memory usage (1 active + 2 immutable)
Read Path Across Memtables¶
pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
// 1. Check active memtable first (most recent)
if let Some(value) = self.active_memtable.get(key) {
return Some(value.clone());
}
// 2. Check immutable memtables (newest to oldest)
for memtable in self.immutable_memtables.iter().rev() {
if let Some(value) = memtable.get(key) {
return Some(value.clone());
}
}
// 3. Not in memtables, check L0 and slots
None
}
Time complexity: - Active memtable: O(log n) - Immutable memtables: O(k × log n) where k = 0-2 - Total: O(log n) in common case
Memory Accounting¶
Accurate memory tracking is critical for rotation triggers and backpressure.
Entry Size Calculation¶
fn entry_size(key: &[u8], value: &[u8]) -> usize {
// Key size
key.len() +
// Value size
value.len() +
// Skiplist node overhead (pointer + level array)
std::mem::size_of::<usize>() * 4 // 4 pointers average (levels 0-3)
}
Why 4 pointers? - Skiplist levels are probabilistic: ½ chance of level 1, ¼ for level 2, etc. - Expected value: 1 + 0.5 + 0.25 + 0.125 + ... ≈ 2 - Conservative estimate: 4 pointers (32 bytes on 64-bit)
Size Tracking¶
pub struct Memtable {
skiplist: SkipList<Vec<u8>, Vec<u8>>,
/// Atomic counter of total bytes (keys + values + overhead)
size: AtomicUsize,
/// Number of entries (for metrics)
entry_count: AtomicU64,
}
impl Memtable {
pub fn insert(&self, key: Vec<u8>, value: Vec<u8>) {
let entry_size = Self::entry_size(&key, &value);
self.skiplist.insert(key, value);
// Update counters atomically
self.size.fetch_add(entry_size, Ordering::Relaxed);
self.entry_count.fetch_add(1, Ordering::Relaxed);
}
pub fn total_size(&self) -> usize {
self.size.load(Ordering::Relaxed)
}
}
Why Ordering::Relaxed?
- Size counter doesn't need strict ordering (approximate is fine)
- Rotation check happens with mutex (sequential consistency guaranteed there)
- Relaxed atomics are faster (no memory fence overhead)
Concurrent Access Patterns¶
Single Writer, Multiple Readers (SWMR)¶
Write path:
// Thread 1: put("a", "1")
self.active_memtable.insert("a", "1"); // Lock-free
// Thread 2: put("b", "2") (concurrent)
self.active_memtable.insert("b", "2"); // Lock-free, no contention
Read path:
// Thread 3: get("a")
self.active_memtable.get("a"); // Lock-free, concurrent with writes
// Thread 4: get("c")
self.active_memtable.get("c"); // Lock-free, concurrent with everything
Rotation (rare):
// Thread 1: Rotation triggered
let _guard = self.rotation_lock.lock(); // Mutex (blocks other rotations)
self.active_memtable = new_memtable; // Atomic pointer swap
drop(_guard); // Release lock
Linearizability¶
Guarantee: All operations appear to execute atomically in some sequential order.
Example:
Thread 1: put("x", "1") [starts at t=0, completes at t=5]
Thread 2: get("x") [starts at t=3, completes at t=7]
Linearization point for put: t=2 (insert into skiplist)
Linearization point for get: t=4 (skiplist search)
Because t=4 > t=2, get MUST see "1" (not None)
Crossbeam-skiplist guarantees: - Inserts are linearizable at the moment the node is visible - Reads are linearizable at the moment the search completes - No torn reads (always see consistent key-value pairs)
Performance Characteristics¶
Write Performance¶
Benchmark (64 MB memtable, 1 KB entries):
Operation Latency (p50) Latency (p95) Throughput
─────────────────────────────────────────────────────────────
put (no flush) 50ns 120ns 20M ops/sec
put (with flush) 1.2ms 2.1ms 800 ops/sec
delete 50ns 120ns 20M ops/sec
Interpretation: - Memtable inserts are extremely fast (50ns = 20M ops/sec) - Flush dominates latency (1-2ms due to WAL fsync) - No performance difference between put and delete (both skiplist inserts)
Memory Overhead¶
Per-entry overhead:
Key: 16 bytes (e.g., "user:12345678")
Value: 1 KB
Skiplist overhead: 32 bytes (4 pointers × 8 bytes)
Total: 16 + 1024 + 32 = 1072 bytes
Overhead: 32/1072 = 3%
Total memtable overhead: - 64 MB memtable with 1 KB entries: ~60K entries - Skiplist overhead: 60K × 32 bytes = 1.92 MB - Percentage: 1.92 / 64 = 3% overhead
Rotation Overhead¶
Rotation time breakdown:
Lock acquisition: 50ns
Pointer swap: 10ns
Size counter reset: 5ns
Spawn flush task: 1μs
Total rotation time: ~1.1μs
Rotation frequency: - 64 MB memtable at 20K writes/sec (1 KB entries) - Time to fill: 64 MB / (20K × 1 KB) = 3.2 seconds - Rotations per hour: 3600 / 3.2 = 1,125 rotations/hour
Impact: 1.1μs × 1,125 = 1.2ms overhead per hour (negligible)
Edge Cases and Error Handling¶
1. Rotation During Concurrent Writes¶
Scenario: Thread A triggers rotation while Thread B is writing.
// Thread A
if self.active_size >= THRESHOLD {
self.rotate_memtable(); // Acquires rotation_lock
}
// Thread B (concurrent)
self.active_memtable.insert(key, value); // Which memtable?
Solution: Rotation lock ensures atomic swap.
fn put(&self, key: Vec<u8>, value: Vec<u8>) -> Result<()> {
loop {
let memtable = self.active_memtable.load(Ordering::Acquire);
if memtable.insert_if_active(key.clone(), value.clone()) {
return Ok(());
}
// Memtable rotated, retry with new active memtable
}
}
2. Out of Memory During Rotation¶
Scenario: 2 immutable memtables already queued, new rotation would exceed memory limit.
fn rotate_memtable(&mut self) -> Result<()> {
if self.immutable_memtables.len() >= 2 {
// Block writes until flush completes
return Err(Error::MemoryLimitExceeded);
}
// Proceed with rotation
}
Backpressure: - Red zone (memory pressure > 90%): Reject writes - Orange zone (75-90%): Delay writes exponentially - Yellow zone (50-75%): Warn, continue
3. Flush Failure¶
Scenario: Disk I/O error during flush.
async fn flush_memtable(&mut self, memtable: Memtable) -> Result<()> {
match self.write_sstable(memtable).await {
Ok(sst_path) => {
// Remove flushed memtable from immutable list
self.immutable_memtables.remove(0);
Ok(())
}
Err(e) => {
// Keep memtable in immutable list, retry later
log::error!("Flush failed: {}", e);
Err(e)
}
}
}
Retry policy: - Exponential backoff: 1s, 2s, 4s, 8s, ... - Max retries: 10 - After 10 failures: Shut down LSM (prevent data loss)
Code Example: Complete Memtable Implementation¶
use crossbeam_skiplist::SkipMap;
use std::sync::atomic::{AtomicUsize, AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
pub struct MemtableManager {
/// Currently active memtable (accepts writes)
active: Arc<Memtable>,
/// Immutable memtables (flushing or queued)
immutable: Vec<Arc<Memtable>>,
/// Rotation lock (prevents concurrent rotations)
rotation_lock: Mutex<()>,
/// Configuration
config: Config,
}
pub struct Memtable {
/// Lock-free skiplist
skiplist: SkipMap<Vec<u8>, Vec<u8>>,
/// Total bytes (keys + values + overhead)
size: AtomicUsize,
/// Entry count
count: AtomicU64,
}
impl Memtable {
pub fn new() -> Self {
Self {
skiplist: SkipMap::new(),
size: AtomicUsize::new(0),
count: AtomicU64::new(0),
}
}
pub fn insert(&self, key: Vec<u8>, value: Vec<u8>) {
let entry_size = key.len() + value.len() + 32;
self.skiplist.insert(key, value);
self.size.fetch_add(entry_size, Ordering::Relaxed);
self.count.fetch_add(1, Ordering::Relaxed);
}
pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
self.skiplist.get(key).map(|entry| entry.value().clone())
}
pub fn total_size(&self) -> usize {
self.size.load(Ordering::Relaxed)
}
pub fn len(&self) -> u64 {
self.count.load(Ordering::Relaxed)
}
/// Iterate in sorted order (for flush)
pub fn iter(&self) -> impl Iterator<Item = (Vec<u8>, Vec<u8>)> + '_ {
self.skiplist.iter().map(|entry| {
(entry.key().clone(), entry.value().clone())
})
}
}
Summary¶
Memtable management is critical for LSM write performance:
- Lock-free skiplist - 20M ops/sec throughput, O(log n) complexity
- Size-based rotation - 64 MB threshold balances flush overhead and recovery time
- Active + Immutable - Allows async flushing without blocking writes
- Atomic counters - Accurate memory tracking with minimal overhead
- Backpressure - Prevents OOM by rejecting writes when memory exhausted
Next: Flush Process - How memtables become SSTables
Last Updated: 2025-10-31 See Also: Write Path, Flush Process