When to Use SSTables¶

Practical guidance on when SSTables are the right choice for your storage needs.

Decision Matrix¶

Great Fit¶

Use Case	Why SSTables Excel
Write-heavy workloads	Sequential writes 10-100x faster than B-trees
Time-series data	Append-only, natural sort order, range scans
Event logging	Immutability = perfect audit trail
LSM storage engines	Core building block (LevelDB, RocksDB pattern)
Hot key patterns	Bloom filters + cache = <10µs reads
Batch writes	Amortize flush cost over many entries
Large datasets	Scales to TB without in-memory index

Not the Right Tool¶

Use Case	Why Not
Ultra-low latency (<1µs)	Disk I/O inherently slower
Heavy random updates	Creates many versions, space amplification
Tiny datasets (<1MB)	Fixed overhead dominates
No compaction budget	Read amplification grows unbounded
In-memory only	Better to use HashMap/BTreeMap

Workload Patterns¶

Append-Mostly (Perfect Fit)¶

// Event sourcing, logs, time-series
for event in events {
    sst_builder.add(&Entry::put(
        format!("event:{}", event.timestamp).as_bytes(),
        event.data
    )).await?;
}

Why it works: - Natural sort order (timestamps) - Rare updates/deletes - Sequential writes = maximum throughput - Range scans over time windows

Performance: 100K+ writes/sec

Read-Heavy with Hot Keys (Good Fit)¶

// User profiles, product catalog
cache_hit_rate: 80%+
    ↓
p95 latency: <1ms (cache)
p99 latency: ~5ms (disk)

Why it works: - Bloom filters skip most disk reads - LRU cache caches hot blocks - 80/20 rule: 20% of keys = 80% of reads

Performance: 100K+ reads/sec (cached), 10K/sec (disk)

Update-Heavy on Same Keys (Poor Fit)¶

// Counter updates, session state
key "counter" updated 1000 times/sec
    ↓
1000 versions on disk (until compaction)
    ↓
Space amplification: 1000x

Problems: - Many versions accumulate - Compaction can't keep up - Space amplification grows

Better alternative: In-memory store with WAL (like Redis)

Random Small Reads (Mixed)¶

// Random key lookups
no locality, cache hit rate: 10%
    ↓
p95: 5-10ms (most reads hit disk)

Challenges: - Poor cache hit rate - Bloom filter helps, but still slow - Multiple levels to check

Optimization: Increase cache size, use read-ahead

Use Case Deep Dives¶

Use Case 1: Time-Series Database¶

Perfect fit for SSTables

// Metrics ingestion
struct Metric {
    timestamp: u64,    // Natural sort key
    name: String,
    value: f64,
}

// Write path (batched)
let mut builder = SSTableBuilder::new(config).await?;
for metric in batch.iter().sorted_by_key(|m| m.timestamp) {
    let key = format!("{}:{}", metric.name, metric.timestamp);
    builder.add(&Entry::put(key, metric.value.to_bytes())).await?;
}

Why it works: - Timestamp-based keys = sorted by default - Writes are append-only (past doesn't change) - Queries are time-range scans - Compaction merges old data efficiently

Real-world example: InfluxDB, Prometheus

Use Case 2: Event Sourcing¶

Excellent fit

// Append-only event log
struct Event {
    aggregate_id: Uuid,
    sequence: u64,
    event_type: String,
    data: Vec<u8>,
}

// Key: aggregate_id:sequence (sorted)
let key = format!("{}:{:020}", event.aggregate_id, event.sequence);
sstable.add(&Entry::put(key, event.data)).await?;

Benefits: - Immutability matches event sourcing semantics - Natural chronological order - Replay = scan range - Snapshots = SSTable references

Real-world example: Apache Kafka (log segments similar to SSTables)

Use Case 3: LSM Storage Engine¶

Core building block

// User-facing key-value store
pub struct LSM {
    memtable: SkipList,
    l0: Vec<SSTable>,      // Recent, may overlap
    l1: Vec<SSTable>,      // 10x size, no overlap
    l2: Vec<SSTable>,      // 100x size
}

impl LSM {
    pub async fn put(&mut self, key: &[u8], value: &[u8]) {
        self.memtable.insert(key, value);
        if self.memtable.size() > THRESHOLD {
            self.flush().await?;  // → New SSTable
        }
    }
}

Why SSTables: - Fast writes (memtable + flush) - Compaction manages multiple levels - Bloom filters optimize reads - Battle-tested pattern (RocksDB, LevelDB)

Real-world example: NoriKV, CockroachDB, TiKV

Use Case 4: Data Warehouse (Columnar)¶

Modified SSTable format

Traditional SSTable:
  Block: [user:1, alice, 25, NY], [user:2, bob, 30, CA]

Columnar SSTable:
  name column: [alice, bob, charlie, ...]
  age column:  [25, 30, 35, ...]
  state column: [NY, CA, TX, ...]

Benefits for analytics: - Read only needed columns - Better compression (similar values) - SIMD-friendly layout

Real-world example: Parquet files (columnar SSTables)

Performance Expectations¶

Write Performance¶

Workload	Throughput	Latency (p99)
Batch writes	100K-500K/sec	1-5ms (amortized)
Single writes	10K-50K/sec	20-50ms (fsync)
Bulk import	1M+ entries/sec	Batch limited

Key insight: Batch writes for best performance

Read Performance¶

Scenario	Latency (p95)	Throughput
Cache hit	<1ms (often <100µs)	100K+ reads/sec
Bloom filter skip	~67ns	Millions/sec
L0 hit (SSD)	1-5ms	10K-50K/sec
L2 hit (SSD)	5-10ms	5K-10K/sec

Key insight: Cache hit rate dominates performance

Configuration Guidelines¶

Small Dataset (<100MB)¶

SSTableConfig {
    block_size: 4096,
    block_cache_mb: 16,           // Small cache
    compression: Compression::None, // Skip compression
    bloom_bits_per_key: 10,
    ..Default::default()
}

Rationale: Overhead not worth it for small data

Medium Dataset (100MB-10GB)¶

SSTableConfig {
    block_size: 4096,
    block_cache_mb: 256,          // Larger cache
    compression: Compression::Lz4, // Fast compression
    bloom_bits_per_key: 10,
    ..Default::default()
}

Rationale: Balance compression savings vs CPU

Large Dataset (>10GB)¶

SSTableConfig {
    block_size: 16384,            // Larger blocks
    block_cache_mb: 1024,         // GB cache
    compression: Compression::Zstd, // Higher ratio
    bloom_bits_per_key: 12,       // Lower FP rate
    ..Default::default()
}

Rationale: Maximize compression, larger cache amortizes overhead

Anti-Patterns¶

Using SSTables for Mutable Counters¶

// BAD: Frequent updates to same key
for _ in 0..1_000_000 {
    sstable.put(b"counter", current_value.to_bytes()).await?;
}
// Result: 1M versions on disk!

Better: Use in-memory counter, periodic snapshots to SSTable

No Compaction Strategy¶

// BAD: Write SSTables, never compact
for batch in batches {
    write_sstable(batch).await?;
}
// Result: 1000 files, p99 reads check all of them

Better: Schedule regular compaction

Tiny SSTable Files¶

// BAD: Flush every 100 entries
if memtable.len() > 100 {
    flush_to_sstable().await?;
}
// Result: Thousands of tiny files, overhead dominates

Better: Target 1MB-256MB per SSTable

Migration Checklist¶

Considering SSTables for your project? Check these:

Write throughput > 10K/sec required
Append-mostly or time-series data
OK with eventual consistency for deletes (tombstones)
Have background I/O budget for compaction
Dataset > 1MB (otherwise overhead not worth it)
Can tolerate p99 latency of 5-20ms
Cache hit rate expected > 50%

If all checked: SSTables likely a good fit!

Next Steps¶

Understand the design: Read Design Decisions to see why nori-sstable makes specific choices.

Learn the internals: Check How It Works for file format and implementation details.

Optimize performance: See Performance for tuning guides and benchmarks.

Get started: Jump to Getting Started to build your first SSTable.