Record Format¶

The on-disk structure of WAL records: header, CRC32C, compression, flags, and byte layout.

Table of contents¶

Overview¶

Every record in nori-wal has a well-defined binary format optimized for:

Space efficiency: Variable-length encoding (varints) for lengths
Data integrity: CRC32C checksum to detect corruption
Flexibility: Flags for tombstones, TTL, and compression
Performance: Fixed-size CRC at the end for fast validation

Byte Layout¶

Here's the complete on-disk format of a record:

┌─────────────────────────────────────────────────────────────┐
│                        Record Format                         │
├─────────────────────────────────────────────────────────────┤
│ klen:        varint (key length)                            │
│ vlen:        varint (value length, compressed if applicable)│
│ flags:       u8 (1 byte)                                    │
│   bit 0:     tombstone (1 = delete, 0 = put)               │
│   bit 1:     ttl_present (1 = has TTL, 0 = no TTL)         │
│   bit 2-3:   compression (00=None, 01=LZ4, 10=Zstd)        │
│   bit 4-7:   reserved (must be 0)                          │
│ ttl_ms:      varint (milliseconds, only if ttl_present=1)  │
│ key:         bytes[klen]                                    │
│ value:       bytes[vlen] (compressed if compression != None)│
│ crc32c:      u32 (little-endian)                           │
└─────────────────────────────────────────────────────────────┘

Size Calculation¶

Minimum record size:

klen=1 (for key length 0) +
vlen=1 (for value length 0) +
flags=1 +
key=0 +
value=0 +
crc32c=4
= 7 bytes

Typical record size (key="user:123", value="alice@example.com"):

klen=1 (8 < 128) +
vlen=1 (17 < 128) +
flags=1 +
key=8 +
value=17 +
crc32c=4
= 32 bytes

Varint Encoding¶

nori-wal uses LEB128 (Little Endian Base 128) for variable-length integers. This encodes small numbers in fewer bytes:

Value	Bytes	Encoding
0	1	`0x00`
127	1	`0x7F`
128	2	`0x80 0x01`
255	2	`0xFF 0x01`
16,383	2	`0xFF 0x7F`
16,384	3	`0x80 0x80 0x01`

Varint Algorithm¶

Encoding:

fn encode_varint(buf: &mut BytesMut, mut value: u64) {
    loop {
        let mut byte = (value & 0x7F) as u8;  // Take lower 7 bits
        value >>= 7;                           // Shift right 7 bits
        if value != 0 {
            byte |= 0x80;                      // Set continuation bit
        }
        buf.put_u8(byte);
        if value == 0 {
            break;
        }
    }
}

Decoding:

fn decode_varint(data: &mut &[u8]) -> Result<u64, RecordError> {
    let mut result = 0u64;
    let mut shift = 0;

    loop {
        if data.is_empty() {
            return Err(RecordError::Incomplete);
        }

        let byte = data[0];
        data.advance(1);

        result |= ((byte & 0x7F) as u64) << shift;

        if byte & 0x80 == 0 {  // No continuation bit
            break;
        }

        shift += 7;
    }

    Ok(result)
}

Why varints? - Most keys and values are small (< 128 bytes) → Only 1 byte overhead - Large keys/values (> 16KB) → Only 3-4 bytes overhead - More efficient than fixed u32 (4 bytes for all sizes)

Flags Byte¶

The flags byte packs multiple boolean and enum fields into a single byte:

┌───────────────────────────────────────┐
│         Flags Byte (1 byte)           │
├───┬───┬───┬───┬───┬───┬───┬───────┐
│ 7 │ 6 │ 5 │ 4 │ 3 │ 2 │ 1 │   0   │
├───┴───┴───┴───┼───┴───┼───┼───────┤
│   Reserved    │ Comp  │TTL│  Tomb │
│   (must be 0) │ ression│   │ stone │
└───────────────┴───────┴───┴───────┘

Bit Meanings¶

Bit(s)	Name	Meaning
0	`TOMBSTONE`	1 = DELETE record, 0 = PUT record
1	`TTL_PRESENT`	1 = TTL field follows, 0 = no TTL
2-3	`COMPRESSION`	`00` = None, `01` = LZ4, `10` = Zstd, `11` = reserved
4-7	Reserved	Must be 0 (reserved for future use)

Examples¶

// PUT record, no TTL, no compression
flags = 0b0000_0000  // 0x00

// DELETE record (tombstone)
flags = 0b0000_0001  // 0x01

// PUT record with TTL
flags = 0b0000_0010  // 0x02

// PUT record with LZ4 compression
flags = 0b0000_0100  // 0x04

// PUT record with TTL and Zstd compression
flags = 0b0000_1010  // 0x0A

CRC32C Checksum¶

nori-wal uses CRC32C (Castagnoli) for data integrity:

Why CRC32C?¶

Fast: Hardware-accelerated on modern CPUs (SSE4.2 instruction)
Strong: Detects all 1-bit and 2-bit errors
Standard: Used by ext4, Btrfs, iSCSI, etc.

CRC Calculation¶

The CRC covers everything except the CRC itself:

Data for CRC = klen + vlen + flags + [ttl_ms] + key + value

CRC32C = crc32c(data)

Example:

Record bytes: [0x08, 0x11, 0x00, ...key..., ...value..., 0xAB, 0xCD, 0xEF, 0x12]
                                                           \_________________/
                                                                  CRC32C
CRC is calculated over:  [0x08, 0x11, 0x00, ...key..., ...value...]

CRC Validation on Read¶

fn verify_crc(data: &[u8], bytes_consumed: usize, cursor: &mut &[u8])
    -> Result<(), RecordError>
{
    if cursor.len() < 4 {
        return Err(RecordError::Incomplete);
    }

    let stored_crc = cursor.get_u32_le();
    let data_for_crc = &data[..bytes_consumed - 4];
    let calculated_crc = crc32c::crc32c(data_for_crc);

    if stored_crc != calculated_crc {
        return Err(RecordError::CrcMismatch {
            expected: stored_crc,
            actual: calculated_crc,
        });
    }

    Ok(())
}

Compression¶

nori-wal supports three compression modes:

Mode	Flag Bits	Use Case
None	`00`	Small values, already compressed data
LZ4	`01`	Fast compression, good for text/JSON
Zstd	`10`	Higher compression ratio, slightly slower

LZ4 Compression¶

LZ4 is optimized for speed with decent compression ratios.

Compressed value format:

┌──────────────────────────────────┐
│ original_size: varint            │
│ compressed_data: bytes           │
└──────────────────────────────────┘

Why prepend original size? - LZ4 decompression requires knowing the output buffer size - Varint is more efficient than fixed u32

Example:

let record = Record::put(b"key", b"value".repeat(100))
    .with_compression(Compression::Lz4);

// Original value: 500 bytes
// Compressed: ~20 bytes + 2 bytes for original size = 22 bytes
// Savings: 96%

Zstd Compression¶

Zstd provides better compression ratios at the cost of slightly more CPU.

Compressed value format:

┌──────────────────────────────────┐
│ compressed_data: bytes           │
└──────────────────────────────────┘

Zstd doesn't need the original size prepended because it's embedded in the compressed stream.

Example:

let record = Record::put(b"key", b"the quick brown fox".repeat(50))
    .with_compression(Compression::Zstd);

// Original value: 950 bytes
// Compressed: ~30 bytes
// Savings: 97%

Compression Trade-offs¶

Aspect	None	LZ4	Zstd
Speed (encode)	Instant	~500 MB/s	~400 MB/s
Speed (decode)	Instant	~2000 MB/s	~600 MB/s
Ratio (text)	1.0x	3-4x	4-5x
Ratio (JSON)	1.0x	4-6x	6-8x
Ratio (random)	1.0x	1.0x	1.0x
CPU overhead	None	Low	Medium

Recommendations: - None: Small values (<100 bytes), binary data, pre-compressed - LZ4: Text, JSON, logs (default for most use cases) - Zstd: Archival, cold storage, maximum compression

Examples: Complete Records¶

Example 1: Simple PUT¶

let record = Record::put(b"user:1", b"alice");

Encoded bytes:

klen:    0x06                    (varint: 6)
vlen:    0x05                    (varint: 5)
flags:   0x00                    (no tombstone, no TTL, no compression)
key:     75 73 65 72 3A 31       ("user:1")
value:   61 6C 69 63 65          ("alice")
crc32c:  AB CD EF 12             (little-endian u32)

Total size: 1 + 1 + 1 + 6 + 5 + 4 = 18 bytes

Example 2: DELETE (Tombstone)¶

let record = Record::delete(b"user:1");

Encoded bytes:

klen:    0x06                    (varint: 6)
vlen:    0x00                    (varint: 0, no value)
flags:   0x01                    (tombstone bit set)
key:     75 73 65 72 3A 31       ("user:1")
value:   (empty)
crc32c:  12 34 56 78

Total size: 1 + 1 + 1 + 6 + 0 + 4 = 13 bytes

Example 3: PUT with TTL¶

let record = Record::put_with_ttl(
    b"session:abc",
    b"data",
    Duration::from_millis(3600000)  // 1 hour
);

Encoded bytes:

klen:    0x0B                    (varint: 11)
vlen:    0x04                    (varint: 4)
flags:   0x02                    (TTL bit set)
ttl_ms:  80 8D 36                (varint: 3600000)
key:     73 65 73 73 69 6F 6E... ("session:abc")
value:   64 61 74 61             ("data")
crc32c:  AB CD EF 12

Total size: 1 + 1 + 1 + 3 + 11 + 4 + 4 = 25 bytes

Example 4: PUT with LZ4 Compression¶

let record = Record::put(b"log", b"ERROR: ".repeat(20))
    .with_compression(Compression::Lz4);

Encoded bytes:

klen:           0x03             (varint: 3)
vlen:           0x10             (varint: 16, compressed size)
flags:          0x04             (compression = 01 << 2)
key:            6C 6F 67          ("log")
value (comp):   78 00 00 00 ...   (original_size=120, then compressed)
crc32c:         AB CD EF 12

Sizes: - Original value: 120 bytes - Compressed value: 16 bytes (13% of original) - Total record: 1 + 1 + 1 + 3 + 16 + 4 = 26 bytes

Decoding Algorithm¶

Here's how nori-wal decodes a record from bytes:

flowchart TD
    A[Start with byte slice] --> B[Decode klen varint]
    B --> C[Decode vlen varint]
    C --> D[Read flags byte]
    D --> E{TTL present?}
    E -->|Yes| F[Decode ttl_ms varint]
    E -->|No| G[Extract key bytes]
    F --> G
    G --> H[Extract value bytes]
    H --> I[Read CRC32C u32]
    I --> J{CRC valid?}
    J -->|No| K[Return CrcMismatch error]
    J -->|Yes| L{Compression?}
    L -->|None| M[Return record]
    L -->|LZ4| N[Decompress with LZ4]
    L -->|Zstd| O[Decompress with Zstd]
    N --> M
    O --> M

    style K fill:#FFB6C1
    style M fill:#90EE90

Error Handling¶

RecordError Types¶

Error	Cause	Recovery
`Incomplete`	Not enough bytes to read	Wait for more data or truncate
`CrcMismatch`	Corrupted data	Truncate at this position
`InvalidCompression`	Unknown compression type	Truncate (forward compatibility issue)
`CompressionFailed`	Compression library error	Log and skip record
`DecompressionFailed`	Corrupted compressed data	Truncate at this position

Handling Corruption¶

match Record::decode(data) {
    Ok((record, size)) => {
        // Valid record, process it
    }
    Err(RecordError::CrcMismatch { expected, actual }) => {
        // Corruption detected, truncate here
        log::warn!("CRC mismatch at position {}: expected {:#x}, got {:#x}",
                   position, expected, actual);
        truncate_at(position)?;
    }
    Err(RecordError::Incomplete) => {
        // Partial write at end of log, truncate
        truncate_at(position)?;
    }
    Err(e) => {
        // Other errors: log and skip or truncate
        log::error!("Failed to decode record: {}", e);
    }
}

Performance Characteristics¶

Encoding Performance¶

Operation: record.encode()
Size: 1KB key + value
Time: ~2μs (no compression)
Time: ~15μs (LZ4 compression)
Time: ~30μs (Zstd compression)

Decoding Performance¶

Operation: Record::decode(&bytes)
Size: 1KB key + value
Time: ~1μs (no compression)
Time: ~0.5μs (LZ4 decompression)
Time: ~2μs (Zstd decompression)

Why is LZ4 decompression faster than no compression? - Less data to copy from disk/memory - LZ4 decompression is extremely fast (~2 GB/s) - For large values, the I/O savings outweigh decompression cost

Design Decisions¶

Why CRC32C instead of CRC32?¶

Hardware acceleration: SSE4.2 on x86, CRC32 instruction on ARM
10-100x faster than software CRC32
Better error detection than CRC32 for certain patterns

Why CRC at the end, not the beginning?¶

Pros of CRC at end: - Can stream-encode without buffering entire record - Can calculate CRC incrementally while writing - Simpler code

Cons: - Must read entire record before validating CRC

For WAL use case (sequential reads), the trade-off favors simplicity.

Why varints instead of fixed-size lengths?¶

Space savings: - Most records have small keys/values (< 128 bytes) - Varint: 1 byte for lengths < 128 - Fixed u32: 4 bytes always - Savings: 75% for typical workloads

Performance: - Varint encoding: ~10ns - Varint decoding: ~8ns - Negligible compared to I/O cost

Why not use protobuf/flatbuffers?¶

Overkill: WAL records are simple, don't need schema evolution
Overhead: Protobuf adds ~20-30% size overhead for small messages
Performance: Custom format is 2-3x faster for this use case
Simplicity: 200 lines of code vs dependency on large library

Forward Compatibility¶

The format is designed for forward compatibility:

Reserved Flag Bits¶

Bits 4-7 are reserved. Future versions can use them for: - Encryption - Different checksum algorithms - Record versioning - Replication metadata

Unknown Compression Types¶

If a newer version writes compression type 11 (reserved), older versions will return InvalidCompression and truncate. This is safe: - Newer records are always at the end (append-only) - Truncating at first unknown record preserves all older data

Reading Older Formats¶

Current format can read all records written by older versions (there are none yet, but the design supports it).

Key Takeaways¶

Compact format: Varints save space for small keys/values
Strong integrity: CRC32C detects all corruption
Flexible: Flags support tombstones, TTL, compression
Fast: Hardware-accelerated CRC, efficient encoding/decoding
Forward-compatible: Reserved bits for future features

What's Next?¶

Now that you understand the record format, explore:

Segment Lifecycle - How segments manage records
Recovery Process - How records are validated during recovery
Concurrency Model - How concurrent writes are handled

Or dive into the actual implementation in crates/nori-wal/src/record.rs on GitHub.