
fastskip — Lock-Free Arena-Backed Skip List

Lock-free arena-backed skip list for LSM-tree memtables. Designed for high-throughput write workloads in systems such as RocksDB, LevelDB, and CockroachDB.

License: MIT · Rust 1.66+


Features

| Feature | fastskip | dashmap | std::collections::BTreeMap |
| --- | --- | --- | --- |
| Lock-free writes | Yes | No (shard write lock) | No (global lock) |
| Lock-free reads | Yes | No (shard read lock) | No |
| Snapshot isolation | Yes | No | No |
| Range cursors | Yes | No | Yes |
| Seal/freeze lifecycle | Yes | No | No |
| Arena memory | Yes (per-thread shards) | No (heap) | No |
| Bulk deallocation | Yes | No | No |
| Auto-seal (size/count) | Yes | No | No |
| Backpressure detection | Yes | No | No |
| Batch operations | Yes | Limited | No |

Key Capabilities

  • O(1) allocation — Bump allocation from a per-thread arena shard (skip-list insertion itself is O(log n) expected)
  • O(log n) get — Skip list traversal with small constant
  • Lock-free reads — No locks required for point lookups
  • Snapshot isolation — Point-in-time iteration under concurrent writes
  • Memory efficiency — Per-thread arenas avoid allocation contention
  • Bulk deallocation — Dropping memtable reclaims all memory at once
  • Auto-sealing — Automatic sealing based on memory usage or entry count
  • Backpressure — Detect when nearing capacity limits
  • Batch operations — Efficient bulk insert and multi-get operations

Quick Start

[dependencies]
fastskip = "0.1"

Basic Operations

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();

sl.insert(b"user:1001", b"alice");
sl.insert(b"user:1002", b"bob");

let (val, tombstone) = sl.get(b"user:1001").unwrap();
assert_eq!(val, b"alice");
assert!(!tombstone);

With Auto-Seal Thresholds

use fastskip::ConcurrentSkipList;

// Create with auto-seal at 1MB memory or 100k entries
let sl = ConcurrentSkipList::with_capacity_and_shards(
    64 * 1024,      // 64KB initial arena per shard
    4,              // 4 shards (should match writer thread count)
    1 * 1024 * 1024, // Auto-seal at 1MB total memory
    100_000         // Auto-seal at 100k entries
);

// Check if nearing limits
if !sl.is_under_backpressure() {
    // Safe to insert batch
    let batch = vec![(b"key1", b"val1"), (b"key2", b"val2")];
    sl.insert_batch(&batch).unwrap();
}

// Get memory statistics
println!("Used: {} bytes", sl.memory_usage());
println!("Utilization: {:.1}%", sl.memory_utilization() * 100.0);
println!("Avg key size: {} bytes", sl.avg_key_size() as usize);

Architecture

Memory Model

The skip list uses per-thread arena shards for allocation:

Thread 0 → Arena 0 (bump pointer, no sync)
Thread 1 → Arena 1 (bump pointer, no sync)
Thread 2 → Arena 2 (bump pointer, no sync)
          ↓
    SkipList (atomic tower pointers)

Each writer thread has its own arena shard, so writes never contend on allocation: each allocation is a pure bump-pointer operation. The skip list itself links nodes with atomic compare-and-swap (CAS), keeping inserts lock-free at level 0.
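
A minimal sketch of the bump-allocation idea (illustrative only; these are not fastskip's internal types): each shard owns a block of memory plus a cursor, so carving out node space is a single offset bump with no synchronization.

// Illustrative sketch, not fastskip's internal API.
struct ArenaShard {
    block: Vec<u8>, // backing memory owned by one writer thread
    used: usize,    // bump cursor: offset of the next free byte
}

impl ArenaShard {
    // Carve `size` bytes off the shard, or None when exhausted
    // (the real crate surfaces exhaustion as an insert failure).
    fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let start = self.used;
        let end = start.checked_add(size)?;
        if end > self.block.len() {
            return None;
        }
        self.used = end; // O(1): just advance the cursor
        Some(&mut self.block[start..end])
    }
}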

Node Layout

Nodes are allocated as a single contiguous block:

[ header (32 bytes) ]
[ tower: height × 8-byte links ]
[ key bytes ]
[ value bytes ]

Key and value are stored inline after the tower, enabling zero-copy reads.
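
As a worked example, the total allocation for one node under this layout would be (assuming the 32-byte header and 8-byte links described above; node_size is a hypothetical helper, not a crate API):

fn node_size(height: usize, key_len: usize, value_len: usize) -> usize {
    const HEADER_BYTES: usize = 32; // fixed node header
    const LINK_BYTES: usize = 8;    // one atomic link per tower level
    HEADER_BYTES + height * LINK_BYTES + key_len + value_len
}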


Usage Guide

Basic Operations

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();

// Insert key-value pairs
sl.insert(b"user:1001", b"alice");
sl.insert(b"user:1002", b"bob");

// Point lookup
let (val, is_tombstone) = sl.get(b"user:1001").unwrap();
assert_eq!(val, b"alice");
assert!(!is_tombstone);

// Check if key exists (non-deleted)
assert!(sl.contains_key(b"user:1001"));

// Get or insert atomically
let (_val, is_new) = sl.get_or_insert(b"user:1003", b"charlie");
assert!(is_new);

Delete / Tombstones

Delete marks a key with a tombstone. Deleted keys remain in the skip list until the memtable is flushed to an SSTable.

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"key", b"value");

// Delete returns true if key was found
assert!(sl.delete(b"key"));

// get still returns the key, but with tombstone = true
let (_, tombstone) = sl.get(b"key").unwrap();
assert!(tombstone);

// get_live filters out tombstones
assert_eq!(sl.get_live(b"key"), None);

Iteration

Iterate in sorted key order:

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"apple", b"1");
sl.insert(b"banana", b"2");
sl.insert(b"cherry", b"3");

for entry in sl.iter() {
    println!("{:?} -> {:?}", entry.key, entry.value);
}

LSM Memtable Lifecycle

The intended lifecycle for an LSM-tree storage engine:

use fastskip::ConcurrentSkipList;

// 1. Create active memtable with auto-seal thresholds
let memtable = ConcurrentSkipList::with_capacity_and_shards(
    10 * 1024 * 1024, // 10 MB per shard
    4,                // 4 shards
    50 * 1024 * 1024, // Auto-seal at 50 MB total
    1_000_000         // Auto-seal at 1M entries
);

// 2. Concurrent writers insert/delete
memtable.insert(b"key1", b"val1");
memtable.delete(b"key2");

// 3. Check for backpressure before inserting batch
if !memtable.is_under_backpressure() {
    let batch = vec![(b"key3", b"val3"), (b"key4", b"val4")];
    memtable.insert_batch(&batch).unwrap();
}

// 4. When full (auto-sealed or manual), seal it — returns frozen (for flushing) + fresh (for writes)
let (frozen, fresh) = memtable.seal().unwrap();

// 5. Flush frozen to SSTable (iterate snapshot)
for entry in frozen.iter() {
    if entry.is_tombstone {
        // write tombstone marker to SSTable
    } else {
        // write key-value to SSTable
    }
}

// 6. Drop frozen (reclaims arena memory)
std::mem::drop(frozen);

// 7. Fresh memtable is ready for new writes
fresh.insert(b"key5", b"val5");

// Monitor memory usage
println!("Memory usage: {} bytes", fresh.memory_usage());
println!("Utilization: {:.1}%", fresh.memory_utilization() * 100.0);

Concurrent Reads

Readers are fully lock-free. For consistent point-in-time reads, use a snapshot:

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"a", b"1");
sl.insert(b"b", b"2");

// Snapshot captures sequence number — skips post-snapshot inserts
let snap = sl.snapshot();

// Insert more after snapshot
sl.insert(b"c", b"3");

// Snapshot sees only "a" and "b"
assert_eq!(snap.iter().count(), 2);
// Live iterator sees all three
assert_eq!(sl.iter().count(), 3);

Range Scans

Use cursor_at to seek to a key range:

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"apple", b"1");
sl.insert(b"banana", b"2");
sl.insert(b"cherry", b"3");
sl.insert(b"date", b"4");

// Seek to first key >= "banana"
let mut cursor = sl.cursor_at(b"banana").unwrap();
while let Some(entry) = cursor.next_entry() {
    println!("{:?}", entry.key);
}
// Output: banana, cherry, date

Thread Safety

ConcurrentSkipList is Send + Sync — can be shared across threads:

use fastskip::ConcurrentSkipList;
use std::sync::Arc;
use std::thread;

let sl = Arc::new(ConcurrentSkipList::new());

let sl1 = sl.clone();
let t1 = thread::spawn(move || {
    for i in 0..1000 {
        sl1.insert(format!("key:{}", i).as_bytes(), b"value");
    }
});

let sl2 = sl.clone();
let t2 = thread::spawn(move || {
    for i in 1000..2000 {
        sl2.insert(format!("key:{}", i).as_bytes(), b"value");
    }
});

t1.join().unwrap();
t2.join().unwrap();

assert_eq!(sl.len(), 2000);

Memory Stats

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"key", b"value");

println!("allocated: {} bytes", sl.memory_usage());
println!("reserved: {} bytes", sl.memory_reserved());
println!("utilization: {:.1}%", sl.memory_utilization() * 100.0);
println!("idle: {} bytes", sl.memory_idle());

API Reference

ConcurrentSkipList

The main lock-free skip list. Send + Sync.

Constructors

| Method | Description |
| --- | --- |
| new() | Create with default shard count (CPU count) |
| with_shards(n) | Create with n arena shards |
| with_capacity(bytes) | Create with custom arena capacity |
| with_capacity_and_shards(bytes, n, max_memory, max_entries) | Custom capacity, shard count, and auto-seal thresholds (0 = unlimited), as used in Quick Start |
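
For example (constructor names from the table above):

use fastskip::ConcurrentSkipList;

// Default: one arena shard per CPU
let sl = ConcurrentSkipList::new();
sl.insert(b"k", b"v");

// Pin the shard count to a known writer pool size
let sl8 = ConcurrentSkipList::with_shards(8);
sl8.insert(b"k", b"v");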

Write Operations

| Method | Description |
| --- | --- |
| insert(key, value) | Insert; returns false on OOM or duplicate |
| try_insert(key, value) | Insert; returns Result<(), InsertError> |
| delete(key) | Tombstone delete; returns false if the key was not found |
| get_or_insert(key, value) | Get the existing value or insert atomically |
| insert_batch(entries) | Bulk insert; returns a Result (see Quick Start) |

Read Operations

| Method | Description |
| --- | --- |
| get(key) | Returns Option<(value, is_tombstone)> |
| get_live(key) | Returns None if the key is deleted or missing |
| contains_key(key) | Returns true if the key exists and is not deleted |

Iteration

| Method | Description |
| --- | --- |
| iter() | Live iterator |
| snapshot() | Point-in-time snapshot |
| cursor() | Cursor positioned at the first entry |
| cursor_at(target) | Cursor positioned at the first key >= target |

Stats

| Method | Description |
| --- | --- |
| len() | Number of live entries |
| is_empty() | True if there are no live entries |
| is_sealed() | True if the memtable has been sealed |
| memory_usage() | Arena bytes allocated |
| memory_reserved() | Arena bytes reserved |
| memory_utilization() | Utilization (0.0 to 1.0) |
| memory_idle() | Reserved but unused bytes |
| avg_key_size() | Approximate average key size in bytes |
| avg_value_size() | Approximate average value size in bytes |
| total_inserts() | Total insert attempts, including duplicates and failures |
| max_memory_bytes() | Configured memory limit in bytes (0 = unlimited) |
| max_entries() | Configured entry limit (0 = unlimited) |
| is_under_backpressure() | True when usage reaches 90% of a configured limit |

Example:

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"key", b"value");

println!("allocated: {} bytes", sl.memory_usage());
println!("reserved: {} bytes", sl.memory_reserved());
println!("utilization: {:.1}%", sl.memory_utilization() * 100.0);
println!("idle: {} bytes", sl.memory_idle());
println!("avg key size: {} bytes", sl.avg_key_size() as usize);
println!("avg value size: {} bytes", sl.avg_value_size() as usize);
println!("total insert attempts: {}", sl.total_inserts());
println!("under backpressure: {}", sl.is_under_backpressure());

Cursor

A positioned iterator for range scans.

| Method | Description |
| --- | --- |
| next_entry() | Advance; returns Some(entry), or None at the end |
| entry() | Current entry |
| seek(target) | Seek to the first key >= target |

Implements Iterator<Item = Entry>.
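
For example, one cursor can serve several range probes; the seek semantics here are assumed from the table above (position at the first key >= target, mirroring cursor_at):

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"apple", b"1");
sl.insert(b"banana", b"2");
sl.insert(b"cherry", b"3");

let mut cursor = sl.cursor(); // positioned at the first entry
cursor.seek(b"cherry");       // jump to the first key >= "cherry"
while let Some(entry) = cursor.next_entry() {
    println!("{:?}", entry.key); // prints only "cherry"
}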


Snapshot

A point-in-time view for consistent iteration.

let snap = sl.snapshot();
for entry in snap.iter() {
    // Sees only entries at snapshot time
}

Error Types

pub enum InsertError {
    DuplicateKey,  // Key already exists
    OutOfMemory,   // Arena shard out of memory
}

pub enum SealError {
    AlreadySealed, // Memtable was already sealed
}
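
A sketch of handling these errors through try_insert (assuming both enums are exported from the crate root):

use fastskip::{ConcurrentSkipList, InsertError};

let sl = ConcurrentSkipList::new();
match sl.try_insert(b"key", b"value") {
    Ok(()) => { /* inserted */ }
    Err(InsertError::DuplicateKey) => { /* key (or its tombstone) already present */ }
    Err(InsertError::OutOfMemory) => { /* seal, flush, and retry on the fresh memtable */ }
}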

Design Notes

Insert After Delete

Inserting a key that was tombstoned in the same memtable fails as a duplicate (insert returns false). To re-insert the key, seal the memtable and write to the fresh one.
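
For example:

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"key", b"v1");
sl.delete(b"key");

// The tombstone still occupies the key, so re-insertion is rejected:
assert!(!sl.insert(b"key", b"v2"));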

Empty Keys

User-inserted empty keys (b"") work correctly; the sentinel head node also uses an empty key but is identified by its position rather than by key comparison.
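
For example:

use fastskip::ConcurrentSkipList;

let sl = ConcurrentSkipList::new();
sl.insert(b"", b"value for the empty key");
assert!(sl.contains_key(b""));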

Memory Management

Arena memory is bulk-allocated in blocks and bulk-reclaimed when the memtable is dropped. No per-node allocation/deallocation overhead.


Changelog

v0.1.1 (2026-03-31)

  • Performance: Up to 27% faster concurrent inserts, 20% faster sequential inserts
  • O(1) memory_usage() via an atomic running total (previously O(n) in the shard count)
  • Adaptive backoff on CAS failure to reduce cache-line bouncing
  • Optimized insert_batch(): single sealed check, single arena lookup
  • Bulk u64 header writes in init_node() (3 stores vs 8)
  • Conditional allocation tracking — zero overhead when memory limits aren't configured
  • Release profile: lto = "fat", strip = true

v0.1.0 (2026-03-26)

  • Initial release
  • Lock-free skip list with per-thread arena shards
  • Point lookups, iteration, snapshots
  • Tombstone delete support
  • Seal/freeze lifecycle for LSM memtables
  • Range cursors for prefix scans
  • Memory utilization metrics:
    • memory_reserved() — total arena capacity
    • memory_utilization() — fraction used (0.0-1.0)
    • memory_idle() — unused bytes
  • Production readiness features:
    • Size-based auto-seal (max_memory_bytes threshold)
    • Count-based auto-seal (max_entries threshold)
    • Backpressure detection (is_under_backpressure())
    • Batch insert (insert_batch())
    • Multi-get (get_many())
    • Enhanced memory stats (avg_key_size, avg_value_size, total_inserts)
    • Configuration preserved on seal/fresh memtable creation

License

MIT License — see LICENSE for details.