RocksDB

RocksDB is a high-performance embedded key-value database, originally forked from Google's LevelDB by Facebook (Meta) in 2012. It keeps LevelDB's LSM-tree on-disk design but adds the features LevelDB lacks for production server workloads: column families, multi-threaded compaction, transactions, first-class bloom filters, multiple compression algorithms, TTLs, backups, and a much larger tuning surface. RocksDB powers MyRocks (MySQL), CockroachDB (before its Pebble rewrite), TiKV, Kafka Streams' state store, Apache Flink's keyed state backend, YugabyteDB, and Meta's social-graph storage tier.


1. Overview & Architecture

RocksDB is an embedded library, not a server. Like LevelDB, it stores ordered (key, value) byte-string pairs and uses a Log-Structured Merge tree on disk. Unlike LevelDB, the codebase is deeply optimized for modern hardware (SSDs, NVMe, multi-core CPUs) and includes a large set of production features.

LSM-tree write path (same as LevelDB, with refinements):

- A write first lands in the write-ahead log (WAL) for durability.
- It is then inserted into the in-memory MemTable (a skip list by default).
- A full MemTable becomes immutable and is flushed to a level-0 SSTable.
- Background compaction merges SSTables down through the levels, dropping overwritten and deleted keys.

Why RocksDB exists: LevelDB's compaction is single-threaded, its bloom filters are off by default, and it has no column families, no transactions, and only a small tuning surface. RocksDB rewrote almost every layer to remove those limits while staying (mostly) compatible with LevelDB's on-disk file format.

Key concepts beyond LevelDB:

- Column families: independent keyspaces with per-family tuning inside one database (§5).
- Transactions: optimistic and pessimistic, with snapshot isolation (§6).
- Prefix bloom filters: a configurable prefix extractor that accelerates prefix scans (§7).
- Checkpoints and incremental backups for online copies of a live database (§8).
- Merge operators: deferred read-modify-write for counters and similar structures (§10).


2. Installation

macOS (Homebrew)

brew install rocksdb

# Python binding (C++ bridge)
pip install python-rocksdb       # classic binding, builds against system librocksdb
# or
pip install rocksdict            # newer, prebuilt wheels, simpler API

Ubuntu / Debian

sudo apt-get update
sudo apt-get install -y librocksdb-dev libsnappy-dev liblz4-dev libzstd-dev libbz2-dev

pip install python-rocksdb
# or
pip install rocksdict

Build from source (latest)

git clone https://github.com/facebook/rocksdb.git
cd rocksdb
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DWITH_SNAPPY=ON -DWITH_LZ4=ON -DWITH_ZSTD=ON \
      -DUSE_RTTI=1 ..
make -j"$(nproc)"
sudo make install

Verify the install (Python, rocksdict)

from rocksdict import Rdict

db = Rdict('/tmp/rocks-smoketest')
db[b'hello'] = b'world'
print(db[b'hello'])     # b'world'
db.close()

3. Python Quick Start

Two binding choices — rocksdict (modern, dict-like, prebuilt wheels) and python-rocksdb (classic, more control, requires the C++ library on the system). Examples below use rocksdict for ergonomics.

import json
from rocksdict import Rdict, Options

# Open with explicit options
opts = Options()
opts.create_if_missing(True)
opts.set_compression_type('zstd')

db = Rdict('/tmp/users.rdb', options=opts)

# --- Insert ---
def put_user(user_id: int, record: dict) -> None:
    key = f"user:{user_id:010d}".encode()    # zero-padded for stable ordering
    db[key] = json.dumps(record).encode()

put_user(1001, {"first": "Alice",   "last": "Smith",   "city": "Seattle"})
put_user(1002, {"first": "Bob",     "last": "Johnson", "city": "Portland"})
put_user(1003, {"first": "Charlie", "last": "Davis",   "city": "San Francisco"})

# --- Retrieve ---
raw = db[b'user:0000001002']
print(json.loads(raw))   # {'first': 'Bob', ...}

# --- Existence check ---
print(b'user:0000001002' in db)   # True

# --- Delete ---
del db[b'user:0000001003']

# --- Missing keys ---
try:
    _ = db[b'user:0000009999']
except KeyError:
    print("not found")

db.close()

Classic python-rocksdb API (more verbose, mirrors C++):

import rocksdb

opts = rocksdb.Options()
opts.create_if_missing = True
opts.compression = rocksdb.CompressionType.lz4_compression
opts.write_buffer_size = 64 * 1024 * 1024
opts.max_write_buffer_number = 3

db = rocksdb.DB('/tmp/users.rdb', opts)

db.put(b'user:0000001001', b'{"first":"Alice"}')
print(db.get(b'user:0000001001'))   # b'{"first":"Alice"}'
db.delete(b'user:0000001001')

4. C++ API

The native API. Same shape as LevelDB plus column families, transactions, and many more options.

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>
#include <rocksdb/filter_policy.h>
#include <cassert>
#include <iostream>
#include <memory>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.compression = rocksdb::kZSTD;

    // Bloom filter for fast missing-key lookups
    rocksdb::BlockBasedTableOptions table_opts;
    table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
    table_opts.block_cache = rocksdb::NewLRUCache(512 * 1024 * 1024);  // 512 MiB
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));

    // Multi-threaded compaction
    options.IncreaseParallelism(8);
    options.OptimizeLevelStyleCompaction();

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/users.rdb", &db);
    assert(s.ok());

    db->Put(rocksdb::WriteOptions(), "user:1001",
            R"({"first":"Alice","last":"Smith"})");

    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "user:1001", &value);
    if (s.ok()) std::cout << value << std::endl;

    db->Delete(rocksdb::WriteOptions(), "user:1001");

    delete db;
    return 0;
}

Compile:

g++ -std=c++17 rocks_demo.cpp -o rocks_demo \
    -lrocksdb -lpthread -lsnappy -llz4 -lzstd -lbz2
./rocks_demo

5. Column Families

Column Families (CFs) are independent keyspaces inside one RocksDB database. Each CF has its own MemTable, SSTables, compression settings, and bloom filters — but writes across CFs commit atomically. Think of them as named namespaces or lightweight tables.

from rocksdict import Rdict, Options, ColumnFamily

# Open the DB (when reopening later, any existing non-default CFs must be listed)
db = Rdict('/tmp/multi.rdb')

# Create new CFs
db.create_column_family('users')
db.create_column_family('orders')
db.create_column_family('events')

users  = db.get_column_family('users')
orders = db.get_column_family('orders')

users[b'1001']  = b'{"first":"Alice"}'
orders[b'A100'] = b'{"user":1001,"total":42.50}'

# Atomic write across CFs (via WriteBatch — see §6 and the C++ sketch below)

print(users[b'1001'])    # b'{"first":"Alice"}'
print(orders[b'A100'])

db.close()
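
The atomic cross-CF write flagged above looks like this in the native API. A minimal C++ sketch: it assumes `db`, `users_cf`, and `orders_cf` are the DB and ColumnFamilyHandle pointers obtained when opening the database with its column families.

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>

// One WriteBatch can span column families. Both puts go through the
// shared WAL in a single group commit, so they apply atomically.
rocksdb::WriteBatch batch;
batch.Put(users_cf,  "1001", R"({"first":"Alice"})");
batch.Put(orders_cf, "A100", R"({"user":1001,"total":42.50})");
rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
assert(s.ok());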

Why use CFs?

- Per-keyspace tuning: each CF gets its own compression, bloom filter, and compaction settings.
- Atomic multi-keyspace writes through the shared WAL, with no second database or 2PC layer.
- Cheap bulk deletion: dropping a CF removes its files outright instead of tombstoning every key.
- Unified resources: CFs share the block cache and background threads, so accounting stays in one place.

6. Transactions

RocksDB supports both optimistic and pessimistic transactions with snapshot isolation. Use the TransactionDB (pessimistic) or OptimisticTransactionDB (optimistic) variants.

Pessimistic transaction (locks acquired up front)

#include <rocksdb/utilities/transaction.h>
#include <rocksdb/utilities/transaction_db.h>

rocksdb::TransactionDBOptions txn_opts;
rocksdb::TransactionDB* tdb = nullptr;

rocksdb::Status s = rocksdb::TransactionDB::Open(
    options, txn_opts, "/tmp/txn.rdb", &tdb);

// Begin a transaction
rocksdb::Transaction* txn = tdb->BeginTransaction(rocksdb::WriteOptions());

std::string alice_raw, bob_raw;
txn->GetForUpdate(rocksdb::ReadOptions(), "balance:alice", &alice_raw);  // acquires lock
txn->GetForUpdate(rocksdb::ReadOptions(), "balance:bob",   &bob_raw);    // acquires lock

// Move 100 units from alice to bob
txn->Put("balance:alice", std::to_string(std::stoll(alice_raw) - 100));
txn->Put("balance:bob",   std::to_string(std::stoll(bob_raw)   + 100));

s = txn->Commit();   // or txn->Rollback()
delete txn;

Atomic WriteBatch (no locking, group commit)

from rocksdict import Rdict, WriteBatch

db = Rdict('/tmp/users.rdb')

batch = WriteBatch()
batch.put(b'user:1001', b'{"first":"Alice"}')
batch.put(b'user:1002', b'{"first":"Bob"}')
batch.delete(b'user:1003')

# Commit atomically — all writes apply together or none do
db.write(batch)
db.close()

Optimistic vs. pessimistic. Pessimistic locks early and waits, good for high contention on the same keys. Optimistic doesn't lock; conflicts are detected at commit time and the loser must retry — better when contention is rare.
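
A minimal optimistic-transaction sketch in C++. It assumes the `options` from §4 and an existing ASCII-integer value at the key "counter"; the retry loop is schematic.

#include <rocksdb/utilities/optimistic_transaction_db.h>
#include <cassert>
#include <string>

rocksdb::OptimisticTransactionDB* odb = nullptr;
rocksdb::Status s = rocksdb::OptimisticTransactionDB::Open(options, "/tmp/otxn.rdb", &odb);
assert(s.ok());

for (int attempt = 0; attempt < 5; ++attempt) {
    rocksdb::Transaction* txn = odb->BeginTransaction(rocksdb::WriteOptions());
    std::string v;
    txn->GetForUpdate(rocksdb::ReadOptions(), "counter", &v);  // tracked in read-set, no lock
    txn->Put("counter", std::to_string(std::stoll(v) + 1));
    s = txn->Commit();            // read-set validated against concurrent commits here
    delete txn;
    if (!s.IsBusy()) break;       // Busy means a conflicting writer won; retry
}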


7. Iteration & Prefix Seek

RocksDB iterators are similar to LevelDB's but support prefix bloom filters: if you tell RocksDB the prefix-extraction function up front, prefix scans skip entire SSTables that can't match.

from rocksdict import Rdict

db = Rdict('/tmp/users.rdb')

# Forward scan, all keys
it = db.iter()
it.seek_to_first()
while it.valid():
    print(it.key(), it.value())
    it.next()

# Range: keys in [start, stop)
it = db.iter()
it.seek(b'user:0000001001')
while it.valid() and it.key() < b'user:0000002000':
    print(it.key())
    it.next()

# Reverse scan
it = db.iter()
it.seek_to_last()
while it.valid():
    print(it.key())
    it.prev()

db.close()

Prefix seek requires configuring the prefix extractor in the options:

from rocksdict import Options, SliceTransform

opts = Options()
opts.create_if_missing(True)
# First 5 bytes are the prefix (e.g., "user:" or "ordr:")
opts.set_prefix_extractor(SliceTransform.create_fixed_prefix(5))
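
The native equivalent, plus an actual prefix scan. A C++ sketch: NewFixedPrefixTransform is the counterpart of create_fixed_prefix.

#include <rocksdb/db.h>
#include <rocksdb/slice_transform.h>
#include <memory>

rocksdb::Options options;
options.create_if_missing = true;
// First 5 bytes (e.g. "user:") are the prefix; enables prefix bloom filters.
options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(5));

rocksdb::DB* db = nullptr;
rocksdb::DB::Open(options, "/tmp/users.rdb", &db);

rocksdb::ReadOptions ro;
ro.prefix_same_as_start = true;   // iterator stops once the prefix changes
std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
for (it->Seek("user:"); it->Valid(); it->Next()) {
    // Visits only keys starting with "user:"; SSTables whose prefix
    // bloom filter rules out "user:" are never read from disk.
}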

8. Backup & Checkpoint

Two ways to capture a consistent on-disk copy of a live database:

Checkpoint (hard-linked snapshot)

O(1) creation via hard links to existing SSTables — same disk, instant.

#include <rocksdb/utilities/checkpoint.h>

rocksdb::Checkpoint* cp = nullptr;
rocksdb::Checkpoint::Create(db, &cp);
cp->CreateCheckpoint("/backups/checkpoint-2026-04-25");
delete cp;

BackupEngine (incremental, copies files)

Designed for cross-disk and cross-host backup. Reuses unchanged SSTables across backups.

#include <rocksdb/utilities/backup_engine.h>

rocksdb::BackupEngineOptions bopts("/backups/rocksdb");
rocksdb::BackupEngine* engine = nullptr;
rocksdb::BackupEngine::Open(rocksdb::Env::Default(), bopts, &engine);

engine->CreateNewBackup(db);   // incremental — only new SSTables copied

// List backups
std::vector<rocksdb::BackupInfo> info;
engine->GetBackupInfo(&info);

// Restore
engine->RestoreDBFromLatestBackup("/data/restored", "/data/restored");
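
// Housekeeping (a sketch): keep only the four newest backups, and
// verify the newest one's file sizes/checksums before trusting it.
engine->PurgeOldBackups(4);
if (!info.empty()) {
    engine->VerifyBackup(info.back().backup_id, /*verify_with_checksum=*/true);
}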

delete engine;

9. Tuning & Options

RocksDB has a famously large tuning surface — the official tuning guide is a long read. The most-touched knobs:

rocksdb::Options options;
options.create_if_missing = true;
options.IncreaseParallelism(16);                    // 16 background threads
options.OptimizeLevelStyleCompaction(512L*1024*1024);
options.compression = rocksdb::kZSTD;
options.write_buffer_size = 256 * 1024 * 1024;      // 256 MiB
options.max_write_buffer_number = 4;
options.level0_file_num_compaction_trigger = 8;
options.target_file_size_base = 256 * 1024 * 1024;

rocksdb::BlockBasedTableOptions table_opts;
table_opts.block_cache = rocksdb::NewLRUCache(8L * 1024 * 1024 * 1024);  // 8 GiB
table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
table_opts.cache_index_and_filter_blocks = true;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
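
To see what those knobs are actually doing, enable statistics and read the textual properties. A sketch, assuming the DB has been opened with these options:

#include <rocksdb/statistics.h>
#include <iostream>

options.statistics = rocksdb::CreateDBStatistics();   // tickers + histograms

// Later, on a running DB:
std::string stats;
db->GetProperty("rocksdb.stats", &stats);             // compaction & stall summary
std::cout << stats << "\n"
          << options.statistics->ToString() << std::endl;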

10. Merge Operators

A merge operator lets you queue a deferred update against a key without first reading its current value. RocksDB stores the merge operands and applies them on read or during compaction. Useful for counters, append-only sets, and CRDT-like structures.

// Counter merge operator: increments stored as ASCII integers
class CounterMergeOperator : public rocksdb::AssociativeMergeOperator {
public:
    bool Merge(const rocksdb::Slice& key,
               const rocksdb::Slice* existing_value,
               const rocksdb::Slice& value,
               std::string* new_value,
               rocksdb::Logger* logger) const override {
        int64_t cur = 0;
        if (existing_value) cur = std::stoll(existing_value->ToString());
        int64_t inc = std::stoll(value.ToString());
        *new_value = std::to_string(cur + inc);
        return true;
    }
    const char* Name() const override { return "CounterMergeOperator"; }
};

options.merge_operator.reset(new CounterMergeOperator());

db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");
db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");
db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");

std::string value;
db->Get(rocksdb::ReadOptions(), "page_views:home", &value);
// value == "3" — operands collapsed on read

Merges are O(1) at write time (no read), and operands are coalesced lazily during compaction or at read time. Without a merge operator, the same workload is a read-modify-write that costs an extra disk read per increment.


11. RocksDB vs. LevelDB

Capability                      LevelDB                    RocksDB
------------------------------  -------------------------  --------------------------------
Column families                 No                         Yes
Compaction threads              1                          N (configurable)
Compaction styles               Leveled only               Leveled, Universal, FIFO
Bloom filters                   Manual / off by default    First-class, per-CF
Compression                     Snappy                     Snappy, LZ4, ZSTD, ZLIB, BZIP2
Transactions                    No                         Optimistic + pessimistic
Merge operators                 No                         Yes
TTL                             No                         Yes (TtlDB)
Backup / checkpoint             No                         Yes, incremental
Statistics & metrics            Minimal                    Rich; Prometheus / OTel friendly
Secondary indexes (built-in)    No                         No (compose via key prefixes)
Code size                       ~25k LOC                   ~400k LOC
Tuning surface                  Small                      Vast
Target use                      Embedded simplicity        Production server-side

12. When to Use RocksDB

Strong fits:

- The storage engine inside a server-side system: stream-processing state stores (Kafka Streams, Flink), databases (MyRocks, TiKV), queues, and caches.
- Write-heavy workloads on SSD/NVMe that benefit from the LSM write path and multi-threaded compaction.
- Ordered byte-string keys accessed by point lookups plus prefix/range scans.
- Anything that needs column families, transactions, merge operators, or online backups in-process.

Poor fits:

- Workloads needing SQL, ad-hoc queries, or built-in secondary indexes.
- Multi-process or remote access: RocksDB is an embedded library, not a server, so a service layer must sit in front of it.
- Simple embedded use cases where LevelDB or SQLite suffices and RocksDB's tuning surface is unwanted overhead.


Common Interview Questions:

What does RocksDB add over LevelDB, and why does it matter for production?

Multi-threaded compaction, column families, per-SSTable bloom filters, optimistic and pessimistic transactions, merge operators, multiple compression algorithms (LZ4/ZSTD), TTL support (TtlDB), incremental backups via BackupEngine, hard-linked checkpoints, and a much richer statistics/metrics surface. Each one removes a real production limitation of LevelDB: single-threaded compaction caps throughput on multi-core hosts, missing bloom filters punish point reads on missing keys, the lack of transactions makes correct multi-key updates impossible, and the lack of checkpoints makes online backup hard.
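
The TTL variant, for example, is opened through a utility wrapper. A sketch: the TTL is set per database (or per CF) at open time, and expiry is enforced lazily during compaction rather than at an exact deadline.

#include <rocksdb/utilities/db_ttl.h>

rocksdb::Options options;
options.create_if_missing = true;

rocksdb::DBWithTTL* db = nullptr;
// Keys written through this handle become eligible for deletion
// ~3600 s after their write; compaction drops them when it sees them.
rocksdb::Status s = rocksdb::DBWithTTL::Open(options, "/tmp/ttl.rdb", &db, 3600);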

When would you pick Universal compaction over Leveled?

Universal compaction merges all SSTables of similar size into one, producing fewer, larger files. It writes less in total (lower write amplification) than Leveled, which is great for write-heavy workloads or workloads that don't need fast reads. The trade-off is higher space amplification (you may have 2x your dataset on disk during compaction) and slower point reads, since a key may live in a wider range of files. Pick Universal for ingest-heavy time-series or log workloads; pick Leveled (the default) for OLTP-style mixed read/write.
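Switching styles is an options change. A sketch: max_size_amplification_percent caps the extra disk the strategy may hold (200, the default, matches the ~2x figure above).

rocksdb::Options options;
options.compaction_style = rocksdb::kCompactionStyleUniversal;
// Tolerate at most ~2x the live data size before forcing a full merge.
options.compaction_options_universal.max_size_amplification_percent = 200;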

Why are bloom filters such a big deal in RocksDB?

For a key that doesn't exist, a naive LSM read may have to consult every SSTable on every level to confirm absence — potentially many disk reads. A bloom filter is a compact in-memory probabilistic structure that, given a key, returns "definitely not in this SSTable" or "maybe in this SSTable". With ~10 bits/key the false-positive rate is ~1%, so 99% of missing-key reads avoid disk entirely. RocksDB stores a bloom filter per SSTable (or per partitioned block); the filter blocks themselves are usually pinned in the block cache.
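
Those figures follow from the standard bloom-filter false-positive formula (a quick check, with $m/n$ bits per key and $k$ hash functions):

    p \approx \left(1 - e^{-kn/m}\right)^k, \qquad k_{\mathrm{opt}} = \frac{m}{n} \ln 2

At $m/n = 10$, $k_{\mathrm{opt}} \approx 7$, giving $p \approx 2^{-7} \approx 0.008$, i.e. roughly 1%.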

How do column families differ from separate databases?

Column families share the same WAL and the same instance, so writes across CFs commit atomically via a single WriteBatch — you can't get that across separate databases without a 2PC layer. CFs also share threads and the block cache, so resource accounting is unified. You'd open separate databases instead when you want hard isolation or different on-disk locations / lifecycles.

Walk through an optimistic transaction commit in RocksDB.

The application opens an OptimisticTransactionDB, begins a transaction (which captures a snapshot sequence number), reads and writes through the transaction object without taking locks, and calls Commit(). At commit, RocksDB scans the read-set and verifies that no concurrent committed write modified any of those keys after the snapshot. If a conflict is found, Commit() returns Status::Busy, the transaction is aborted, and the application retries (typically with backoff). This is best for low-contention workloads — under heavy contention, pessimistic locking via TransactionDB usually wins because retries dominate.

What is a merge operator and why does it matter for performance?

A merge operator lets you record a deferred update (e.g. "+1") against a key without first reading the current value. RocksDB persists the merge operand and applies it lazily — either at read time, when all operands are folded into the base value, or during compaction, when adjacent operands are coalesced. This turns a read-modify-write counter into an O(1) write, which is decisive for high-throughput counters, sets, and CRDT-like data structures. The trade-off is that merge logic must be associative and embedded in the operator implementation.

