RocksDB

RocksDB is a high-performance embedded key-value database, originally forked from Google's LevelDB by Facebook (Meta) in 2012. It keeps LevelDB's LSM-tree on-disk design but adds the features LevelDB lacks for production server workloads: column families, multi-threaded compaction, transactions, first-class bloom filters, multiple compression algorithms, TTLs, backups, and a much larger tuning surface. RocksDB powers MyRocks (MySQL), CockroachDB (before its Pebble rewrite), TiKV, Kafka Streams' state store, Apache Flink's keyed state backend, YugabyteDB, and Meta's social-graph storage tier.


1. Overview & Architecture

RocksDB is an embedded library, not a server. Like LevelDB, it stores ordered (key, value) byte-string pairs and uses a Log-Structured Merge tree on disk. Unlike LevelDB, the codebase is deeply optimized for modern hardware (SSDs, NVMe, multi-core CPUs) and includes a large set of production features.

LSM-tree write path (same as LevelDB, with refinements):

- A write first lands in the write-ahead log (WAL) for durability.
- It is then inserted into the in-memory MemTable (a skip list by default).
- A full MemTable becomes immutable and is flushed to a level-0 SSTable.
- Background compaction merges SSTables down through the levels, dropping overwritten and deleted keys.

Why RocksDB exists: LevelDB's compaction is single-threaded, its bloom filters are off by default, and it has no column families, no transactions, and only a small tuning surface. RocksDB rewrote almost every layer to remove those limits while staying (mostly) compatible with LevelDB's on-disk file format.

Key concepts beyond LevelDB:

- Column families: independent keyspaces with per-family tuning inside one database (§5).
- Transactions: optimistic and pessimistic, with snapshot isolation (§6).
- Prefix bloom filters: a configurable prefix extractor that accelerates prefix scans (§7).
- Checkpoints and incremental backups for online copies of a live database (§8).
- Merge operators: deferred read-modify-write for counters and similar structures (§10).


2. Installation

macOS (Homebrew)

brew install rocksdb

# Python binding (C++ bridge)
pip install python-rocksdb       # classic binding, builds against system librocksdb
# or
pip install rocksdict            # newer, prebuilt wheels, simpler API

Ubuntu / Debian

sudo apt-get update
sudo apt-get install -y librocksdb-dev libsnappy-dev liblz4-dev libzstd-dev libbz2-dev

pip install python-rocksdb
# or
pip install rocksdict

Build from source (latest)

git clone https://github.com/facebook/rocksdb.git
cd rocksdb
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DWITH_SNAPPY=ON -DWITH_LZ4=ON -DWITH_ZSTD=ON \
      -DUSE_RTTI=1 ..
make -j"$(nproc)"
sudo make install

Verify the install (Python, rocksdict)

from rocksdict import Rdict

db = Rdict('/tmp/rocks-smoketest')
db[b'hello'] = b'world'
print(db[b'hello'])     # b'world'
db.close()

3. Python Quick Start

Two binding choices — rocksdict (modern, dict-like, prebuilt wheels) and python-rocksdb (classic, more control, requires the C++ library on the system). Examples below use rocksdict for ergonomics.

import json
from rocksdict import Rdict, Options

# Open with explicit options
opts = Options()
opts.create_if_missing(True)
opts.set_compression_type('zstd')

db = Rdict('/tmp/users.rdb', options=opts)

# --- Insert ---
def put_user(user_id: int, record: dict) -> None:
    key = f"user:{user_id:010d}".encode()    # zero-padded for stable ordering
    db[key] = json.dumps(record).encode()

put_user(1001, {"first": "Alice",   "last": "Smith",   "city": "Seattle"})
put_user(1002, {"first": "Bob",     "last": "Johnson", "city": "Portland"})
put_user(1003, {"first": "Charlie", "last": "Davis",   "city": "San Francisco"})

# --- Retrieve ---
raw = db[b'user:0000001002']
print(json.loads(raw))   # {'first': 'Bob', ...}

# --- Existence check ---
print(b'user:0000001002' in db)   # True

# --- Delete ---
del db[b'user:0000001003']

# --- Missing keys ---
try:
    _ = db[b'user:0000009999']
except KeyError:
    print("not found")

db.close()

Classic python-rocksdb API (more verbose, mirrors C++):

import rocksdb

opts = rocksdb.Options()
opts.create_if_missing = True
opts.compression = rocksdb.CompressionType.lz4_compression
opts.write_buffer_size = 64 * 1024 * 1024
opts.max_write_buffer_number = 3

db = rocksdb.DB('/tmp/users.rdb', opts)

db.put(b'user:0000001001', b'{"first":"Alice"}')
print(db.get(b'user:0000001001'))   # b'{"first":"Alice"}'
db.delete(b'user:0000001001')

4. C++ API

The native API. Same shape as LevelDB plus column families, transactions, and many more options.

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>
#include <rocksdb/filter_policy.h>
#include <cassert>
#include <iostream>
#include <memory>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.compression = rocksdb::kZSTD;

    // Bloom filter for fast missing-key lookups
    rocksdb::BlockBasedTableOptions table_opts;
    table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
    table_opts.block_cache = rocksdb::NewLRUCache(512 * 1024 * 1024);  // 512 MiB
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));

    // Multi-threaded compaction
    options.IncreaseParallelism(8);
    options.OptimizeLevelStyleCompaction();

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/users.rdb", &db);
    assert(s.ok());

    db->Put(rocksdb::WriteOptions(), "user:1001",
            R"({"first":"Alice","last":"Smith"})");

    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "user:1001", &value);
    if (s.ok()) std::cout << value << std::endl;

    db->Delete(rocksdb::WriteOptions(), "user:1001");

    delete db;
    return 0;
}

Compile:

g++ -std=c++17 rocks_demo.cpp -o rocks_demo \
    -lrocksdb -lpthread -lsnappy -llz4 -lzstd -lbz2
./rocks_demo

5. Column Families

Column Families (CFs) are independent keyspaces inside one RocksDB database. Each CF has its own MemTable, SSTables, compression settings, and bloom filters — but writes across CFs commit atomically. Think of them as named namespaces or lightweight tables.

from rocksdict import Rdict, Options, ColumnFamily

# Open the DB (when reopening later, any existing non-default CFs must be listed)
db = Rdict('/tmp/multi.rdb')

# Create new CFs
db.create_column_family('users')
db.create_column_family('orders')
db.create_column_family('events')

users  = db.get_column_family('users')
orders = db.get_column_family('orders')

users[b'1001']  = b'{"first":"Alice"}'
orders[b'A100'] = b'{"user":1001,"total":42.50}'

# Atomic write across CFs (via WriteBatch — see §6 and the C++ sketch below)

print(users[b'1001'])    # b'{"first":"Alice"}'
print(orders[b'A100'])

db.close()
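
The atomic cross-CF write flagged above looks like this in the native API. A minimal C++ sketch: it assumes `db`, `users_cf`, and `orders_cf` are the DB and ColumnFamilyHandle pointers obtained when opening the database with its column families.

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>

// One WriteBatch can span column families. Both puts go through the
// shared WAL in a single group commit, so they apply atomically.
rocksdb::WriteBatch batch;
batch.Put(users_cf,  "1001", R"({"first":"Alice"})");
batch.Put(orders_cf, "A100", R"({"user":1001,"total":42.50})");
rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
assert(s.ok());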

Why use CFs?

- Per-keyspace tuning: each CF gets its own compression, bloom filter, and compaction settings.
- Atomic multi-keyspace writes through the shared WAL, with no second database or 2PC layer.
- Cheap bulk deletion: dropping a CF removes its files outright instead of tombstoning every key.
- Unified resources: CFs share the block cache and background threads, so accounting stays in one place.

6. Transactions

RocksDB supports both optimistic and pessimistic transactions with snapshot isolation. Use the TransactionDB (pessimistic) or OptimisticTransactionDB (optimistic) variants.

Pessimistic transaction (locks acquired up front)

#include <rocksdb/utilities/transaction.h>
#include <rocksdb/utilities/transaction_db.h>

rocksdb::TransactionDBOptions txn_opts;
rocksdb::TransactionDB* tdb = nullptr;

rocksdb::Status s = rocksdb::TransactionDB::Open(
    options, txn_opts, "/tmp/txn.rdb", &tdb);

// Begin a transaction
rocksdb::Transaction* txn = tdb->BeginTransaction(rocksdb::WriteOptions());

std::string alice_raw, bob_raw;
txn->GetForUpdate(rocksdb::ReadOptions(), "balance:alice", &alice_raw);  // acquires lock
txn->GetForUpdate(rocksdb::ReadOptions(), "balance:bob",   &bob_raw);    // acquires lock

// Move 100 units from alice to bob
txn->Put("balance:alice", std::to_string(std::stoll(alice_raw) - 100));
txn->Put("balance:bob",   std::to_string(std::stoll(bob_raw)   + 100));

s = txn->Commit();   // or txn->Rollback()
delete txn;

Atomic WriteBatch (no locking, group commit)

from rocksdict import Rdict, WriteBatch

db = Rdict('/tmp/users.rdb')

batch = WriteBatch()
batch.put(b'user:1001', b'{"first":"Alice"}')
batch.put(b'user:1002', b'{"first":"Bob"}')
batch.delete(b'user:1003')

# Commit atomically — all writes apply together or none do
db.write(batch)
db.close()

Optimistic vs. pessimistic. Pessimistic locks early and waits, good for high contention on the same keys. Optimistic doesn't lock; conflicts are detected at commit time and the loser must retry — better when contention is rare.
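
A minimal optimistic-transaction sketch in C++. It assumes the `options` from §4 and an existing ASCII-integer value at the key "counter"; the retry loop is schematic.

#include <rocksdb/utilities/optimistic_transaction_db.h>
#include <cassert>
#include <string>

rocksdb::OptimisticTransactionDB* odb = nullptr;
rocksdb::Status s = rocksdb::OptimisticTransactionDB::Open(options, "/tmp/otxn.rdb", &odb);
assert(s.ok());

for (int attempt = 0; attempt < 5; ++attempt) {
    rocksdb::Transaction* txn = odb->BeginTransaction(rocksdb::WriteOptions());
    std::string v;
    txn->GetForUpdate(rocksdb::ReadOptions(), "counter", &v);  // tracked in read-set, no lock
    txn->Put("counter", std::to_string(std::stoll(v) + 1));
    s = txn->Commit();            // read-set validated against concurrent commits here
    delete txn;
    if (!s.IsBusy()) break;       // Busy means a conflicting writer won; retry
}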


7. Iteration & Prefix Seek

RocksDB iterators are similar to LevelDB's but support prefix bloom filters: if you tell RocksDB the prefix-extraction function up front, prefix scans skip entire SSTables that can't match.

from rocksdict import Rdict

db = Rdict('/tmp/users.rdb')

# Forward scan, all keys
it = db.iter()
it.seek_to_first()
while it.valid():
    print(it.key(), it.value())
    it.next()

# Range: keys in [start, stop)
it = db.iter()
it.seek(b'user:0000001001')
while it.valid() and it.key() < b'user:0000002000':
    print(it.key())
    it.next()

# Reverse scan
it = db.iter()
it.seek_to_last()
while it.valid():
    print(it.key())
    it.prev()

db.close()

Prefix seek requires configuring the prefix extractor in the options:

from rocksdict import Options, SliceTransform

opts = Options()
opts.create_if_missing(True)
# First 5 bytes are the prefix (e.g., "user:" or "ordr:")
opts.set_prefix_extractor(SliceTransform.create_fixed_prefix(5))
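
The native equivalent, plus an actual prefix scan. A C++ sketch: NewFixedPrefixTransform is the counterpart of create_fixed_prefix.

#include <rocksdb/db.h>
#include <rocksdb/slice_transform.h>
#include <memory>

rocksdb::Options options;
options.create_if_missing = true;
// First 5 bytes (e.g. "user:") are the prefix; enables prefix bloom filters.
options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(5));

rocksdb::DB* db = nullptr;
rocksdb::DB::Open(options, "/tmp/users.rdb", &db);

rocksdb::ReadOptions ro;
ro.prefix_same_as_start = true;   // iterator stops once the prefix changes
std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
for (it->Seek("user:"); it->Valid(); it->Next()) {
    // Visits only keys starting with "user:"; SSTables whose prefix
    // bloom filter rules out "user:" are never read from disk.
}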

8. Backup & Checkpoint

Two ways to capture a consistent on-disk copy of a live database:

Checkpoint (hard-linked snapshot)

O(1) creation via hard links to existing SSTables — same disk, instant.

#include <rocksdb/utilities/checkpoint.h>

rocksdb::Checkpoint* cp = nullptr;
rocksdb::Checkpoint::Create(db, &cp);
cp->CreateCheckpoint("/backups/checkpoint-2026-04-25");
delete cp;

BackupEngine (incremental, copies files)

Designed for cross-disk and cross-host backup. Reuses unchanged SSTables across backups.

#include <rocksdb/utilities/backup_engine.h>

rocksdb::BackupEngineOptions bopts("/backups/rocksdb");
rocksdb::BackupEngine* engine = nullptr;
rocksdb::BackupEngine::Open(rocksdb::Env::Default(), bopts, &engine);

engine->CreateNewBackup(db);   // incremental — only new SSTables copied

// List backups
std::vector<rocksdb::BackupInfo> info;
engine->GetBackupInfo(&info);

// Restore
engine->RestoreDBFromLatestBackup("/data/restored", "/data/restored");
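
// Housekeeping (a sketch): keep only the four newest backups, and
// verify the newest one's file sizes/checksums before trusting it.
engine->PurgeOldBackups(4);
if (!info.empty()) {
    engine->VerifyBackup(info.back().backup_id, /*verify_with_checksum=*/true);
}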

delete engine;

9. Tuning & Options

RocksDB has a famously large tuning surface — the official tuning guide is a long read. The most-touched knobs:

rocksdb::Options options;
options.create_if_missing = true;
options.IncreaseParallelism(16);                    // 16 background threads
options.OptimizeLevelStyleCompaction(512L*1024*1024);
options.compression = rocksdb::kZSTD;
options.write_buffer_size = 256 * 1024 * 1024;      // 256 MiB
options.max_write_buffer_number = 4;
options.level0_file_num_compaction_trigger = 8;
options.target_file_size_base = 256 * 1024 * 1024;

rocksdb::BlockBasedTableOptions table_opts;
table_opts.block_cache = rocksdb::NewLRUCache(8L * 1024 * 1024 * 1024);  // 8 GiB
table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
table_opts.cache_index_and_filter_blocks = true;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
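
To see what those knobs are actually doing, enable statistics and read the textual properties. A sketch, assuming the DB has been opened with these options:

#include <rocksdb/statistics.h>
#include <iostream>

options.statistics = rocksdb::CreateDBStatistics();   // tickers + histograms

// Later, on a running DB:
std::string stats;
db->GetProperty("rocksdb.stats", &stats);             // compaction & stall summary
std::cout << stats << "\n"
          << options.statistics->ToString() << std::endl;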

10. Merge Operators

A merge operator lets you queue a deferred update against a key without first reading its current value. RocksDB stores the merge operands and applies them on read or during compaction. Useful for counters, append-only sets, and CRDT-like structures.

// Counter merge operator: increments stored as ASCII integers
class CounterMergeOperator : public rocksdb::AssociativeMergeOperator {
public:
    bool Merge(const rocksdb::Slice& key,
               const rocksdb::Slice* existing_value,
               const rocksdb::Slice& value,
               std::string* new_value,
               rocksdb::Logger* logger) const override {
        int64_t cur = 0;
        if (existing_value) cur = std::stoll(existing_value->ToString());
        int64_t inc = std::stoll(value.ToString());
        *new_value = std::to_string(cur + inc);
        return true;
    }
    const char* Name() const override { return "CounterMergeOperator"; }
};

options.merge_operator.reset(new CounterMergeOperator());

db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");
db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");
db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");

std::string value;
db->Get(rocksdb::ReadOptions(), "page_views:home", &value);
// value == "3" — operands collapsed on read

Merges are O(1) at write time (no read), and operands are coalesced lazily during compaction or at read time. Without a merge operator, the same workload is a read-modify-write that costs an extra disk read per increment.


11. RocksDB vs. LevelDB

Capability                      LevelDB                    RocksDB
------------------------------  -------------------------  --------------------------------
Column families                 No                         Yes
Compaction threads              1                          N (configurable)
Compaction styles               Leveled only               Leveled, Universal, FIFO
Bloom filters                   Manual / off by default    First-class, per-CF
Compression                     Snappy                     Snappy, LZ4, ZSTD, ZLIB, BZIP2
Transactions                    No                         Optimistic + pessimistic
Merge operators                 No                         Yes
TTL                             No                         Yes (TtlDB)
Backup / checkpoint             No                         Yes, incremental
Statistics & metrics            Minimal                    Rich; Prometheus / OTel friendly
Secondary indexes (built-in)    No                         No (compose via key prefixes)
Code size                       ~25k LOC                   ~400k LOC
Tuning surface                  Small                      Vast
Target use                      Embedded simplicity        Production server-side

12. When to Use RocksDB

Strong fits:

- The storage engine inside a server-side system: stream-processing state stores (Kafka Streams, Flink), databases (MyRocks, TiKV), queues, and caches.
- Write-heavy workloads on SSD/NVMe that benefit from the LSM write path and multi-threaded compaction.
- Ordered byte-string keys accessed by point lookups plus prefix/range scans.
- Anything that needs column families, transactions, merge operators, or online backups in-process.

Poor fits:

- Workloads needing SQL, ad-hoc queries, or built-in secondary indexes.
- Multi-process or remote access: RocksDB is an embedded library, not a server, so a service layer must sit in front of it.
- Simple embedded use cases where LevelDB or SQLite suffices and RocksDB's tuning surface is unwanted overhead.


Common Interview Questions:

What does RocksDB add over LevelDB, and why does it matter for production?

Multi-threaded compaction, column families, per-SSTable bloom filters, optimistic and pessimistic transactions, merge operators, multiple compression algorithms (LZ4/ZSTD), TTL support (TtlDB), incremental backups via BackupEngine, hard-linked checkpoints, and a much richer statistics/metrics surface. Each one removes a real production limitation of LevelDB: single-threaded compaction caps throughput on multi-core hosts, missing bloom filters punish point reads on missing keys, the lack of transactions makes correct multi-key updates impossible, and the lack of checkpoints makes online backup hard.
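
The TTL variant, for example, is opened through a utility wrapper. A sketch: the TTL is set per database (or per CF) at open time, and expiry is enforced lazily during compaction rather than at an exact deadline.

#include <rocksdb/utilities/db_ttl.h>

rocksdb::Options options;
options.create_if_missing = true;

rocksdb::DBWithTTL* db = nullptr;
// Keys written through this handle become eligible for deletion
// ~3600 s after their write; compaction drops them when it sees them.
rocksdb::Status s = rocksdb::DBWithTTL::Open(options, "/tmp/ttl.rdb", &db, 3600);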

When would you pick Universal compaction over Leveled?

Universal compaction merges all SSTables of similar size into one, producing fewer, larger files. It writes less in total (lower write amplification) than Leveled, which is great for write-heavy workloads or workloads that don't need fast reads. The trade-off is higher space amplification (you may have 2x your dataset on disk during compaction) and slower point reads, since a key may live in a wider range of files. Pick Universal for ingest-heavy time-series or log workloads; pick Leveled (the default) for OLTP-style mixed read/write.
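Switching styles is an options change. A sketch: max_size_amplification_percent caps the extra disk the strategy may hold (200, the default, matches the ~2x figure above).

rocksdb::Options options;
options.compaction_style = rocksdb::kCompactionStyleUniversal;
// Tolerate at most ~2x the live data size before forcing a full merge.
options.compaction_options_universal.max_size_amplification_percent = 200;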

Why are bloom filters such a big deal in RocksDB?

For a key that doesn't exist, a naive LSM read may have to consult every SSTable on every level to confirm absence — potentially many disk reads. A bloom filter is a compact in-memory probabilistic structure that, given a key, returns "definitely not in this SSTable" or "maybe in this SSTable". With ~10 bits/key the false-positive rate is ~1%, so 99% of missing-key reads avoid disk entirely. RocksDB stores a bloom filter per SSTable (or per partitioned block); the filter blocks themselves are usually pinned in the block cache.
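
Those figures follow from the standard bloom-filter false-positive formula (a quick check, with $m/n$ bits per key and $k$ hash functions):

    p \approx \left(1 - e^{-kn/m}\right)^k, \qquad k_{\mathrm{opt}} = \frac{m}{n} \ln 2

At $m/n = 10$, $k_{\mathrm{opt}} \approx 7$, giving $p \approx 2^{-7} \approx 0.008$, i.e. roughly 1%.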

How do column families differ from separate databases?

Column families share the same WAL and the same instance, so writes across CFs commit atomically via a single WriteBatch — you can't get that across separate databases without a 2PC layer. CFs also share threads and the block cache, so resource accounting is unified. You'd open separate databases instead when you want hard isolation or different on-disk locations / lifecycles.

Walk through an optimistic transaction commit in RocksDB.

The application opens an OptimisticTransactionDB, begins a transaction (which captures a snapshot sequence number), reads and writes through the transaction object without taking locks, and calls Commit(). At commit, RocksDB scans the read-set and verifies that no concurrent committed write modified any of those keys after the snapshot. If a conflict is found, Commit() returns Status::Busy, the transaction is aborted, and the application retries (typically with backoff). This is best for low-contention workloads — under heavy contention, pessimistic locking via TransactionDB usually wins because retries dominate.

What is a merge operator and why does it matter for performance?

A merge operator lets you record a deferred update (e.g. "+1") against a key without first reading the current value. RocksDB persists the merge operand and applies it lazily — either at read time, when all operands are folded into the base value, or during compaction, when adjacent operands are coalesced. This turns a read-modify-write counter into an O(1) write, which is decisive for high-throughput counters, sets, and CRDT-like data structures. The trade-off is that merge logic must be associative and embedded in the operator implementation.

