Comparison of embedded key-value storages for Python
UPDATED: RocksDict was added
I need to store a lot of small text files (~3 billion, growing by ~70M every day), each from 100 B to a few kB in size, in my tghub project. There are only 2 requirements:
- quick access by id (each file has a unique key)
- storing them as compactly as possible, ideally with compression
Actually, I could build a hierarchical directory structure and just store them in the filesystem (with ZFS on top for compression), but I'm afraid I'd waste too much space, because the average file size is only about 1 kB.
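For a rough sense of scale: with a typical 4 KiB allocation block, a ~1 kB file leaves about 3 KiB unused, so 3 billion files could waste on the order of 9 TB before any compression kicks in (a back-of-envelope estimate; the real overhead depends on the filesystem's block/record size).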
Solutions like Cassandra and HBase are overkill for me; I don't need their features at all. Redis isn't suitable because it keeps all data in memory. Let's try embedded solutions:
- SQLite (it would be too slow a priori, because of its RDBMS nature)
- Sqlitedict (it would be too slow a priori, because it's a wrapper around SQLite)
- Pysos
- LevelDB
- Shelve
- Diskcache
- Lmdb
- RocksDict
Feature comparison
Name | Thread-safe | Process-safe | Serialization |
---|---|---|---|
pysos | No | No | Custom |
LevelDB | Yes | No | None |
Shelve | No | No | Pickle |
Diskcache | Yes | Yes | Customizable |
Lmdb | Yes | Yes | None |
RocksDict | Yes | Yes | Customizable |
- Lmdb supports concurrent reads, but writes are single-threaded (see the sketch below)
- RocksDict supports concurrent reads via a secondary instance
- RocksDict supports both rocksdb and speedb (the latter is considered an improved version of rocksdb)
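For example, several threads can read from the same Lmdb environment concurrently. A minimal sketch, assuming an already-built database at /tmp/data/lmdb with the 4-byte big-endian keys used in the tests below:
import lmdb
from concurrent.futures import ThreadPoolExecutor

lmdb_dir = '/tmp/data/lmdb'  # assumed path of an already-built database

def read_one(env, key):
    # each lookup gets its own read-only transaction
    with env.begin() as txn:
        return txn.get(int(key).to_bytes(4, 'big'))

with lmdb.open(lmdb_dir, readonly=True, lock=False) as env:
    with ThreadPoolExecutor(max_workers=8) as pool:
        values = list(pool.map(lambda k: read_one(env, k), range(1000)))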
Preparation
A script that generates 1,000,000 text files:
import os
import json
import random
import string
from datetime import datetime
workspace_dir = '/tmp/data'
output_dir = f'{workspace_dir}/jsons'
os.makedirs(output_dir, exist_ok=True)
def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits + string.punctuation + ' ', k=length))

for i in range(1000000):
    data = {
        'text': generate_random_string(random.randint(0, 2000)),
        'created_at': int(datetime.now().timestamp())
    }
    filename = os.path.join(output_dir, f'{i}.json')
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
It generates files with this schema:
{
"text": "random string with length from 0 and 2000",
"created_at": 1727290164
}
We need a generator for reading prepared files:
def json_file_reader(directory):
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path) and filename.endswith('.json'):
            with open(file_path, 'r') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()
Also, it would be good to compare the results with a sorted (ascending by id) generator:
def json_file_reader_sorted(directory):
    json_files = [filename for filename in os.listdir(directory) if filename.endswith('.json')]
    sorted_files = sorted(json_files, key=lambda x: int(os.path.splitext(x)[0]))
    for filename in sorted_files:
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()
Install python libs:
pip install pysos
pip install diskcache
pip install plyvel-ci # for leveldb
pip install lmdb
pip install speedict # RocksDict
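The execution times in the results below are wall-clock measurements of each whole load loop. A small helper like this is enough to reproduce that kind of measurement (a sketch, not the exact harness; timed and load_fn are illustrative names, not part of any of these libraries):
import time

def timed(load_fn, *args, **kwargs):
    # run a load function and report wall-clock time
    start = time.perf_counter()
    result = load_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f'{load_fn.__name__}: {elapsed:.1f}s')
    return result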
Test scripts
Pysos
import pysos
pysos_dir = f'{workspace_dir}/pysos'
db = pysos.Dict(pysos_dir)
for id, data in json_file_reader(output_dir):
    db[id] = data
Shelve
import shelve
shelve_dir = f'{workspace_dir}/shelve'
with shelve.open(shelve_dir, 'c') as db:
    for id, data in json_file_reader(output_dir):
        db[id] = data
Diskcache
import diskcache as dc
diskcache_dir = f'{workspace_dir}/diskcache'
cache = dc.Cache(diskcache_dir)
for id, data in json_file_reader(output_dir):
    cache[id] = data
LevelDB
import plyvel
leveldb_dir = f'{workspace_dir}/leveldb'
with plyvel.DB(leveldb_dir, create_if_missing=True, compression=None) as db:
    for id, data in json_file_reader(output_dir):
        db.put(int(id).to_bytes(4, 'big'), data.encode())
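Reading a record back by id (the first requirement) is a single point lookup. A quick sketch against the database built above; the id 42 is arbitrary:
db = plyvel.DB(leveldb_dir)
value = db.get(int(42).to_bytes(4, 'big'))  # bytes, or None if the key is absent
if value is not None:
    print(value.decode())
db.close()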
LevelDB with compression
import plyvel
leveldb_snappy_dir = f'{workspace_dir}/leveldb_snappy'
with plyvel.DB(leveldb_snappy_dir, create_if_missing=True, compression='snappy') as db:
    for id, data in json_file_reader(output_dir):
        db.put(int(id).to_bytes(4, 'big'), data.encode())
Lmdb
import lmdb
lmdb_dir = f'{workspace_dir}/lmdb'
# let's reserve 100Gb
with lmdb.open(lmdb_dir, 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader(output_dir):
            txn.put(int(id).to_bytes(4, 'big'), data.encode())
RocksDict
from speedict import Rdict
speedict_dir = f'{workspace_dir}/speedict'
with Rdict(speedict_dir) as db:
    for id, data in json_file_reader(output_dir):
        db[int(id)] = data
compressed version:
from rocksdict import Rdict, Options, DBCompressionType
def db_options():
    opt = Options()
    opt.set_compression_type(DBCompressionType.zstd())
    return opt

with Rdict(f'{workspace_dir}/rocksdict', db_options()) as db:
    for id, data in json_file_reader(output_dir):
        db[int(id)] = data
To use speedb, we just need to change the import from rocksdict to speedict.
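Reads use the same dict-style access. A sketch, continuing from the snippet above (the id 42 is arbitrary):
db = Rdict(f'{workspace_dir}/rocksdict', db_options())  # reopen the database built above
print(db[42])  # dict-style point lookup; assumes the id exists
db.close()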
Results
I checked the size of each dataset with the terminal command du -sh $dataset.
Name | Occupied space | Execution time |
---|---|---|
Raw files | 3.8 GB | 4m 25s |
One text file | 1.0 GB | - |
Compressed text file | 820 MB | - |
Pysos | 1.1 GB | 4m 37s |
Shelve | - | - |
Diskcache | 1.0 GB | 7m 29s |
LevelDB | 1.0 GB | 5m 2s |
LevelDB (snappy) | 1.0 GB | 5m 16s |
Lmdb | 1.1 GB | 4m 9s |
Lmdb (sorted) | 1.5 GB | 1m 27s |
RocksDict (rocksdb) | 1.0 GB | 4m 26s |
RocksDict (rocksdb, sorted) | 1.0 GB | 1m 31s |
RocksDict (rocksdb, sorted, compressed) | 854 MB | 1m 31s |
RocksDict (speedb) | 1.0 GB | 4m 14s |
RocksDict (speedb, sorted, compressed) | 854 MB | 1m 39s |
- Unfortunately, shelve failed after 18s with "HASH: Out of overflow pages. Increase page size".
- LevelDB has the same size with and without compression, but the execution time differs.
- It's expected that Lmdb is bigger than LevelDB: Lmdb uses a B+tree (updates take more space), while the others use an LSM-tree.
- For compression, I used zstd
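The "(sorted)" rows in the table differ only in the generator: they use json_file_reader_sorted, so keys arrive in ascending order. For Lmdb, that load looks roughly like this (a sketch; the lmdb_sorted directory name is illustrative):
with lmdb.open(f'{workspace_dir}/lmdb_sorted', 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader_sorted(output_dir):
            txn.put(int(id).to_bytes(4, 'big'), data.encode())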
Additional Lmdb tuning
Binary format
Let's try Cap'n Proto; it looks promising.
- Install the system package, in my case (macOS):
brew install capnp
- Install the Python package:
pip install pycapnp
Now we need a schema:
file: msg.capnp
@0xd9e822aa834af2fe;
struct Msg {
createdAt @0 :Int64;
text @1 :Text;
}
Now we can import it and use it in our app:
import capnp  # pycapnp: enables the import hook for *_capnp schema modules
import msg_capnp as schema

lmdb_capnp_dir = f'{workspace_dir}/lmdb_capnp'
# let's reserve 100Gb
with lmdb.open(lmdb_capnp_dir, 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader(output_dir):
            record = json.loads(data)
            msg = schema.Msg.new_message(createdAt=int(record['created_at']), text=record['text'])
            txn.put(int(id).to_bytes(4, 'big'), msg.to_bytes())
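Reading a record back means fetching the raw bytes and decoding them with the same schema. Roughly like this (a sketch; the exact from_bytes calling convention depends on the pycapnp version):
with lmdb.open(lmdb_capnp_dir, readonly=True, lock=False) as env:
    with env.begin() as txn:
        raw = txn.get(int(0).to_bytes(4, 'big'))
        if raw is not None:
            with schema.Msg.from_bytes(raw) as msg:
                print(msg.createdAt, msg.text)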
Unfortunately, our database has the same size, ~1.5 GB. It's a bit strange... I was sure the size would be much smaller.
Compact
Let's try to compact the db (an analog of VACUUM):
import lmdb

# Open the original environment read-only and copy it with compaction
original_path = '/tmp/data/lmdb'
with lmdb.open(original_path, readonly=True, lock=False) as env:
    compacted_path = '/tmp/data/lmdb_compact'
    env.copy(compacted_path, compact=True)
It takes 9s and produces a db of the same size: 1.5 GB.
Compression
Lmdb doesn't support compression by default; however, we can try zstd externally.
tar -cf - lmdb | zstd -o lmdb.tar.zst
/*stdin*\ : 61.72% ( 1.54 GiB => 971 MiB, lmdb.tar.zst)
Now it's better; using ZFS with zstd can save some space in the future. The size is almost the same as compressing the raw text. P.S. If we compress the rocksdb database with zstd, we get 836 MiB, which is even better than the internal compression.
Summary
In my opinion, Lmdb is the winner. Although I haven't provided detailed read-performance results, in my quick test it is really fast. RocksDB can be an alternative solution.
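For reference, a quick random-read check against the Lmdb dataset can look like this (a sketch; 100,000 random point lookups, the sample size is arbitrary):
import random
import time

import lmdb

with lmdb.open('/tmp/data/lmdb', readonly=True, lock=False) as env:
    keys = [random.randrange(1000000) for _ in range(100000)]
    start = time.perf_counter()
    with env.begin() as txn:
        hits = sum(txn.get(k.to_bytes(4, 'big')) is not None for k in keys)
    elapsed = time.perf_counter() - start
    print(f'{hits} hits in {elapsed:.2f}s ({len(keys) / elapsed:.0f} lookups/s)')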