Comparison of embedded key-value stores for Python

UPDATED: RocksDict was added

I need to store a lot of small text files (~3 billion, growing by about 70 million every day), each between 100 B and a few kB, in my tghub project. There are only 2 requirements:

  • quick access by id (each file has a unique key)
  • storing them as compactly as possible, ideally with compression

Actually, I could build a hierarchical directory structure and just store them in the filesystem (with ZFS on top for compression), but I'm afraid I would waste too much space, because the average file size is only about 1 kB.

Solutions like Cassandra and HBase are overkill for me; I don't need their features at all. Redis isn't suitable because it keeps all data in memory. Let's try embedded solutions:

  1. Sqlite (likely too slow a priori, because of its RDBMS nature)
  2. Sqlitedict (likewise too slow a priori, because it's a wrapper around sqlite)
  3. Pysos
  4. LevelDB
  5. Shelve
  6. Diskcache
  7. Lmdb
  8. RocksDict

Feature comparison

| Name | Thread-safe | Process-safe | Serialization |
| --- | --- | --- | --- |
| pysos | No | No | Custom |
| LevelDB | Yes | No | None |
| Shelve | No | No | Pickle |
| Diskcache | Yes | Yes | Customizable |
| Lmdb | Yes | Yes | None |
| RocksDict | Yes | Yes | Customizable |
  • Lmdb supports concurrent reads, but writes are single-threaded (see the sketch after this list)
  • RocksDict supports concurrent reads via a secondary index
  • RocksDict supports both rocksdb and speedb (the latter is considered an improved version of rocksdb)
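
As an illustration of the Lmdb concurrency model, here is a minimal read-side sketch: any number of processes can open the environment read-only and read concurrently, while writes go through a single write transaction at a time. It assumes the database built in the Lmdb test below and uses 42 as an example key.

python
import lmdb

# Read-only access: many processes may open the environment like this concurrently.
with lmdb.open('/tmp/data/lmdb', readonly=True, lock=False) as env:
    with env.begin() as txn:  # read transaction
        value = txn.get((42).to_bytes(4, 'big'))
        if value is not None:
            print(value.decode())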

Preparation

A script that generates 1,000,000 text files:

python
import os
import json
import random
import string
from datetime import datetime

workspace_dir = '/tmp/data'
output_dir = f'{workspace_dir}/jsons'
os.makedirs(output_dir, exist_ok=True)


def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits + string.punctuation + ' ', k=length))


for i in range(1000000):
    data = {
        'text': generate_random_string(random.randint(0, 2000)),
        'created_at': int(datetime.now().timestamp())
    }
    filename = os.path.join(output_dir, f'{i}.json')
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)

It generates files with this schema:

json
{
  "text": "random string with length from 0 and 2000",
  "created_at": 1727290164
}

We need a generator for reading the prepared files:

python
def json_file_reader(directory):
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path) and filename.endswith('.json'):
            with open(file_path, 'r') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()

It would also be useful to compare results with a sorted (ascending) generator:

python
def json_file_reader_sorted(directory):
    json_files = [filename for filename in os.listdir(directory) if filename.endswith('.json')]
    sorted_files = sorted(json_files, key=lambda x: int(os.path.splitext(x)[0]))
    for filename in sorted_files:
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()

Install the Python libs:

shell
pip install pysos
pip install diskcache
pip install plyvel-ci # for leveldb
pip install lmdb
pip install speedict # RocksDict

Test scripts

Pysos

python
import pysos

pysos_dir = f'{workspace_dir}/pysos'
db = pysos.Dict(pysos_dir)
for id, data in json_file_reader(output_dir):
    db[id] = data

Shelve

python
import shelve

shelve_dir = f'{workspace_dir}/shelve'
with shelve.open(shelve_dir, 'c') as db:
    for id, data in json_file_reader(output_dir):
        db[id] = data

Diskcache

python
import diskcache as dc

diskcache_dir = f'{workspace_dir}/diskcache'
cache = dc.Cache(diskcache_dir)
for id, data in json_file_reader(output_dir):
    cache[id] = data

LevelDB

python
import plyvel

leveldb_dir = f'{workspace_dir}/leveldb'
with plyvel.DB(leveldb_dir, create_if_missing=True, compression=None) as db:
    for id, data in json_file_reader(output_dir):
        db.put(int(id).to_bytes(4, 'big'), data.encode())

LevelDB with compression

python
import plyvel

leveldb_snappy_dir = f'{workspace_dir}/leveldb_snappy'
with plyvel.DB(leveldb_snappy_dir, create_if_missing=True, compression='snappy') as db:
    for id, data in json_file_reader(output_dir):
        db.put(int(id).to_bytes(4, 'big'), data.encode())

Lmdb

python
import lmdb

lmdb_dir = f'{workspace_dir}/lmdb'
# let's reserve 100Gb
with lmdb.open(lmdb_dir, 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader(output_dir):
            txn.put(int(id).to_bytes(4, 'big'), data.encode())

RocksDict

python
from speedict import Rdict

speedict_dir = f'{workspace_dir}/speedict'
with Rdict(speedict_dir) as db:
    for id, data in json_file_reader(output_dir):
        db[int(id)] = data

Compressed version:

python
from rocksdict import Rdict, Options, DBCompressionType

def db_options():
    opt = Options()
    opt.set_compression_type(DBCompressionType.zstd())
    return opt

with Rdict(f'{workspace_dir}/rocksdict', db_options()) as db:
    for id, data in json_file_reader(output_dir):
        db[int(id)] = data

To use speedb, we just need to change the import from rocksdict to speedict.
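
For completeness, the "speedb, sorted, compressed" variant from the results below could look like this. This is a sketch that assumes speedict mirrors the rocksdict API (Options, DBCompressionType), as the drop-in claim above implies; the speedict_compressed directory name is arbitrary.

python
from speedict import Rdict, Options, DBCompressionType

def db_options():
    opt = Options()
    opt.set_compression_type(DBCompressionType.zstd())  # zstd, as in the rocksdict version
    return opt

with Rdict(f'{workspace_dir}/speedict_compressed', db_options()) as db:
    for id, data in json_file_reader_sorted(output_dir):  # sorted keys, as in the fastest runs
        db[int(id)] = data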

Results

I checked the size of each dataset with the terminal command du -sh $dataset.
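
For reference, a quick loop over everything under the workspace directory (the exact set of subdirectories depends on which tests you ran):

shell
for dataset in /tmp/data/*; do
    du -sh "$dataset"
done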

| Name | Occupied space | Execution time |
| --- | --- | --- |
| Raw files | 3.8 GB | 4m 25s |
| One text file | 1.0 GB | - |
| Compressed text file | 820 MB | - |
| Pysos | 1.1 GB | 4m 37s |
| Shelve | - | - |
| Diskcache | 1.0 GB | 7m 29s |
| LevelDB | 1.0 GB | 5m 2s |
| LevelDB (snappy) | 1.0 GB | 5m 16s |
| Lmdb | 1.1 GB | 4m 9s |
| Lmdb (sorted) | 1.5 GB | 1m 27s |
| RocksDict (rocksdb) | 1.0 GB | 4m 26s |
| RocksDict (rocksdb, sorted) | 1.0 GB | 1m 31s |
| RocksDict (rocksdb, sorted, compressed) | 854 MB | 1m 31s |
| RocksDict (speedb) | 1.0 GB | 4m 14s |
| RocksDict (speedb, sorted, compressed) | 854 MB | 1m 39s |
  • Unfortunately, Shelve failed after 18s with HASH: Out of overflow pages. Increase page size.
  • LevelDB takes the same space with and without compression, but the execution time differs.
  • It's expected that Lmdb is bigger than LevelDB: Lmdb uses a B+tree (updates take more space), while the others use an LSM-tree.
  • For compression, I used zstd.
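
The "One text file" and "Compressed text file" baselines from the table can be reproduced with something like the following; this is my guess at the exact commands, since the article doesn't spell them out (find is used instead of a plain glob because a million filenames would exceed the shell's argument limit):

shell
find /tmp/data/jsons -name '*.json' -exec cat {} + > /tmp/data/all.txt
zstd /tmp/data/all.txt -o /tmp/data/all.txt.zst
du -sh /tmp/data/all.txt /tmp/data/all.txt.zst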

Additional Lmdb tuning

Binary format

Let's try Cap'n Proto; it looks promising.

  1. Install the system package, in my case (macOS): brew install capnp
  2. Install the Python package: pip install pycapnp

Now we need a schema:

file: msg.capnp

@0xd9e822aa834af2fe;

struct Msg {
  createdAt @0 :Int64;
  text @1 :Text;
}

Now we can import it and use it in our app:

python
import capnp  # enables pycapnp's import hook so msg.capnp can be loaded as a module
import msg_capnp as schema

lmdb_capnp_dir = f'{workspace_dir}/lmdb_capnp'
# let's reserve 100Gb
with lmdb.open(lmdb_capnp_dir, 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader(output_dir):
            record = json.loads(data)  # avoid shadowing the built-in dict
            msg = schema.Msg.new_message(createdAt=int(record['created_at']), text=record['text'])
            txn.put(int(id).to_bytes(4, 'big'), msg.to_bytes())

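To sanity-check the round trip, the stored bytes can be decoded back with pycapnp's from_bytes; a minimal sketch (key 42 is just an example id):

python
import capnp  # import hook for msg_capnp
import lmdb
import msg_capnp as schema

with lmdb.open('/tmp/data/lmdb_capnp', readonly=True, lock=False) as env:
    with env.begin() as txn:
        raw = txn.get((42).to_bytes(4, 'big'))
        if raw is not None:
            msg = schema.Msg.from_bytes(raw)
            print(msg.createdAt, msg.text[:80])
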
Unfortunately, the database ends up at roughly the same size, ~1.5 GB. That's a bit strange: I was sure it would be much smaller.

Compact

Let's try to compact the db (an analog of VACUUM):

python
import lmdb

# Open the original environment
original_path = '/tmp/data/lmdb'
with lmdb.open(original_path, readonly=True, lock=False) as env:
    compacted_path = '/tmp/data/lmdb_compact'
    env.copy(compacted_path, compact=True)

It takes 9s and produces a database of the same size: 1.5 GB.

Compression

By default, Lmdb doesn't support compression, but we can try zstd on top of it.

shell
tar -cf - lmdb | zstd -o lmdb.tar.zst
/*stdin*\            : 61.72%   (  1.54 GiB =>    971 MiB, lmdb.tar.zst)

Now it's better; using ZFS with zstd could save some space in the future. The size is almost the same as compressing the raw text. P.S. If we compress the rocksdb directory with zstd, we get 836 MiB, which is even better than the internal compression.
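
If you go the ZFS route mentioned above, enabling zstd on a dataset is a one-liner; the pool/dataset names here are placeholders:

shell
# enable zstd compression on the dataset that will hold the database files
zfs set compression=zstd tank/kvstore
# after writing data, check the achieved ratio
zfs get compressratio tank/kvstore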

Summary

In my opinion, Lmdb is the winner. Although I haven't provided detailed results on read performance, in my quick test it was really fast. RocksDict can be an alternative solution.
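
For completeness, the kind of quick random-read check behind that statement could look like the sketch below (sample size and key range are arbitrary; it measures point lookups against the Lmdb database built earlier):

python
import random
import time

import lmdb

with lmdb.open('/tmp/data/lmdb', readonly=True, lock=False) as env:
    keys = [random.randrange(1_000_000) for _ in range(100_000)]
    start = time.perf_counter()
    with env.begin() as txn:
        hits = sum(txn.get(k.to_bytes(4, 'big')) is not None for k in keys)
    elapsed = time.perf_counter() - start
    print(f'{hits} hits, {len(keys) / elapsed:.0f} reads/s')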