Comparison of embedded key-value storages for Python
UPDATED: RocksDict was added
I need to store a lot of small text files (~3 billion, growing by ~70M every day), each from 100 B to a few kB in size, in my tghub project. There are only 2 requirements:
- quick access by id (each file has a unique key)
- storing them as compactly as possible, ideally with compression
Actually, I could build a hierarchical directory structure and just store them in the filesystem (with ZFS on top for compression), but I'm afraid I'd waste too much space, because the average file size is only about 1 kB.
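For a rough sense of scale: with a typical 4 KiB allocation block, a ~1 kB file leaves about 3 KiB unused, so 3 billion files could waste on the order of 9 TB before any compression kicks in (a back-of-envelope estimate; the real overhead depends on the filesystem's block/record size).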
Solutions like Cassandra and HBase are overkill for me; I don't need their features at all. Redis isn't suitable because it keeps all data in memory. Let's try embedded solutions:
- SQLite (it would be too slow a priori, because of its RDBMS nature)
- Sqlitedict (it would be too slow a priori, because it's a wrapper around SQLite)
- Pysos
- LevelDB
- Shelve
- Diskcache
- Lmdb
- RocksDict
Feature comparison
Name | Thread-safe | Process-safe | Serialization |
---|---|---|---|
pysos | No | No | Custom |
LevelDB | Yes | No | None |
Shelve | No | No | Pickle |
Diskcache | Yes | Yes | Customizable |
Lmdb | Yes | Yes | None |
RocksDict | Yes | Yes | Customizable |
- Lmdb supports concurrent reads, but writes are single-threaded (see the sketch below)
- RocksDict supports concurrent reads via a secondary instance
- RocksDict supports both rocksdb and speedb (the latter is considered an improved version of rocksdb)
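For example, several threads can read from the same Lmdb environment concurrently. A minimal sketch, assuming an already-built database at /tmp/data/lmdb with the 4-byte big-endian keys used in the tests below:
import lmdb
from concurrent.futures import ThreadPoolExecutor

lmdb_dir = '/tmp/data/lmdb'  # assumed path of an already-built database

def read_one(env, key):
    # each lookup gets its own read-only transaction
    with env.begin() as txn:
        return txn.get(int(key).to_bytes(4, 'big'))

with lmdb.open(lmdb_dir, readonly=True, lock=False) as env:
    with ThreadPoolExecutor(max_workers=8) as pool:
        values = list(pool.map(lambda k: read_one(env, k), range(1000)))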
Preparation
A script that generates 1,000,000 text files:
import os
import json
import random
import string
from datetime import datetime
workspace_dir = '/tmp/data'
output_dir = f'{workspace_dir}/jsons'
os.makedirs(output_dir, exist_ok=True)
def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits + string.punctuation + ' ', k=length))

for i in range(1000000):
    data = {
        'text': generate_random_string(random.randint(0, 2000)),
        'created_at': int(datetime.now().timestamp())
    }
    filename = os.path.join(output_dir, f'{i}.json')
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
It generates files with this schema:
{
"text": "random string with length from 0 and 2000",
"created_at": 1727290164
}
We need a generator for reading prepared files:
def json_file_reader(directory):
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path) and filename.endswith('.json'):
            with open(file_path, 'r') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()
Also, it would be good to compare the results with a sorted (ascending by id) generator:
def json_file_reader_sorted(directory):
    json_files = [filename for filename in os.listdir(directory) if filename.endswith('.json')]
    sorted_files = sorted(json_files, key=lambda x: int(os.path.splitext(x)[0]))
    for filename in sorted_files:
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()
Install python libs:
pip install pysos
pip install diskcache
pip install plyvel-ci # for leveldb
pip install lmdb
pip install speedict # RocksDict
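The execution times in the results below are wall-clock measurements of each whole load loop. A small helper like this is enough to reproduce that kind of measurement (a sketch, not the exact harness; timed and load_fn are illustrative names, not part of any of these libraries):
import time

def timed(load_fn, *args, **kwargs):
    # run a load function and report wall-clock time
    start = time.perf_counter()
    result = load_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f'{load_fn.__name__}: {elapsed:.1f}s')
    return result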
Test scripts
Pysos
import pysos
pysos_dir = f'{workspace_dir}/pysos'
db = pysos.Dict(pysos_dir)
for id, data in json_file_reader(output_dir):
    db[id] = data
Shelve
import shelve
shelve_dir = f'{workspace_dir}/shelve'
with shelve.open(shelve_dir, 'c') as db:
    for id, data in json_file_reader(output_dir):
        db[id] = data
Diskcache
import diskcache as dc
diskcache_dir = f'{workspace_dir}/diskcache'
cache = dc.Cache(diskcache_dir)
for id, data in json_file_reader(output_dir):
    cache[id] = data
LevelDB
import plyvel
leveldb_dir = f'{workspace_dir}/leveldb'
with plyvel.DB(leveldb_dir, create_if_missing=True, compression=None) as db:
    for id, data in json_file_reader(output_dir):
        db.put(int(id).to_bytes(4, 'big'), data.encode())
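Reading a record back by id (the first requirement) is a single point lookup. A quick sketch against the database built above; the id 42 is arbitrary:
db = plyvel.DB(leveldb_dir)
value = db.get(int(42).to_bytes(4, 'big'))  # bytes, or None if the key is absent
if value is not None:
    print(value.decode())
db.close()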
LevelDB with compression
import plyvel
leveldb_snappy_dir = f'{workspace_dir}/leveldb_snappy'
with plyvel.DB(leveldb_snappy_dir, create_if_missing=True, compression='snappy') as db:
    for id, data in json_file_reader(output_dir):
        db.put(int(id).to_bytes(4, 'big'), data.encode())
Lmdb
import lmdb
lmdb_dir = f'{workspace_dir}/lmdb'
# let's reserve 100Gb
with lmdb.open(lmdb_dir, 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader(output_dir):
            txn.put(int(id).to_bytes(4, 'big'), data.encode())
RocksDict
from speedict import Rdict
speedict_dir = f'{workspace_dir}/speedict'
with Rdict(speedict_dir) as db:
    for id, data in json_file_reader(output_dir):
        db[int(id)] = data
compressed version:
from rocksdict import Rdict, Options, DBCompressionType
def db_options():
    opt = Options()
    opt.set_compression_type(DBCompressionType.zstd())
    return opt

with Rdict(f'{workspace_dir}/rocksdict', db_options()) as db:
    for id, data in json_file_reader(output_dir):
        db[int(id)] = data
To use speedb, we just need to change the import from rocksdict to speedict.
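Reads use the same dict-style access. A sketch, continuing from the snippet above (the id 42 is arbitrary):
db = Rdict(f'{workspace_dir}/rocksdict', db_options())  # reopen the database built above
print(db[42])  # dict-style point lookup; assumes the id exists
db.close()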
Results
I checked the size of each dataset with the terminal command du -sh $dataset.
Name | Occupied space | Execution time |
---|---|---|
Raw files | 3.8 GB | 4m 25s |
One text file | 1.0 GB | - |
Compressed text file | 820 MB | - |
Pysos | 1.1 GB | 4m 37s |
Shelve | - | - |
Diskcache | 1.0 GB | 7m 29s |
LevelDB | 1.0 GB | 5m 2s |
LevelDB (snappy) | 1.0 GB | 5m 16s |
Lmdb | 1.1 GB | 4m 9s |
Lmdb (sorted) | 1.5 GB | 1m 27s |
RocksDict (rocksdb) | 1.0 GB | 4m 26s |
RocksDict (rocksdb, sorted) | 1.0 GB | 1m 31s |
RocksDict (rocksdb, sorted, compressed) | 854 MB | 1m 31s |
RocksDict (speedb) | 1.0 GB | 4m 14s |
RocksDict (speedb, sorted, compressed) | 854 MB | 1m 39s |
- Unfortunately, shelve failed after 18s with "HASH: Out of overflow pages. Increase page size".
- LevelDB has the same size with and without compression, but the execution time differs.
- It's expected that Lmdb is bigger than LevelDB: Lmdb uses a B+tree (updates take more space), while the others use an LSM-tree.
- For compression, I used zstd
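The "(sorted)" rows in the table differ only in the generator: they use json_file_reader_sorted, so keys arrive in ascending order. For Lmdb, that load looks roughly like this (a sketch; the lmdb_sorted directory name is illustrative):
with lmdb.open(f'{workspace_dir}/lmdb_sorted', 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader_sorted(output_dir):
            txn.put(int(id).to_bytes(4, 'big'), data.encode())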
Additional Lmdb tuning
Binary format
Let's try Cap'n Proto; it looks promising.
- Install the system package, in my case (macOS):
brew install capnp
- Install the Python package:
pip install pycapnp
Now we need a schema:
file: msg.capnp
@0xd9e822aa834af2fe;
struct Msg {
createdAt @0 :Int64;
text @1 :Text;
}
Now we can import it and use it in our app:
import capnp  # pycapnp: enables the import hook for *_capnp schema modules
import msg_capnp as schema

lmdb_capnp_dir = f'{workspace_dir}/lmdb_capnp'
# let's reserve 100Gb
with lmdb.open(lmdb_capnp_dir, 10 ** 11) as env:
    with env.begin(write=True) as txn:
        for id, data in json_file_reader(output_dir):
            record = json.loads(data)
            msg = schema.Msg.new_message(createdAt=int(record['created_at']), text=record['text'])
            txn.put(int(id).to_bytes(4, 'big'), msg.to_bytes())
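Reading a record back means fetching the raw bytes and decoding them with the same schema. Roughly like this (a sketch; the exact from_bytes calling convention depends on the pycapnp version):
with lmdb.open(lmdb_capnp_dir, readonly=True, lock=False) as env:
    with env.begin() as txn:
        raw = txn.get(int(0).to_bytes(4, 'big'))
        if raw is not None:
            with schema.Msg.from_bytes(raw) as msg:
                print(msg.createdAt, msg.text)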
Unfortunately, our database has the same size, ~1.5 GB. It's a bit strange... I was sure the size would be much smaller.
Compact
Let's try to compact the db (an analog of VACUUM):
import lmdb

# Open the original environment read-only and copy it with compaction
original_path = '/tmp/data/lmdb'
with lmdb.open(original_path, readonly=True, lock=False) as env:
    compacted_path = '/tmp/data/lmdb_compact'
    env.copy(compacted_path, compact=True)
It takes 9s and produces a db of the same size: 1.5 GB.
Compression
Lmdb doesn't support compression by default; however, we can try zstd externally.
tar -cf - lmdb | zstd -o lmdb.tar.zst
/*stdin*\ : 61.72% ( 1.54 GiB => 971 MiB, lmdb.tar.zst)
Now it's better; using ZFS with zstd can save some space in the future. The size is almost the same as compressing the raw text. P.S. If we compress the rocksdb database with zstd, we get 836 MiB, which is even better than the internal compression.
Summary
In my opinion, Lmdb is the winner. Although I haven't provided detailed read-performance results, in my quick test it is really fast. RocksDB can be an alternative solution.
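For reference, a quick random-read check against the Lmdb dataset can look like this (a sketch; 100,000 random point lookups, the sample size is arbitrary):
import random
import time

import lmdb

with lmdb.open('/tmp/data/lmdb', readonly=True, lock=False) as env:
    keys = [random.randrange(1000000) for _ in range(100000)]
    start = time.perf_counter()
    with env.begin() as txn:
        hits = sum(txn.get(k.to_bytes(4, 'big')) is not None for k in keys)
    elapsed = time.perf_counter() - start
    print(f'{hits} hits in {elapsed:.2f}s ({len(keys) / elapsed:.0f} lookups/s)')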