How to store small text files efficiently
Preparation
Let's create a testing dataset with a lot of text files.
import os
import json
import random
import string
from datetime import datetime

workspace_dir = '/tmp/data'
output_dir = f'{workspace_dir}/jsons'
os.makedirs(output_dir, exist_ok=True)

def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits + string.punctuation + ' ', k=length))

for i in range(1000000):
    data = {
        'text': generate_random_string(random.randint(0, 1000)),
        'created_at': int(datetime.now().timestamp())
    }
    filename = os.path.join(output_dir, f'{i}.json')
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, separators=(',', ':'))
This generates 1 000 000 JSON files with the following structure (for efficiency the files contain no spaces, which is achieved with the separators=(',', ':') parameter of json.dump):
{
  "text": "random string",
  "created_at": 1727625937
}
Let's measure the actual size on disk:
du -sk /tmp/data/jsons
4000000 jsons
It's 4 GB, although the content itself should be <= 1 GB (1 000 000 * 1 KB at most). This is because the filesystem allocates space in 4 KB blocks (I use OS X, but on Linux the situation is the same), so each small file occupies a whole block: 1 000 000 * 4 KB = 4 GB.
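You can see this overhead per file by comparing the logical size with the space actually allocated. A minimal sketch (the path is just an example file from the dataset above; POSIX reports st_blocks in 512-byte units on both macOS and Linux):

import os

st = os.stat('/tmp/data/jsons/0.json')  # any file from the dataset
logical = st.st_size                     # bytes of actual content
allocated = st.st_blocks * 512           # bytes reserved on disk
print(logical, allocated)                # typically something like ~520 vs 4096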
Creating a merged text file
First, we need a generator to iterate over the files:
import os
import json

def json_file_reader(directory):
    json_files = [filename for filename in os.listdir(directory) if filename.endswith('.json')]
    sorted_files = sorted(json_files, key=lambda x: int(os.path.splitext(x)[0]))
    for filename in sorted_files:
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()
I also sort the files by their numeric filename, but this is not strictly necessary.
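A quick sanity check of the generator, assuming the dataset from the preparation step is in place, just peeking at the first record:

# peek at the first (id, raw json) pair produced by the generator
first_id, first_raw = next(json_file_reader(output_dir))
print(first_id, first_raw[:80])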
The merge script. We also need to add the id to the JSON dict, otherwise it would be impossible to tell which source file a record came from:
with open(f'{workspace_dir}/merged.json', 'w', encoding='utf-8') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(json.dumps(record, ensure_ascii=False, separators=(',', ':')) + '\n')
du -k /tmp/data/merged.json
557060 merged.json
Now it is only 557 MB, more than a 7x difference! We no longer waste partially filled blocks. The average record size is ~557 bytes, which matches the content generator random.randint(0, 1000) nicely: ~500 bytes of text on average plus the JSON keys, timestamp and id.
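The merged file is effectively JSON Lines: one record per line, so it can be streamed back without loading everything into memory. A minimal sketch of a reader (read_merged is just an illustrative name):

import json

def read_merged(path):
    # each line is a self-contained JSON record
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

for record in read_merged(f'{workspace_dir}/merged.json'):
    print(record['id'], record['created_at'])
    break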
Binary format
Now let's try an optimized binary structure.
First, we need to serialize the structure:
import struct

def pack(data):
    text_bytes = data['text'].encode('utf-8')
    # header: int id, int created_at, unsigned short text length, followed by the text itself
    format_string = f'iiH{len(text_bytes)}s'
    return struct.pack(format_string, data['id'], data['created_at'], len(text_bytes), text_bytes)
def unpack(data):
    offset = struct.calcsize('iiH')  # 4 + 4 + 2 = 10 bytes of header
    i, ts, l = struct.unpack('iiH', data[:offset])
    text = struct.unpack(f'{l}s', data[offset:offset + l])[0].decode()
    return {
        'id': i,
        'created_at': ts,
        'text': text
    }
# test
packed = pack({"id": 1, "created_at": 1727625937, "text": "Hey!"})
print(f"{packed} -> {unpack(packed)}")
Now create the binary file:
with open(f'{workspace_dir}/merged.struct.bin', 'wb') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(pack(record))
du -k /tmp/data/merged.struct.bin
507908 merged.struct.bin
We have further reduced the file, to ~508 MB.
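To get the records back, the fixed-size header tells us how many bytes of text follow. A sketch of a sequential reader for this format (read_struct_records is an illustrative name, not part of the code above):

import struct

HEADER = struct.Struct('iiH')  # id, created_at, text length -- same layout as pack()

def read_struct_records(path):
    with open(path, 'rb') as f:
        while True:
            header = f.read(HEADER.size)
            if not header:
                break
            record_id, ts, length = HEADER.unpack(header)
            text = f.read(length).decode('utf-8')
            yield {'id': record_id, 'created_at': ts, 'text': text}

print(next(read_struct_records(f'{workspace_dir}/merged.struct.bin')))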
MessagePack
Sometimes it is not possible to have a predefined schema, especially if we want to store arbitrary JSON. Let's try MessagePack.
Install it first: pip install msgpack.
Now generate the merged binary file:
import msgpack

with open(f'{workspace_dir}/merged.msgpack.bin', 'wb') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(msgpack.packb(record))
du -k /tmp/data/merged.msgpack.bin
524292 merged.msgpack.bin
The size is a little bigger than with the custom protocol, ~524 MB, but I think that is a reasonable price for being schemaless. MessagePack also has a very handy tool for random access:
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    file.seek(0)
    unpacker = msgpack.Unpacker(file)
    obj = unpacker.unpack()
    print(obj)
    print(unpacker.tell())  # current offset in the file
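Using tell() we can build a simple id -> offset index once and then fetch any record with a single seek. A sketch, assuming every record carries the 'id' field we added during the merge (record 42 is just an example):

import msgpack

offsets = {}
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    unpacker = msgpack.Unpacker(file)
    pos = 0
    for obj in unpacker:
        offsets[obj['id']] = pos   # byte offset where this record starts
        pos = unpacker.tell()      # byte offset where the next record starts

# later: random access to record 42 with one seek
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    file.seek(offsets[42])
    print(msgpack.Unpacker(file).unpack())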
Compression
Let's go further and try to compress these files with lz4 and zstd:
# the directory with 1 000 000 files
-> time tar -cf - jsons | lz4 - jsons.tar.lz4
Compressed 3360880640 bytes into 628357651 bytes ==> 18.70%
tar -cf - jsons  19.93s user 299.75s system 61% cpu 8:38.46 total
lz4 - jsons.tar.lz4  2.20s user 1.41s system 0% cpu 8:38.46 total
-> time tar -cf - jsons | zstd -o jsons.tar.zst

# the merged text file
-> time lz4 merged.json
Compressed 558739918 bytes into 522708179 bytes ==> 93.55%
lz4 merged.json  0.47s user 0.35s system 152% cpu 0.534 total
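A nice property of compressing the merged text file (as opposed to tarring the whole directory) is that it can still be streamed record by record. A sketch, assuming the lz4 Python package (pip install lz4) and the merged.json.lz4 file produced by the CLI above; lz4.frame reads the frame format that the lz4 CLI writes:

import json
import lz4.frame  # pip install lz4

with lz4.frame.open(f'{workspace_dir}/merged.json.lz4', 'rt', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print(record['id'])
        break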
Results
| name | raw size | reduction | lz4 size | lz4 time | zstd size | zstd time |
|---|---|---|---|---|---|---|
| Files in dir | 4G | 1x | 628M | 8:38 | 464M | 8:55 |
| Merged text file | 557M | 7.18x | 523M | 0.53s | 409M | 1.62s |
| Merged bin file (struct) | 508M | 7.87x | 510M | 0.46s | 410M | 1.47s |
| Merged bin file (msgpack) | 524M | 7.63x | 514M | 0.36s | 411M | 1.6s |