How to store small text files efficiently

Preparation

Let's create a testing dataset with a lot of text files.

python
import os
import json
import random
import string
from datetime import datetime

workspace_dir = '/tmp/data'
output_dir = f'{workspace_dir}/jsons'
os.makedirs(output_dir, exist_ok=True)


def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits + string.punctuation + ' ', k=length))


for i in range(1000000):
    data = {
        'text': generate_random_string(random.randint(0, 1000)),
        'created_at': int(datetime.now().timestamp())
    }
    filename = os.path.join(output_dir, f'{i}.json')
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, separators=(',', ':'))

This generates 1,000,000 JSON files with the following structure (to keep them compact, the files contain no whitespace, which is achieved with the separators=(',', ':') parameter of json.dump):

json
{
  "text": "random string",
  "created_at": 1727625937
}

Let's measure the actual size on disk:

shell
du -sk /tmp/data/jsons
4000000 jsons

It's 4 GB, although the actual content should be at most 1 GB (1,000,000 × ≤1 KB). The reason is that the filesystem allocates space in 4 KB blocks (I use macOS, but on Linux the situation is the same), so each small file occupies at least one full block: 1,000,000 × 4 KB = 4 GB.
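
To see this overhead on a single file, we can compare its logical size with the space it actually occupies on disk. A quick check (st_blocks is reported in 512-byte units and is available on macOS and Linux):

python
import os

# Compare the logical size of one generated file with the space it
# actually occupies on disk (st_blocks counts 512-byte units).
st = os.stat(f'{output_dir}/0.json')
print('logical size :', st.st_size, 'bytes')
print('size on disk :', st.st_blocks * 512, 'bytes')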

Creating a merged text file

First, we need a generator to iterate over the files:

python
import os
import json


def json_file_reader(directory):
    json_files = [filename for filename in os.listdir(directory) if filename.endswith('.json')]
    sorted_files = sorted(json_files, key=lambda x: int(os.path.splitext(x)[0]))
    for filename in sorted_files:
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()

I also sort the files by their numeric filename, although this is not strictly necessary.

Now the merge script. We also need to add the id to each record, otherwise it would be impossible to tell which source file it came from:

python
with open(f'{workspace_dir}/merged.json', 'w', encoding='utf-8') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(json.dumps(record, ensure_ascii=False, separators=(',', ':')) + '\n')
shell
du -k /tmp/data/merged.json 
557060	merged.json

Now it is only 557 MB, more than a 7x difference! We no longer waste space on partially filled blocks. The average record size is about 557 bytes, which matches our content generator random.randint(0, 1000) well.
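
Since each line of the merged file is a standalone JSON document (essentially the JSON Lines format), reading the records back is a simple line-by-line loop. A minimal sketch; merged_json_reader is just an illustrative helper name:

python
import json


def merged_json_reader(path):
    # Each line of the merged file is one JSON record with an 'id' field.
    with open(path, 'r', encoding='utf-8') as merged_file:
        for line in merged_file:
            yield json.loads(line)


# Example: print the first record
for record in merged_json_reader(f'{workspace_dir}/merged.json'):
    print(record)
    break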

Binary format

Now let's try an optimized binary format.

First, we need functions to serialize and deserialize our structure:

python
import struct


def pack(data):
    # Header: id (4 bytes), created_at (4 bytes), text length (2 bytes),
    # followed by the UTF-8 encoded text itself.
    text_bytes = data['text'].encode('utf-8')
    format_string = f'iiH{len(text_bytes)}s'
    return struct.pack(format_string, data['id'], data['created_at'], len(text_bytes), text_bytes)


def unpack(data):
    offset = struct.calcsize('iiH')  # 4 + 4 + 2 = 10-byte header
    i, ts, l = struct.unpack('iiH', data[:offset])
    text = struct.unpack(f'{l}s', data[offset:offset + l])[0].decode()
    return {
        'id': i,
        'created_at': ts,
        'text': text
    }


# test
packed = pack({"id": 1, "created_at": 1727625937, "text": "Hey!"})
print(f"{packed} -> {unpack(packed)}")

Now create the binary file:

python
with open(f'{workspace_dir}/merged.struct.bin', 'wb') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(pack(record))
shell
du -k /tmp/data/merged.struct.bin 
507908	merged.struct.bin

We have further reduced the file to ~508 MB.
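
Reading the records back is also straightforward: read the fixed-size header, take the text length from it, then read the payload. A minimal sketch that reuses the unpack() helper defined above; struct_bin_reader is just an illustrative name:

python
import struct


def struct_bin_reader(path):
    # Each record is a 10-byte header (id, created_at, text length)
    # followed by the UTF-8 text payload.
    header_size = struct.calcsize('iiH')
    with open(path, 'rb') as bin_file:
        while True:
            header = bin_file.read(header_size)
            if not header:
                break
            _, _, text_len = struct.unpack('iiH', header)
            yield unpack(header + bin_file.read(text_len))


# Example: print the first record
for record in struct_bin_reader(f'{workspace_dir}/merged.struct.bin'):
    print(record)
    break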

MessagePack

Sometimes it is not possible to have a predefined schema, especially if we want to store arbitrary JSON. Let's try MessagePack.

Install it first: pip install msgpack.

Now generate the merged binary file:

python
import msgpack

with open(f'{workspace_dir}/merged.msgpack.bin', 'wb') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(msgpack.packb(record))
shell
du -k /tmp/data/merged.msgpack.bin 
524292	merged.msgpack.bin

The size is a little bigger than with the custom protocol (~524 MB), but I think that is a reasonable price for being schemaless. MessagePack also has a very handy tool for sequential reading and random access:

python
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    file.seek(0)

    unpacker = msgpack.Unpacker(file)
    obj = unpacker.unpack()
    print(obj)
    print(unpacker.tell())  # current offset
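
To get true random access, we can scan the file once, remember the byte offset of every record, and later seek straight to the one we need. A sketch of that idea; the offsets index is my own addition, not something msgpack provides:

python
# One pass over the file: map each record id to its byte offset.
offsets = {}
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    unpacker = msgpack.Unpacker(file)
    pos = 0
    for obj in unpacker:
        offsets[obj['id']] = pos
        pos = unpacker.tell()  # offset of the next record

# Later: jump straight to a single record (id 42 is just an example).
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    file.seek(offsets[42])
    print(msgpack.Unpacker(file).unpack())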

Compression

Let's go further and try to compress these files with lz4 and zstd:

shell
# the whole directory of small files
-> time tar -cf - jsons | lz4 - jsons.tar.lz4
Compressed 3360880640 bytes into 628357651 bytes ==> 18.70%                    
tar -cf - jsons  19.93s user 299.75s system 61% cpu 8:38.46 total
lz4 - jsons.tar.lz4  2.20s user 1.41s system 0% cpu 8:38.46 total
-> time tar -cf - jsons | zstd -o jsons.tar.zst

-> time lz4 merged.json
Compressed 558739918 bytes into 522708179 bytes ==> 93.55%                     
lz4 merged.json  0.47s user 0.35s system 152% cpu 0.534 total
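
The measurements above use the command-line tools, but the same compression can be done from Python. A sketch using the zstandard package (pip install zstandard; this package is my assumption and is not used for the numbers below):

python
import zstandard

# Stream-compress the merged file without loading it into memory.
cctx = zstandard.ZstdCompressor(level=3)
with open(f'{workspace_dir}/merged.json', 'rb') as src, \
        open(f'{workspace_dir}/merged.json.zst', 'wb') as dst:
    cctx.copy_stream(src, dst)

# And decompress it back the same way.
dctx = zstandard.ZstdDecompressor()
with open(f'{workspace_dir}/merged.json.zst', 'rb') as src, \
        open(f'{workspace_dir}/merged.copy.json', 'wb') as dst:
    dctx.copy_stream(src, dst)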

Results

| name | raw size | reduction | lz4 size | lz4 time | zstd size | zstd time |
| --- | --- | --- | --- | --- | --- | --- |
| Files in dir | 4G | 1x | 628M | 8:38 | 464M | 8:55 |
| Merged text file | 557M | 7.18x | 523M | 0.53s | 409M | 1.62s |
| Merged bin file (struct) | 508M | 7.87x | 510M | 0.46s | 410M | 1.47s |
| Merged bin file (msgpack) | 524M | 7.63x | 514M | 0.36s | 411M | 1.6s |