How to store small text files efficiently
Preparation
Let's create a testing dataset with a lot of text files.
import os
import json
import random
import string
from datetime import datetime

workspace_dir = '/tmp/data'
output_dir = f'{workspace_dir}/jsons'
os.makedirs(output_dir, exist_ok=True)

def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits + string.punctuation + ' ', k=length))

for i in range(1000000):
    data = {
        'text': generate_random_string(random.randint(0, 1000)),
        'created_at': int(datetime.now().timestamp())
    }
    filename = os.path.join(output_dir, f'{i}.json')
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, separators=(',', ':'))
This generates 1 000 000 JSON files with the following structure (for efficiency the files contain no spaces, which is achieved with the separators=(',', ':') parameter of json.dump):
{
  "text": "random string",
  "created_at": 1727625937
}
Let's measure the actual size on disk:
du -sk /tmp/data/jsons
4000000 jsons
It's 4 GB, although the content itself should be <= 1 GB (1 000 000 * 1 KB at most). This is because the filesystem allocates space in 4 KB blocks (I use OS X, but on Linux the situation is the same), so each small file occupies a whole block: 1 000 000 * 4 KB = 4 GB.
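You can see this overhead per file by comparing the logical size with the space actually allocated. A minimal sketch (the path is just an example file from the dataset above; POSIX reports st_blocks in 512-byte units on both macOS and Linux):

import os

st = os.stat('/tmp/data/jsons/0.json')  # any file from the dataset
logical = st.st_size                     # bytes of actual content
allocated = st.st_blocks * 512           # bytes reserved on disk
print(logical, allocated)                # typically something like ~520 vs 4096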
Creating a merged text file
First, we need a generator to iterate over the files:
import os
import json

def json_file_reader(directory):
    json_files = [filename for filename in os.listdir(directory) if filename.endswith('.json')]
    sorted_files = sorted(json_files, key=lambda x: int(os.path.splitext(x)[0]))
    for filename in sorted_files:
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()
I also sort the files by their numeric filename, but this is not strictly necessary.
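A quick sanity check of the generator, assuming the dataset from the preparation step is in place, just peeking at the first record:

# peek at the first (id, raw json) pair produced by the generator
first_id, first_raw = next(json_file_reader(output_dir))
print(first_id, first_raw[:80])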
The merge script. We also need to add the id to the JSON dict, otherwise it would be impossible to tell which source file a record came from:
with open(f'{workspace_dir}/merged.json', 'w', encoding='utf-8') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(json.dumps(record, ensure_ascii=False, separators=(',', ':')) + '\n')
du -k /tmp/data/merged.json
557060 merged.json
Now it is only 557 MB, more than a 7x difference! We no longer waste partially filled blocks. The average record size is ~557 bytes, which matches the content generator random.randint(0, 1000) nicely: ~500 bytes of text on average plus the JSON keys, timestamp and id.
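The merged file is effectively JSON Lines: one record per line, so it can be streamed back without loading everything into memory. A minimal sketch of a reader (read_merged is just an illustrative name):

import json

def read_merged(path):
    # each line is a self-contained JSON record
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

for record in read_merged(f'{workspace_dir}/merged.json'):
    print(record['id'], record['created_at'])
    break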
Binary format
Now let's try an optimized binary structure.
First, we need to serialize the structure:
import struct

def pack(data):
    text_bytes = data['text'].encode('utf-8')
    # header: int id, int created_at, unsigned short text length, followed by the text itself
    format_string = f'iiH{len(text_bytes)}s'
    return struct.pack(format_string, data['id'], data['created_at'], len(text_bytes), text_bytes)
def unpack(data):
    offset = struct.calcsize('iiH')  # 4 + 4 + 2 = 10 bytes of header
    i, ts, l = struct.unpack('iiH', data[:offset])
    text = struct.unpack(f'{l}s', data[offset:offset + l])[0].decode()
    return {
        'id': i,
        'created_at': ts,
        'text': text
    }
# test
packed = pack({"id": 1, "created_at": 1727625937, "text": "Hey!"})
print(f"{packed} -> {unpack(packed)}")
Now create the binary file:
with open(f'{workspace_dir}/merged.struct.bin', 'wb') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(pack(record))
du -k /tmp/data/merged.struct.bin
507908 merged.struct.bin
We have further reduced the file, to ~508 MB.
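To get the records back, the fixed-size header tells us how many bytes of text follow. A sketch of a sequential reader for this format (read_struct_records is an illustrative name, not part of the code above):

import struct

HEADER = struct.Struct('iiH')  # id, created_at, text length -- same layout as pack()

def read_struct_records(path):
    with open(path, 'rb') as f:
        while True:
            header = f.read(HEADER.size)
            if not header:
                break
            record_id, ts, length = HEADER.unpack(header)
            text = f.read(length).decode('utf-8')
            yield {'id': record_id, 'created_at': ts, 'text': text}

print(next(read_struct_records(f'{workspace_dir}/merged.struct.bin')))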
MessagePack
Sometimes it is not possible to have a predefined schema, especially if we want to store arbitrary JSON. Let's try MessagePack.
Install it first: pip install msgpack.
Now generate the merged binary file:
import msgpack

with open(f'{workspace_dir}/merged.msgpack.bin', 'wb') as outfile:
    for id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(id)
        outfile.write(msgpack.packb(record))
du -k /tmp/data/merged.msgpack.bin
524292 merged.msgpack.bin
The size is a little bigger than with the custom protocol, ~524 MB, but I think that is a reasonable price for being schemaless. MessagePack also has a very handy tool for random access:
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    file.seek(0)
    unpacker = msgpack.Unpacker(file)
    obj = unpacker.unpack()
    print(obj)
    print(unpacker.tell())  # current offset in the file
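Using tell() we can build a simple id -> offset index once and then fetch any record with a single seek. A sketch, assuming every record carries the 'id' field we added during the merge (record 42 is just an example):

import msgpack

offsets = {}
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    unpacker = msgpack.Unpacker(file)
    pos = 0
    for obj in unpacker:
        offsets[obj['id']] = pos   # byte offset where this record starts
        pos = unpacker.tell()      # byte offset where the next record starts

# later: random access to record 42 with one seek
with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    file.seek(offsets[42])
    print(msgpack.Unpacker(file).unpack())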
Compression
Let's go further and try to compress these files with lz4 and zstd:
# the directory with 1 000 000 files
-> time tar -cf - jsons | lz4 - jsons.tar.lz4
Compressed 3360880640 bytes into 628357651 bytes ==> 18.70%
tar -cf - jsons  19.93s user 299.75s system 61% cpu 8:38.46 total
lz4 - jsons.tar.lz4  2.20s user 1.41s system 0% cpu 8:38.46 total
-> time tar -cf - jsons | zstd -o jsons.tar.zst

# the merged text file
-> time lz4 merged.json
Compressed 558739918 bytes into 522708179 bytes ==> 93.55%
lz4 merged.json  0.47s user 0.35s system 152% cpu 0.534 total
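A nice property of compressing the merged text file (as opposed to tarring the whole directory) is that it can still be streamed record by record. A sketch, assuming the lz4 Python package (pip install lz4) and the merged.json.lz4 file produced by the CLI above; lz4.frame reads the frame format that the lz4 CLI writes:

import json
import lz4.frame  # pip install lz4

with lz4.frame.open(f'{workspace_dir}/merged.json.lz4', 'rt', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print(record['id'])
        break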
Results
| name | raw size | reduction | lz4 size | lz4 time | zstd size | zstd time |
|---|---|---|---|---|---|---|
| Files in dir | 4G | 1x | 628M | 8:38 | 464M | 8:55 |
| Merged text file | 557M | 7.18x | 523M | 0.53s | 409M | 1.62s |
| Merged bin file (struct) | 508M | 7.87x | 510M | 0.46s | 410M | 1.47s |
| Merged bin file (msgpack) | 524M | 7.63x | 514M | 0.36s | 411M | 1.6s |