How to Store Small Text Files Efficiently

Preparation

Let's create a test dataset containing a large number of text files.

python

import os
import json
import random
import string
from datetime import datetime

workspace_dir = '/tmp/data'
output_dir = f'{workspace_dir}/jsons'
os.makedirs(output_dir, exist_ok=True)


def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits + string.punctuation + ' ', k=length))


for i in range(1000000):
    data = {
        'text': generate_random_string(random.randint(0, 1000)),
        'created_at': int(datetime.now().timestamp())
    }
    filename = os.path.join(output_dir, f'{i}.json')
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, separators=(',', ':'))

This generates 1,000,000 JSON files with the following structure (to save space, the files contain no whitespace, which is what the separators=(',', ':') argument of json.dump is for):

json

{
  "text": "random string",
  "created_at": 1727625937
}

Let's measure the actual size on disk:

shell

du -sk /tmp/data/jsons
4000000 jsons

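That is 4 GB, although the data itself should take at most about 1 GB (1,000,000 × 1 KB). The reason is that the file system I use allocates space in 4 KB blocks (I'm on OS X, but the same happens on Linux), so every file occupies at least one full block: 1,000,000 × 4 KB = 4 GB.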
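The rounding effect can be sketched in a few lines. This is only an illustration: `allocated_size` is a hypothetical helper, and the 4096-byte block size is the value assumed above; real file systems may use other block sizes or store very small files inline.

```python
import math

BLOCK_SIZE = 4096  # assumed allocation unit, matching the 4 KB above


def allocated_size(file_size, block_size=BLOCK_SIZE):
    """Bytes occupied by a file when space is allocated in whole blocks."""
    return math.ceil(file_size / block_size) * block_size


# Three example file sizes in bytes; each fits in a single block.
total = sum(allocated_size(size) for size in [40, 557, 1100])
print(total)  # 12288, i.e. 3 blocks of 4096 bytes
```

On a POSIX system, the real allocation of a single file can be checked with `os.stat(path).st_blocks`, which is reported in 512-byte units.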

Creating a merged text file

First we need a generator that iterates over the files:

python

import os
import json


def json_file_reader(directory):
    json_files = [filename for filename in os.listdir(directory) if filename.endswith('.json')]
    sorted_files = sorted(json_files, key=lambda x: int(os.path.splitext(x)[0]))
    for filename in sorted_files:
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as json_file:
                yield os.path.splitext(filename)[0], json_file.read()

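Note that I sort the files by name, but this is not required.

In the merge script we need to add an id to each JSON dictionary, otherwise it is hard to tell which source file a record came from: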

python

with open(f'{workspace_dir}/merged.json', 'w', encoding='utf-8') as outfile:
    for file_id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(file_id)
        # One compact JSON object per line
        outfile.write(json.dumps(record, ensure_ascii=False, separators=(',', ':')) + '\n')

shell

du -k /tmp/data/merged.json 
557060 merged.json

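Now it is only 557 MB, a difference of more than 7x! We no longer waste clusters. The average record size is about 557 bytes, which matches our content generator random.randint(0, 1000) well: roughly 500 characters of text on average, plus the JSON keys, the timestamp, the id, and some escaping overhead.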
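One thing a directory of files gave us for free is direct access to a single record. A minimal sketch of how that could be recovered for the merged file follows, assuming one JSON object per line as written above. `build_line_index` and `read_record` are hypothetical helper names, and the demo file is a tiny stand-in for merged.json.

```python
import json
import os
import tempfile


def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, 'rb') as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def read_record(path, offsets, n):
    """Seek to the n-th line and parse it as JSON."""
    with open(path, 'rb') as f:
        f.seek(offsets[n])
        return json.loads(f.readline().decode('utf-8'))


# Tiny demo file standing in for merged.json
demo_path = os.path.join(tempfile.mkdtemp(), 'merged_demo.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    for i in range(3):
        row = {'text': f'record {i}', 'created_at': 1727625937, 'id': i}
        f.write(json.dumps(row, separators=(',', ':')) + '\n')

index = build_line_index(demo_path)
print(read_record(demo_path, index, 2)['text'])  # record 2
```

This relies on json.dumps escaping newlines inside strings, so a record never spans more than one line. The index costs one integer per record and could itself be stored next to the merged file.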

Binary format

Now let's try an optimized binary structure.

First we need to serialize our structure:

python

import struct


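HEADER_FORMAT = '<iiH'  # little-endian: id (int32), created_at (int32), text length (uint16)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 10 bytes


def pack(data):
    text_bytes = data['text'].encode('utf-8')
    # Fixed-size header followed by the raw UTF-8 text
    return struct.pack(f'{HEADER_FORMAT}{len(text_bytes)}s', data['id'], data['created_at'], len(text_bytes), text_bytes)


def unpack(data):
    record_id, created_at, length = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    text = data[HEADER_SIZE:HEADER_SIZE + length].decode('utf-8')
    return {
        'id': record_id,
        'created_at': created_at,
        'text': text
    }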


# Test
packed = pack({"id": 1, "created_at": 1727625937, "text": "Hey!"})
print(f"{packed} -> {unpack(packed)}")

Now create the binary file:

python

with open(f'{workspace_dir}/merged.struct.bin', 'wb') as outfile:
    for file_id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(file_id)
        outfile.write(pack(record))

shell

du -k /tmp/data/merged.struct.bin 
507908 merged.struct.bin

We have reduced the file further, to about 508 MB.
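Because the records are variable-length and written back to back, reading them again means parsing the 10-byte header first and then reading exactly as many text bytes as it announces. A minimal sketch follows, assuming the header layout used above (two 32-bit integers and a 16-bit length). It is written here as little-endian '<iiH', which has the same 10-byte layout as the native format on common little-endian machines. `pack_record` and `iter_records` are hypothetical names, and the in-memory buffer stands in for merged.struct.bin.

```python
import io
import struct

HEADER = struct.Struct('<iiH')  # id, created_at, text length: 10 bytes


def pack_record(record):
    """Serialize one record as a fixed header plus UTF-8 text."""
    text_bytes = record['text'].encode('utf-8')
    return HEADER.pack(record['id'], record['created_at'], len(text_bytes)) + text_bytes


def iter_records(stream):
    """Yield records from a binary stream of back-to-back packed records."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            break  # clean end of stream
        if len(header) < HEADER.size:
            raise ValueError('truncated header')
        record_id, created_at, length = HEADER.unpack(header)
        text_bytes = stream.read(length)
        if len(text_bytes) < length:
            raise ValueError('truncated record')
        yield {'id': record_id, 'created_at': created_at, 'text': text_bytes.decode('utf-8')}


# In-memory stand-in for merged.struct.bin
buffer = io.BytesIO(b''.join(
    pack_record({'id': i, 'created_at': 1727625937, 'text': 'Hey!' * i})
    for i in range(3)
))
records = list(iter_records(buffer))
print([r['text'] for r in records])  # ['', 'Hey!', 'Hey!Hey!']
```

The same loop should work on a file opened with mode 'rb'. The 16-bit length field limits each text to 65,535 bytes, which is plenty for this dataset.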

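MessagePack

Sometimes we cannot define the structure in advance, especially when we want to store arbitrary JSON. Let's try MessagePack.

First install it: pip install msgpack.

Now generate the merged binary file: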

python

import msgpack

with open(f'{workspace_dir}/merged.msgpack.bin', 'wb') as outfile:
    for file_id, data in json_file_reader(output_dir):
        record = json.loads(data)
        record['id'] = int(file_id)
        outfile.write(msgpack.packb(record))

shell

du -k /tmp/data/merged.msgpack.bin 
524292 merged.msgpack.bin

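At about 524 MB, the result is slightly larger than with the custom protocol, but I think that is a reasonable price for going schemaless. MessagePack also has a very convenient tool for random access: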

python

with open(f'{workspace_dir}/merged.msgpack.bin', 'rb') as file:
    file.seek(0)

    unpacker = msgpack.Unpacker(file)
    obj = unpacker.unpack()
    print(obj)
    print(unpacker.tell())  # current offset

Compression

Let's go further and try compressing these files with lz4 and zstd:

shell

# dir
-> time tar -cf - jsons | lz4 - jsons.tar.lz4
Compressed 3360880640 bytes into 628357651 bytes ==> 18.70%                    
tar -cf - jsons  19.93s user 299.75s system 61% cpu 8:38.46 total
lz4 - jsons.tar.lz4  2.20s user 1.41s system 0% cpu 8:38.46 total
-> time tar -cf - jsons | zstd -o jsons.tar.zst

-> time lz4 merged.json
Compressed 558739918 bytes into 522708179 bytes ==> 93.55%                     
lz4 merged.json  0.47s user 0.35s system 152% cpu 0.534 total

Results

Name	Original size	Reduction	lz4 size	lz4 time	zstd size	zstd time
Files in a directory	4G	1x	628M	8:38	464M	8:55
Merged text file	557M	7.18x	523M	0.53s	409M	1.62s
Merged binary file (struct)	508M	7.87x	510M	0.46s	410M	1.47s
Merged binary file (msgpack)	524M	7.63x	514M	0.36s	411M	1.6s
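The reduction factors in the table are simply the original 4 GB divided by each merged size. A short sketch of that arithmetic, using the rounded megabyte values from the table (the dictionary keys are only illustrative labels):

```python
# Sizes in MB, rounded as in the table above
original_mb = 4000
merged_mb = {
    'merged text file': 557,
    'merged binary file (struct)': 508,
    'merged binary file (msgpack)': 524,
}

# Reduction factor = original size / merged size
reduction = {name: round(original_mb / size, 2) for name, size in merged_mb.items()}
for name, factor in reduction.items():
    print(f'{name}: {factor}x')
```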
