Choosing the best table configuration for TimescaleDB
Task description
There is a pretty common task for marketplaces: calculating the volume of some products over a period of 1 day/7 days/30 days in near real time. TimescaleDB, or any other time-series database, is a perfect fit for this kind of task. Let's test TimescaleDB, since my current project already uses PostgreSQL. I will use a very simplified model from my current project (an NFT marketplace). We have thousands of collections across a bunch of blockchains, and we record every list/bid/sale transaction in a table. The total number of transactions is huge: billions of rows. The idea for making the calculation fast is to divide the table into time buckets and compute the numbers from them.
Preparation
This is the schema of the table together with some indexes:
create table public.sale_changes
(
    time          timestamptz not null,
    id            varchar(1024),
    collection_id varchar(512),
    blockchain    varchar(512),
    floor         numeric(78, 10),
    sales         bigint,
    volume        numeric(78, 10)
);
select create_hypertable('sale_changes', 'time', chunk_time_interval => interval '1 day');
create unique index ON sale_changes(blockchain, id, time);
create index ON sale_changes(collection_id, time);
select * from add_dimension('sale_changes', by_hash('blockchain', 16));
We created some additional indexes:
CREATE UNIQUE INDEX ON sale_changes(blockchain, id, time); -- this is needed for data consistency
CREATE INDEX ON sale_changes(collection_id, time); -- this is for faster aggregation by collection
Also, we added an additional dimension for better performance of blockchain-specific queries. 16 is the number of hash buckets; currently it's enough for tests.
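To double-check how the hypertable is partitioned after these commands, you can query the timescaledb_information.dimensions view (the column selection below is just an example):
select column_name, dimension_type, time_interval, num_partitions
from timescaledb_information.dimensions
where hypertable_name = 'sale_changes';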
I made a backup with data for the current year and loaded it into my local TimescaleDB:
- POLYGON: 4917155 rows
- ETHEREUM: 3032708 rows
- APTOS: 129760 rows
- CELO: 13692 rows
- ZKSYNC: 8832 rows
So we have 8102147 (~8M) rows locally.
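The per-blockchain breakdown above comes from a simple group-by over the loaded table:
select blockchain, count(*) as row_count
from sale_changes
group by blockchain
order by row_count desc;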
Let's use these two queries for performance testing:
-- all chains query
select csc.*,
       cs.floor,
       cs.owners
from (select collection_id,
             SUM(volume),
             SUM(sales) AS total_sales,
             CASE
                 WHEN first(floor, time) = 0 THEN 0
                 ELSE ROUND(((last(floor, time) - first(floor, time)) / first(floor, time)) * 100, 2)
             END        AS floor_change
      from sale_changes
      where time >= NOW() - INTERVAL '$period'
      group by collection_id) csc
-- the join that provides cs (current floor and owners) is omitted in this snippet
order by floor_change DESC
limit ?
-- Specific chain query:
select csc.*,
       cs.floor,
       cs.owners
from (select collection_id,
             SUM(volume),
             SUM(sales) AS total_sales,
             CASE
                 WHEN first(floor, time) = 0 THEN 0
                 ELSE ROUND(((last(floor, time) - first(floor, time)) / first(floor, time)) * 100, 2)
             END        AS floor_change
      from $table
      WHERE blockchain = ? and time >= NOW() - INTERVAL '$period'
      group by collection_id) csc
-- the join that provides cs (current floor and owners) is omitted in this snippet
order by floor_change DESC
limit ?
Testing
We will test performance by executing every query 10 times and collecting execution time metrics. My local environment:
- MacBook M1 with 32 GB RAM
- TimescaleDB 2.17.2-pg17 in Docker
- increased memory settings for PostgreSQL (applied as shown below): shared_buffers=8GB, work_mem=8GB
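A minimal way to apply these settings from SQL (shared_buffers only takes effect after a restart of the instance):
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET work_mem = '8GB';
SELECT pg_reload_conf(); -- enough for work_mem; restart the container for shared_buffers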
First attempt
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 71 | 105 | 77.3 | 74.5 | 9.7 |
1 day | 100 | 73 | 81 | 76.0 | 75.0 | 2.4 |
7 days | 100 | 161 | 192 | 168.1 | 166.0 | 8.5 |
30 days | 100 | 515 | 630 | 536.3 | 526.5 | 32.2 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 13 | 101 | 23.7 | 15.0 | 25.8 |
1 day | 100 | 13 | 99 | 23.4 | 14.5 | 25.2 |
7 days | 100 | 14 | 99 | 24.5 | 15.5 | 24.9 |
30 days | 100 | 16 | 104 | 26.1 | 17.5 | 26.0 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 27 | 114 | 37.8 | 29.5 | 25.5 |
1 day | 100 | 27 | 112 | 37.0 | 28.0 | 25.0 |
7 days | 100 | 66 | 172 | 79.6 | 68.5 | 30.9 |
30 days | 100 | 206 | 382 | 251.6 | 220.0 | 61.9 |
We can see that the queries for the 30-day period are not as fast as the others, except on the CELO blockchain. This is absolutely expected because the database has to process more rows. Let's try a different configuration and see if we can improve performance.
Second attempt
Let's turn on compression. We will use collection_id as the compression segmentby column because it makes sense to keep these numbers together.
ALTER TABLE sale_changes SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'collection_id'
);
SELECT compress_chunk(c) from show_chunks('sale_changes') c;
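To see how much space compression actually saved, we can check the compression stats (output trimmed to the total sizes):
select pg_size_pretty(before_compression_total_bytes) as before,
       pg_size_pretty(after_compression_total_bytes)  as after
from hypertable_compression_stats('sale_changes');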
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 249 | 1091 | 355.9 | 261.5 | 247.0 |
1 day | 100 | 237 | 286 | 255.4 | 251.0 | 15.4 |
7 days | 100 | 247 | 278 | 259.4 | 260.5 | 8.9 |
30 days | 100 | 349 | 412 | 370.2 | 358.5 | 21.0 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 19 | 131 | 33.1 | 21.0 | 32.8 |
1 day | 100 | 18 | 128 | 32.9 | 21.5 | 32.1 |
7 days | 100 | 21 | 129 | 33.3 | 22.0 | 31.9 |
30 days | 100 | 25 | 151 | 42.1 | 25.0 | 36.9 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 90 | 203 | 115.7 | 99.0 | 34.6 |
1 day | 100 | 94 | 208 | 110.5 | 99.5 | 32.8 |
7 days | 100 | 95 | 219 | 111.1 | 98.5 | 36.3 |
30 days | 100 | 149 | 272 | 170.7 | 159.5 | 34.2 |
The compressed table gave us worse results, even for the short-interval queries. However, for the 30-day period with all chains it improved the result. I guess keeping the numbers together by collection_id helped here.
Third attempt
Let's try to increase the chunk size of our hypertable, since we have plenty of RAM. Following the official recommendation, it is good if a chunk fits within 25% of memory; I currently have 24 GB of RAM allocated to Docker. So let's recreate the table with a 7-day chunk interval and reload the data:
select create_hypertable('sale_changes', 'time', chunk_time_interval => interval '7 day');
CREATE UNIQUE INDEX ON sale_changes(blockchain, id, time);
CREATE INDEX ON sale_changes(collection_id, time);
SELECT * FROM add_dimension('sale_changes', by_hash('blockchain', 16));
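To verify that the new chunks still fit the 25%-of-memory recommendation, we can look at per-chunk sizes (a quick sanity check, not part of the benchmark):
select chunk_name, pg_size_pretty(total_bytes) as chunk_size
from chunks_detailed_size('sale_changes')
order by total_bytes desc;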
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 10 | 61 | 17.1 | 11.0 | 14.8 |
1 day | 100 | 10 | 18 | 11.6 | 11.0 | 2.2 |
7 days | 100 | 87 | 119 | 91.5 | 88.0 | 9.3 |
30 days | 100 | 367 | 514 | 390.4 | 373.5 | 42.9 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 2 | 14 | 5.0 | 4.5 | 3.2 |
1 day | 100 | 2 | 12 | 5.0 | 4.0 | 2.7 |
7 days | 100 | 3 | 15 | 6.2 | 5.0 | 3.4 |
30 days | 100 | 6 | 15 | 7.6 | 7.0 | 2.6 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 3 | 14 | 5.9 | 5.0 | 2.8 |
1 day | 100 | 5 | 13 | 6.0 | 5.0 | 2.4 |
7 days | 100 | 32 | 47 | 37.6 | 34.5 | 5.4 |
30 days | 100 | 129 | 140 | 133.3 | 132.0 | 3.2 |
Now we have much better results! The only result I don't like is the all-chains query for the 30-day period; the CELO/ETHEREUM queries are really fast.
Fourth attempt
Let's try to use compression for the 7-day chunks.
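Compression is enabled the same way as in the second attempt (same segmentby column), now on the table with 7-day chunks:
ALTER TABLE sale_changes SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'collection_id'
);
SELECT compress_chunk(c) from show_chunks('sale_changes') c;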
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 47 | 153 | 60.3 | 49.5 | 31.0 |
1 day | 100 | 50 | 62 | 54.8 | 54.0 | 4.4 |
7 days | 100 | 62 | 73 | 67.1 | 67.0 | 3.6 |
30 days | 100 | 139 | 182 | 150.4 | 145.0 | 12.5 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 18 | 39 | 22.6 | 20.0 | 6.1 |
1 day | 100 | 20 | 36 | 25.1 | 23.0 | 5.7 |
7 days | 100 | 6 | 21 | 8.0 | 7.0 | 4.4 |
30 days | 100 | 7 | 21 | 9.4 | 8.0 | 4.0 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 21 | 34 | 23.8 | 23.0 | 3.5 |
1 day | 100 | 23 | 59 | 29.0 | 24.5 | 10.5 |
7 days | 100 | 31 | 58 | 38.2 | 36.5 | 7.4 |
30 days | 100 | 59 | 74 | 64.5 | 63.5 | 4.2 |
Finally, our queries are blazing fast. The slowest one is still the 30-day all-chains query, but its median time is below 0.2 s, so I would say that is fast for an aggregation over millions of rows.
Charts
Conclusion
Let's take a look at the table sizes using this query:
select
hypertable_name,
pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass))
from timescaledb_information.hypertables;
- 1-day chunk table: 2.4 GB
- 1-day chunk table with compression: 2.6 GB
- 7-day chunk table: 2.2 GB
- 7-day chunk table with compression: 0.3 GB
There is almost a 10x difference between the first table and the last one, which explains why the 7-day chunk table with compression is so fast. The configuration with 7-day chunks and compression is the clear winner.
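For a production setup, the manual compress_chunk calls can be replaced with a compression policy; a minimal sketch (the 7-day lag is just an example value, tune it to how long the data must stay uncompressed):
SELECT add_compression_policy('sale_changes', INTERVAL '7 days');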