Choosing the best table configuration for TimescaleDB
Task description
There is a pretty common task for marketplaces: calculating the volume of some products over a period of 1 day/7 days/30 days in near real time. TimescaleDB, or any other time-series database, is a perfect fit for this kind of task. Let's test TimescaleDB, since my current project already uses PostgreSQL. I will use a very simplified model from my current project (an NFT marketplace). We have thousands of collections across a bunch of blockchains, and we record every list/bid/sale transaction in a table. The total number of transactions is huge: billions of rows. The idea for making the calculation fast is to divide the table into time buckets and compute the numbers from them.
Preparation
This is the schema of the table together with some indexes:
create table public.sale_changes
(
    time          timestamptz not null,
    id            varchar(1024),
    collection_id varchar(512),
    blockchain    varchar(512),
    floor         numeric(78, 10),
    sales         bigint,
    volume        numeric(78, 10)
);
select create_hypertable('sale_changes', 'time', chunk_time_interval => interval '1 day');
create unique index ON sale_changes(blockchain, id, time);
create index ON sale_changes(collection_id, time);
select * from add_dimension('sale_changes', by_hash('blockchain', 16));
We created some additional indexes:
CREATE UNIQUE INDEX ON sale_changes(blockchain, id, time); -- this is needed for data consistency
CREATE INDEX ON sale_changes(collection_id, time); -- this is for faster aggregation by collection
Also, we added an additional dimension for better performance of blockchain-specific queries. 16 is the number of hash buckets; currently it's enough for tests.
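To double-check how the hypertable is partitioned after these commands, you can query the timescaledb_information.dimensions view (the column selection below is just an example):
select column_name, dimension_type, time_interval, num_partitions
from timescaledb_information.dimensions
where hypertable_name = 'sale_changes';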
I made a backup with data for the current year and loaded it into my local TimescaleDB:
- POLYGON: 4917155 rows
- ETHEREUM: 3032708 rows
- APTOS: 129760 rows
- CELO: 13692 rows
- ZKSYNC: 8832 rows
So we have 8102147 (~8M) rows locally.
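The per-blockchain breakdown above comes from a simple group-by over the loaded table:
select blockchain, count(*) as row_count
from sale_changes
group by blockchain
order by row_count desc;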
Let's use these two queries for performance testing:
-- all chains query
select csc.*,
       cs.floor,
       cs.owners
from (select collection_id,
             SUM(volume),
             SUM(sales) AS total_sales,
             CASE
                 WHEN first(floor, time) = 0 THEN 0
                 ELSE ROUND(((last(floor, time) - first(floor, time)) / first(floor, time)) * 100, 2)
             END        AS floor_change
      from sale_changes
      where time >= NOW() - INTERVAL '$period'
      group by collection_id) csc
-- the join that provides cs (current floor and owners) is omitted in this snippet
order by floor_change DESC
limit ?
-- Specific chain query:
select csc.*,
       cs.floor,
       cs.owners
from (select collection_id,
             SUM(volume),
             SUM(sales) AS total_sales,
             CASE
                 WHEN first(floor, time) = 0 THEN 0
                 ELSE ROUND(((last(floor, time) - first(floor, time)) / first(floor, time)) * 100, 2)
             END        AS floor_change
      from $table
      WHERE blockchain = ? and time >= NOW() - INTERVAL '$period'
      group by collection_id) csc
-- the join that provides cs (current floor and owners) is omitted in this snippet
order by floor_change DESC
limit ?
Testing
We will test performance by executing every query 10 times and collecting execution time metrics. My local environment:
- MacBook M1 with 32 GB RAM
- TimescaleDB 2.17.2-pg17 in Docker
- increased memory settings for PostgreSQL (applied as shown below): shared_buffers=8GB, work_mem=8GB
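A minimal way to apply these settings from SQL (shared_buffers only takes effect after a restart of the instance):
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET work_mem = '8GB';
SELECT pg_reload_conf(); -- enough for work_mem; restart the container for shared_buffers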
First attempt
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 71 | 105 | 77.3 | 74.5 | 9.7 |
1 day | 100 | 73 | 81 | 76.0 | 75.0 | 2.4 |
7 days | 100 | 161 | 192 | 168.1 | 166.0 | 8.5 |
30 days | 100 | 515 | 630 | 536.3 | 526.5 | 32.2 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 13 | 101 | 23.7 | 15.0 | 25.8 |
1 day | 100 | 13 | 99 | 23.4 | 14.5 | 25.2 |
7 days | 100 | 14 | 99 | 24.5 | 15.5 | 24.9 |
30 days | 100 | 16 | 104 | 26.1 | 17.5 | 26.0 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 27 | 114 | 37.8 | 29.5 | 25.5 |
1 day | 100 | 27 | 112 | 37.0 | 28.0 | 25.0 |
7 days | 100 | 66 | 172 | 79.6 | 68.5 | 30.9 |
30 days | 100 | 206 | 382 | 251.6 | 220.0 | 61.9 |
We can see that the queries for the 30-day period are not as fast as the others, except on the CELO blockchain. This is absolutely expected because the database has to process more rows. Let's try a different configuration and see if we can improve performance.
Second attempt
Let's turn on compression. We will use collection_id as the compression segmentby column because it makes sense to keep these numbers together.
ALTER TABLE sale_changes SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'collection_id'
);
SELECT compress_chunk(c) from show_chunks('sale_changes') c;
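To see how much space compression actually saved, we can check the compression stats (output trimmed to the total sizes):
select pg_size_pretty(before_compression_total_bytes) as before,
       pg_size_pretty(after_compression_total_bytes)  as after
from hypertable_compression_stats('sale_changes');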
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 249 | 1091 | 355.9 | 261.5 | 247.0 |
1 day | 100 | 237 | 286 | 255.4 | 251.0 | 15.4 |
7 days | 100 | 247 | 278 | 259.4 | 260.5 | 8.9 |
30 days | 100 | 349 | 412 | 370.2 | 358.5 | 21.0 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 19 | 131 | 33.1 | 21.0 | 32.8 |
1 day | 100 | 18 | 128 | 32.9 | 21.5 | 32.1 |
7 days | 100 | 21 | 129 | 33.3 | 22.0 | 31.9 |
30 days | 100 | 25 | 151 | 42.1 | 25.0 | 36.9 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 90 | 203 | 115.7 | 99.0 | 34.6 |
1 day | 100 | 94 | 208 | 110.5 | 99.5 | 32.8 |
7 days | 100 | 95 | 219 | 111.1 | 98.5 | 36.3 |
30 days | 100 | 149 | 272 | 170.7 | 159.5 | 34.2 |
The compressed table gave us worse results, even for the short-interval queries. However, for the 30-day period with all chains it improved the result. I guess keeping the numbers together by collection_id helped here.
Third attempt
Let's try to increase the chunk size of our hypertable, since we have plenty of RAM. Following the official recommendation, it is good if a chunk fits within 25% of memory; I currently have 24 GB of RAM allocated to Docker. So let's recreate the table with a 7-day chunk interval and reload the data:
select create_hypertable('sale_changes', 'time', chunk_time_interval => interval '7 day');
CREATE UNIQUE INDEX ON sale_changes(blockchain, id, time);
CREATE INDEX ON sale_changes(collection_id, time);
SELECT * FROM add_dimension('sale_changes', by_hash('blockchain', 16));
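To verify that the new chunks still fit the 25%-of-memory recommendation, we can look at per-chunk sizes (a quick sanity check, not part of the benchmark):
select chunk_name, pg_size_pretty(total_bytes) as chunk_size
from chunks_detailed_size('sale_changes')
order by total_bytes desc;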
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 10 | 61 | 17.1 | 11.0 | 14.8 |
1 day | 100 | 10 | 18 | 11.6 | 11.0 | 2.2 |
7 days | 100 | 87 | 119 | 91.5 | 88.0 | 9.3 |
30 days | 100 | 367 | 514 | 390.4 | 373.5 | 42.9 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 2 | 14 | 5.0 | 4.5 | 3.2 |
1 day | 100 | 2 | 12 | 5.0 | 4.0 | 2.7 |
7 days | 100 | 3 | 15 | 6.2 | 5.0 | 3.4 |
30 days | 100 | 6 | 15 | 7.6 | 7.0 | 2.6 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 3 | 14 | 5.9 | 5.0 | 2.8 |
1 day | 100 | 5 | 13 | 6.0 | 5.0 | 2.4 |
7 days | 100 | 32 | 47 | 37.6 | 34.5 | 5.4 |
30 days | 100 | 129 | 140 | 133.3 | 132.0 | 3.2 |
Now we have much better results! The only result I don't like is the all-chains query for the 30-day period; the CELO/ETHEREUM queries are really fast.
Fourth attempt
Let's try to use compression for the 7-day chunks.
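Compression is enabled the same way as in the second attempt (same segmentby column), now on the table with 7-day chunks:
ALTER TABLE sale_changes SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'collection_id'
);
SELECT compress_chunk(c) from show_chunks('sale_changes') c;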
This is the result of the first query:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 47 | 153 | 60.3 | 49.5 | 31.0 |
1 day | 100 | 50 | 62 | 54.8 | 54.0 | 4.4 |
7 days | 100 | 62 | 73 | 67.1 | 67.0 | 3.6 |
30 days | 100 | 139 | 182 | 150.4 | 145.0 | 12.5 |
This is the result of the second query with the small blockchain CELO:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 18 | 39 | 22.6 | 20.0 | 6.1 |
1 day | 100 | 20 | 36 | 25.1 | 23.0 | 5.7 |
7 days | 100 | 6 | 21 | 8.0 | 7.0 | 4.4 |
30 days | 100 | 7 | 21 | 9.4 | 8.0 | 4.0 |
This is the result of the second query with the big blockchain ETHEREUM:
Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
---|---|---|---|---|---|---|
1 hour | 100 | 21 | 34 | 23.8 | 23.0 | 3.5 |
1 day | 100 | 23 | 59 | 29.0 | 24.5 | 10.5 |
7 days | 100 | 31 | 58 | 38.2 | 36.5 | 7.4 |
30 days | 100 | 59 | 74 | 64.5 | 63.5 | 4.2 |
Finally, our queries are blazing fast. The slowest one is still the 30-day all-chains query, but its median time is below 0.2 s, so I would say that is fast for an aggregation over millions of rows.
Charts
Conclusion
Let's take a look at the table sizes using this query:
select
hypertable_name,
pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass))
from timescaledb_information.hypertables;
- 1-day chunk table: 2.4 GB
- 1-day chunk table with compression: 2.6 GB
- 7-day chunk table: 2.2 GB
- 7-day chunk table with compression: 0.3 GB
There is almost a 10x difference between the first table and the last one, which explains why the 7-day chunk table with compression is so fast. The configuration with 7-day chunks and compression is the clear winner.
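For a production setup, the manual compress_chunk calls can be replaced with a compression policy; a minimal sketch (the 7-day lag is just an example value, tune it to how long the data must stay uncompressed):
SELECT add_compression_policy('sale_changes', INTERVAL '7 days');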