
Choosing the best table configuration for TimescaleDB

Task description

A pretty common task for marketplaces is calculating volume for products over a period of 1 day / 7 days / 30 days in near real time. TimescaleDB, or any other time-series database, is a perfect fit for this kind of task. Let's test TimescaleDB, since my current project already uses PostgreSQL. I will use a very simplified model from my current project (an NFT marketplace). We have thousands of collections across a bunch of blockchains, and we record every list, bid, or sale transaction in a table. The total number of transactions is huge: billions of rows. The idea for making the calculation fast is to divide the table into buckets by time and compute the numbers from those buckets.

Preparation

This is the schema of the table and some indexes:

sql
create table public.sale_changes
(
    time          timestamptz not null,
    id            varchar(1024),
    collection_id varchar(512),
    blockchain    varchar(512),
    floor         numeric(78, 10),
    sales         bigint,
    volume        numeric(78, 10)
);
select create_hypertable('sale_changes', 'time', chunk_time_interval => interval '1 day');
create unique index on sale_changes(blockchain, id, time);
create index on sale_changes(collection_id, time);
select * from add_dimension('sale_changes', by_hash('blockchain', 16));

We created some additional indexes:

  • CREATE UNIQUE INDEX ON sale_changes(blockchain, id, time); -- this is needed for data consistency
  • CREATE INDEX ON sale_changes(collection_id, time); -- this is for faster aggregation by collection

Also, we added an additional hash dimension for better performance of blockchain-specific queries. 16 is the number of hash partitions; for now, it's enough for our tests.
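
To double-check the partitioning layout, the hypertable's dimensions can be inspected via the timescaledb_information.dimensions view (a quick sanity check, not part of the benchmark):

sql
-- List the time and hash dimensions of the hypertable
select hypertable_name, column_name, dimension_type, time_interval, num_partitions
from timescaledb_information.dimensions
where hypertable_name = 'sale_changes';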

I made a backup of the data for the current year and loaded it into my local TimescaleDB:

  • POLYGON: 4917155 rows
  • ETHEREUM: 3032708 rows
  • APTOS: 129760 rows
  • CELO: 13692 rows
  • ZKSYNC: 8832 rows

So we have 8,102,147 rows (~8M) locally.
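
The per-chain counts above can be reproduced with a simple aggregation over the loaded data:

sql
-- Row counts per blockchain in the loaded dataset
select blockchain, count(*) as row_count
from sale_changes
group by blockchain
order by row_count desc;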

Let's use these two queries for performance testing:

sql
-- all chains query
select csc.*,
       cs.floor,
       cs.owners
from (select collection_id,
             SUM(volume) AS total_volume,
             SUM(sales)  AS total_sales,
             CASE
                 WHEN first(floor, time) = 0 THEN 0
                 ELSE ROUND(((last(floor, time) - first(floor, time)) / first(floor, time)) * 100, 2)
             END         AS floor_change
      from sale_changes
      where time >= NOW() - INTERVAL '$period'
      group by collection_id
      order by floor_change DESC
      limit ?) csc
-- the join to the collection stats table (cs) is not shown here

-- Specific chain query:
select csc.*,
       cs.floor,
       cs.owners
from (select collection_id,
             SUM(volume) AS total_volume,
             SUM(sales)  AS total_sales,
             CASE
                 WHEN first(floor, time) = 0 THEN 0
                 ELSE ROUND(((last(floor, time) - first(floor, time)) / first(floor, time)) * 100, 2)
             END         AS floor_change
      from $table
      where blockchain = ? and time >= NOW() - INTERVAL '$period'
      group by collection_id
      order by floor_change DESC
      limit ?) csc
-- the join to the collection stats table (cs) is not shown here
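
Here $period, $table and ? are placeholders that get substituted per test run. As a sketch (with hypothetical concrete values: ETHEREUM and a 7-day window), the inner aggregate can be wrapped in EXPLAIN ANALYZE to inspect the plan and chunk exclusion:

sql
-- Inner aggregate with concrete values substituted for the placeholders
explain analyze
select collection_id,
       SUM(volume) AS total_volume,
       SUM(sales)  AS total_sales,
       CASE
           WHEN first(floor, time) = 0 THEN 0
           ELSE ROUND(((last(floor, time) - first(floor, time)) / first(floor, time)) * 100, 2)
       END         AS floor_change
from sale_changes
where blockchain = 'ETHEREUM'
  and time >= NOW() - INTERVAL '7 days'
group by collection_id
order by floor_change DESC
limit 100;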

Testing

We will test performance by executing every query 10 times and collecting execution-time metrics. My local environment:

  • MacBook M1 with 32 GB RAM
  • TimescaleDB 2.17.2-pg17 in Docker
  • increased memory settings for PostgreSQL: shared_buffers=8GB, work_mem=8GB (one way to apply them is sketched below)
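
A minimal sketch of applying these settings with ALTER SYSTEM (assuming superuser access inside the container; editing postgresql.conf or passing -c flags to the container works just as well). Note that shared_buffers only takes effect after a restart:

sql
-- Increase memory limits; shared_buffers needs a server restart to take effect
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET work_mem = '8GB';
-- After restarting, verify the values:
SHOW shared_buffers;
SHOW work_mem;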

First attempt

This is the result of using the first query:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 71 | 105 | 77.3 | 74.5 | 9.7 |
| 1 day | 100 | 73 | 81 | 76.0 | 75.0 | 2.4 |
| 7 days | 100 | 161 | 192 | 168.1 | 166.0 | 8.5 |
| 30 days | 100 | 515 | 630 | 536.3 | 526.5 | 32.2 |

This is the result of using the second query with the small CELO blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 13 | 101 | 23.7 | 15.0 | 25.8 |
| 1 day | 100 | 13 | 99 | 23.4 | 14.5 | 25.2 |
| 7 days | 100 | 14 | 99 | 24.5 | 15.5 | 24.9 |
| 30 days | 100 | 16 | 104 | 26.1 | 17.5 | 26.0 |

This is the result of using the second query with the big ETHEREUM blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 27 | 114 | 37.8 | 29.5 | 25.5 |
| 1 day | 100 | 27 | 112 | 37.0 | 28.0 | 25.0 |
| 7 days | 100 | 66 | 172 | 79.6 | 68.5 | 30.9 |
| 30 days | 100 | 206 | 382 | 251.6 | 220.0 | 61.9 |

We can see that queries for the 30-day interval are not as fast as the others, except on the small CELO blockchain. This is expected, because the database has to process more rows. Let's try a different configuration and see if we can improve performance.

Second attempt

Let's turn on compression. We will use collection_id as the segmentby column because it makes sense to keep these numbers together.

sql
ALTER TABLE sale_changes SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'collection_id'
    );
SELECT compress_chunk(c) from show_chunks('sale_changes') c;
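
A quick way to see how much space compression saved is the hypertable_compression_stats function (available in TimescaleDB 2.x):

sql
-- Compare total size before and after compression
select pg_size_pretty(before_compression_total_bytes) as before_size,
       pg_size_pretty(after_compression_total_bytes)  as after_size
from hypertable_compression_stats('sale_changes');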

This is the result of using the first query:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 249 | 1091 | 355.9 | 261.5 | 247.0 |
| 1 day | 100 | 237 | 286 | 255.4 | 251.0 | 15.4 |
| 7 days | 100 | 247 | 278 | 259.4 | 260.5 | 8.9 |
| 30 days | 100 | 349 | 412 | 370.2 | 358.5 | 21.0 |

This is the result of using the second query with the small CELO blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 19 | 131 | 33.1 | 21.0 | 32.8 |
| 1 day | 100 | 18 | 128 | 32.9 | 21.5 | 32.1 |
| 7 days | 100 | 21 | 129 | 33.3 | 22.0 | 31.9 |
| 30 days | 100 | 25 | 151 | 42.1 | 25.0 | 36.9 |

This is the result of using the second query with the big ETHEREUM blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 90 | 203 | 115.7 | 99.0 | 34.6 |
| 1 day | 100 | 94 | 208 | 110.5 | 99.5 | 32.8 |
| 7 days | 100 | 95 | 219 | 111.1 | 98.5 | 36.3 |
| 30 days | 100 | 149 | 272 | 170.7 | 159.5 | 34.2 |

The compressed table gave us worse results, even for queries over small intervals. However, for the 30-day all-chains query it did improve the result. I guess keeping the numbers for the same collection_id together helped here.

Third attempt

Let's try to increase the chunk size of our hypertable, since we have plenty of RAM. Following the official recommendation, a chunk (with its indexes) should fit into roughly 25% of memory. I currently have 24 GB of RAM allocated to Docker. (The table was recreated and reloaded from the backup, since a new chunk interval only applies to newly created chunks.)

sql
select create_hypertable('sale_changes', 'time', chunk_time_interval => interval '7 day');
CREATE UNIQUE INDEX ON sale_changes(blockchain, id, time);
CREATE INDEX ON sale_changes(collection_id, time);
SELECT * FROM add_dimension('sale_changes', by_hash('blockchain', 16));
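
A quick sanity check that the new chunks really span 7 days each, using the timescaledb_information.chunks view:

sql
-- Confirm the new chunk boundaries
select chunk_name, range_start, range_end
from timescaledb_information.chunks
where hypertable_name = 'sale_changes'
order by range_start;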

This is the result of using the first query:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 10 | 61 | 17.1 | 11.0 | 14.8 |
| 1 day | 100 | 10 | 18 | 11.6 | 11.0 | 2.2 |
| 7 days | 100 | 87 | 119 | 91.5 | 88.0 | 9.3 |
| 30 days | 100 | 367 | 514 | 390.4 | 373.5 | 42.9 |

This is the result of using the second query with the small CELO blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 2 | 14 | 5.0 | 4.5 | 3.2 |
| 1 day | 100 | 2 | 12 | 5.0 | 4.0 | 2.7 |
| 7 days | 100 | 3 | 15 | 6.2 | 5.0 | 3.4 |
| 30 days | 100 | 6 | 15 | 7.6 | 7.0 | 2.6 |

This is the result of using the second query with the big ETHEREUM blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 3 | 14 | 5.9 | 5.0 | 2.8 |
| 1 day | 100 | 5 | 13 | 6.0 | 5.0 | 2.4 |
| 7 days | 100 | 32 | 47 | 37.6 | 34.5 | 5.4 |
| 30 days | 100 | 129 | 140 | 133.3 | 132.0 | 3.2 |

Now we have much better results! The only one I'm not happy with is the all-chains query for the 30-day period. The CELO/ETHEREUM queries are really fast.

Fourth attempt

Let's try to use compression on top of the 7-day chunks, reusing the settings from the second attempt (shown below).
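
For reference, this is the same compression setup as in the second attempt, applied to the 7-day-chunk table:

sql
ALTER TABLE sale_changes SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'collection_id'
);
SELECT compress_chunk(c) from show_chunks('sale_changes') c;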

This is the result of using the first query:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 47 | 153 | 60.3 | 49.5 | 31.0 |
| 1 day | 100 | 50 | 62 | 54.8 | 54.0 | 4.4 |
| 7 days | 100 | 62 | 73 | 67.1 | 67.0 | 3.6 |
| 30 days | 100 | 139 | 182 | 150.4 | 145.0 | 12.5 |

This is the result of using the second query with the small CELO blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 18 | 39 | 22.6 | 20.0 | 6.1 |
| 1 day | 100 | 20 | 36 | 25.1 | 23.0 | 5.7 |
| 7 days | 100 | 6 | 21 | 8.0 | 7.0 | 4.4 |
| 30 days | 100 | 7 | 21 | 9.4 | 8.0 | 4.0 |

This is the result of using the second query with the big ETHEREUM blockchain:

| Period | Limit | Min Time (ms) | Max Time (ms) | Avg Time (ms) | Median Time (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 1 hour | 100 | 21 | 34 | 23.8 | 23.0 | 3.5 |
| 1 day | 100 | 23 | 59 | 29.0 | 24.5 | 10.5 |
| 7 days | 100 | 31 | 58 | 38.2 | 36.5 | 7.4 |
| 30 days | 100 | 59 | 74 | 64.5 | 63.5 | 4.2 |

Finally, our queries are blazing fast. The slowest one is still the 30-day all-chains query, but its median time is under 0.2 s, which I would call fast for an aggregation over millions of rows.

Charts

Conclusion

Let's take a look at the table sizes using this query:

sql
select 
    hypertable_name, 
    pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass))
from timescaledb_information.hypertables;

  • 1-day chunk table: 2.4 GB
  • 1-day chunk table with compression: 2.6 GB
  • 7-day chunk table: 2.2 GB
  • 7-day chunk table with compression: 0.3 GB

There is almost a 10x difference between the first table and the last one, which explains why the 7-day chunk table with compression is so fast. The configuration with 7-day chunks and compression is the clear winner.