
Optimizing query performance during ingestion process #221

Open
drnextgis opened this issue Oct 14, 2023 · 3 comments

Comments

drnextgis (Collaborator) commented Oct 14, 2023

I noticed that this query takes a long time during the ingestion process:

SELECT
    tstzrange(min(datetime), max(datetime),'[]'),
    tstzrange(min(end_datetime), max(end_datetime), '[]')
FROM _items_1

If I'm not mistaken, it originates from here.

As we can see from the output of the EXPLAIN statement, it performs a sequential scan:

postgis=> EXPLAIN SELECT                                                             
    tstzrange(min(datetime), max(datetime),'[]'),
    tstzrange(min(end_datetime), max(end_datetime), '[]')
FROM _items_1;
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1445345.11..1445345.12 rows=1 width=64)
   ->  Gather  (cost=1445344.88..1445345.09 rows=2 width=32)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=1444344.88..1444344.89 rows=1 width=32)
               ->  Parallel Seq Scan on _items_1  (cost=0.00..1399274.94 rows=4506994 width=16)
(5 rows)

My initial assumption was that the end_datetime property is not properly indexed:

postgis=> EXPLAIN SELECT tstzrange(min(datetime), max(datetime), '[]') FROM _items_1;
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.25..1.26 rows=1 width=32)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.62 rows=1 width=8)
           ->  Index Only Scan Backward using _items_1_datetime_end_datetime_idx on _items_1  (cost=0.43..2036102.77 rows=10816785 width=8)
                 Index Cond: (datetime IS NOT NULL)
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.43..0.62 rows=1 width=8)
           ->  Index Only Scan using _items_1_datetime_end_datetime_idx on _items_1 _items_1_1  (cost=0.43..2036102.77 rows=10816785 width=8)
                 Index Cond: (datetime IS NOT NULL)
(9 rows)
postgis=> EXPLAIN SELECT tstzrange(min(end_datetime), max(end_datetime), '[]') FROM _items_1;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1422810.13..1422810.14 rows=1 width=32)
   ->  Gather  (cost=1422809.91..1422810.12 rows=2 width=16)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=1421809.91..1421809.92 rows=1 width=16)
               ->  Parallel Seq Scan on _items_1  (cost=0.00..1399274.94 rows=4506994 width=8)
(5 rows)

However, I later discovered that PgSTAC uses a multicolumn index on datetime and end_datetime here. I believe this was done intentionally. @bitner, do you think it would make sense to build an individual index for end_datetime?

Based on what I observe, having this index enables us to avoid a sequential scan:

CREATE INDEX "end_datetime_idx" ON items USING BTREE(end_datetime ASC);
bitner (Collaborator) commented Oct 16, 2023

We definitely sacrifice some ingest performance for the sake of read performance.

In Postgres, adding more indexes is not always the best thing -- each additional index adds overhead that slows down inserts/updates, since it must be updated on every write, and it increases contention over which indexes stay in memory rather than being read from disk.

The multi-column index on (datetime, end_datetime) is optimal, and better than two separate indexes, for any search that uses the datetime filter. It works well for anything that needs an index on either the datetime column alone or on a mix of datetime and end_datetime, but it does not help queries that filter solely on end_datetime -- which essentially never happens for anything coming from the search endpoint.
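To illustrate the trade-off, here is a rough sketch (the exact index definition used by PgSTAC may differ) of why the multi-column index serves datetime-leading queries but not end_datetime-only ones:

```sql
-- Hypothetical sketch; PgSTAC's actual partition index definition may differ.
CREATE INDEX _items_1_datetime_end_datetime_idx
    ON _items_1 USING btree (datetime DESC, end_datetime ASC);

-- Can use the index: datetime is the leading key.
EXPLAIN SELECT min(datetime), max(datetime) FROM _items_1;

-- Cannot use the index efficiently: end_datetime is the second key,
-- so min/max on it alone falls back to a sequential scan.
EXPLAIN SELECT min(end_datetime), max(end_datetime) FROM _items_1;
```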

This index case really only arises when updating partition statistics (which is generally done as part of an ingest process). The other option would be to defer statistics and indexing by using the query queue. As I get things ready for the workshop I'm giving next week, I'm going to be beefing up the docs for the query queue; in short, it allows deferring certain calls (like updating statistics) to a job run from a cron rather than within the transaction itself.

drnextgis (Collaborator, Author) commented Oct 16, 2023

Thank you for sharing this information! Based on what I've observed, it seems that the loader is already utilizing the update_partition_stats_q function here. To run it asynchronously, it appears that I need to install pg_cron and then execute something similar to the code below:

postgis=> SET pgstac.use_queue = 't';
SET

postgis=> SELECT * from query_queue;
 query | added 
-------+-------
(0 rows)

-- this is what happens in 'load_partition'
postgis=> SELECT update_partition_stats_q('_items_1');
 update_partition_stats_q 
--------------------------
 
(1 row)

-- checking that the query is in the queue
postgis=> SELECT * from query_queue;
                      query                      |             added             
-------------------------------------------------+-------------------------------
 SELECT update_partition_stats('_items_1', 'f'); | 2023-10-16 16:09:19.585958+00
(1 row)


postgis=> SELECT cron.schedule('0 0 * * *', 'CALL run_queued_queries();');

Is my understanding correct?

bitner (Collaborator) commented Nov 7, 2023

Yep, that is correct.

I would note that, as with all settings, if you set it using SET pgstac.use_queue = 't';, only queries queued during that session will be deferred. In order to always use the queue, you would need to either set it at the role or database level (e.g. ALTER DATABASE pgstac SET pgstac.use_queue = 't';) or use the pgstac_settings table (pgstac will preferentially use the pgstac.* settings over anything in the pgstac_settings table, but not all DBaaS providers allow the use of arbitrary settings).
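For example, the two persistent options might look like this (the pgstac_settings columns are assumed to be name/value here; check your PgSTAC version):

```sql
-- Option 1: database-level setting, applies to all new sessions.
ALTER DATABASE pgstac SET pgstac.use_queue = 't';

-- Option 2: pgstac_settings table (assumed name/value schema;
-- note any pgstac.* GUC that is set takes precedence over this).
INSERT INTO pgstac_settings (name, value)
VALUES ('use_queue', 'true')
ON CONFLICT (name) DO UPDATE SET value = EXCLUDED.value;
```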

You can also run anything that is in the queue manually using either SELECT run_queued_queries_intransaction(); or CALL run_queued_queries();. The difference is that the "function" variant, run_queued_queries_intransaction, will run all the queued queries within a single transaction (and if any of them errors, all of them are rolled back), whereas the "procedure" variant commits each query separately as it goes through them. Procedures are slightly different in Postgres in that they can have transactional control, but they must be invoked using CALL rather than SELECT.

If you didn't want to use the cron, another option, if you were doing, say, a large batch of ingests, would be to set use_queue before doing the ingests, ingest the data, run CALL run_queued_queries();, and then reset use_queue to false.
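Put together, that batch workflow would look something like this sketch:

```sql
-- Defer partition-statistics updates for this session only.
SET pgstac.use_queue = 't';

-- ... run the ingests here (e.g. via pypgstac load) ...

-- Process everything that was queued; each query is committed separately.
CALL run_queued_queries();

-- Return to updating statistics inline with each transaction.
SET pgstac.use_queue = 'f';
```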
