Optimizing query performance during ingestion process #221
We definitely sacrifice some ingest performance for the sake of read performance. In Postgres, adding additional indexes is not always the best thing: each extra index adds overhead that actually slows down inserts/updates, since the index needs to get updated as well, and it leads to contention over which indexes end up staying in memory rather than needing to be read from disk. The multi-column index on `datetime, end_datetime` is optimal, and better than two separate indexes, for any search that uses those fields.

The query in question is really only used when updating partition statistics (which is generally done as part of an ingest process). The other option would be to defer the statistics and indexing work by using the query queue. As I get things ready for the workshop I'm giving next week, I'm going to be beefing up the docs for the Query Queue, but basically it allows deferring certain calls (like updating statistics) to a job that gets run from a cron rather than within the transaction itself.
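A minimal sketch of the trade-off described above, using a hypothetical `items` table (the table, column, and index names here are illustrative, not pgstac's actual definitions):

```sql
-- Hypothetical table standing in for a pgstac item partition.
CREATE TABLE items (
    id           text PRIMARY KEY,
    datetime     timestamptz NOT NULL,
    end_datetime timestamptz NOT NULL
);

-- A single composite index covers searches that constrain both columns,
-- so the partition-statistics query can use it without a second index:
CREATE INDEX items_dt_end_dt_idx ON items (datetime, end_datetime);

EXPLAIN SELECT min(datetime), max(end_datetime) FROM items;

-- Every additional single-column index would make each INSERT/UPDATE
-- during ingest pay one more index-maintenance write.
```

The design choice here is the standard Postgres trade-off: read-side plans improve with more indexes, while write-side throughput and buffer-cache pressure degrade.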
Thank you for sharing this information! Based on what I've observed, it seems that the loader is already utilizing the query queue:

```sql
postgis=> SET pgstac.use_queue = 't';
SET
postgis=> SELECT * FROM query_queue;
 query | added
-------+-------
(0 rows)

-- this is what happens in 'load_partition'
postgis=> SELECT update_partition_stats_q('_items_1');
 update_partition_stats_q
--------------------------

(1 row)

-- checking that the query is in the queue
postgis=> SELECT * FROM query_queue;
                      query                      |             added
-------------------------------------------------+-------------------------------
 SELECT update_partition_stats('_items_1', 'f'); | 2023-10-16 16:09:19.585958+00
(1 row)

postgis=> SELECT cron.schedule('0 0 * * *', 'CALL run_queued_queries();');
```

Is my understanding correct?
Yep, that is correct. I would note that, as with all settings, how you set `use_queue` determines its scope (a plain `SET` applies only to the current session). You can also run anything that is in the queue manually. If you didn't want to use the cron, another option, say if you were doing a large batch of ingests, would be to set `use_queue` before doing the ingests, ingest the data, call `run_queued_queries()`, and then reset `use_queue` to false.
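The batch-ingest pattern described above can be sketched as follows (a sketch only, assuming the `run_queued_queries` procedure shown earlier in this thread):

```sql
-- Defer statistics updates while loading a large batch.
SET pgstac.use_queue = 't';

-- ... run the ingest here ...

-- Process everything that was queued during the ingest.
CALL run_queued_queries();

-- Restore eager statistics updates for normal operation.
SET pgstac.use_queue = 'f';
```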
I noticed that this query takes a long time during the ingestion process. If I'm not mistaken, it originates from here. As we can see from the output of the `EXPLAIN` statement, it performs a sequential scan.

My initial assumption was that the `end_datetime` property is not properly indexed. However, I later discovered that PgSTAC uses a multicolumn index on `datetime` and `end_datetime` here. I believe this was done intentionally. @bitner, do you think it would make sense to build an individual index for `end_datetime`? Based on what I observe, having this index enables us to avoid a sequential scan.
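For reference, the experiment described above can be sketched like this (the index name is illustrative, and whether the extra index is worth its write overhead is exactly the trade-off discussed in this thread):

```sql
-- Hypothetical single-column index on end_datetime alone.
CREATE INDEX items_end_datetime_idx ON _items_1 (end_datetime);

-- Re-run EXPLAIN on the slow query to check whether the planner
-- now chooses an index scan instead of a sequential scan.
EXPLAIN SELECT min(datetime), max(end_datetime) FROM _items_1;
```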