Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pypgstac load items partially without exceptions #285

Open
drnextgis opened this issue Jul 1, 2024 · 0 comments
Open

pypgstac load items partially without exceptions #285

drnextgis opened this issue Jul 1, 2024 · 0 comments

Comments

@drnextgis
Copy link
Collaborator

drnextgis commented Jul 1, 2024

The pypgstac load items command does not handle item loading atomically. There are scenarios where some items fail to load without pypgstac raising any exceptions. This is a serious issue because it creates the illusion that everything has been loaded correctly from the client's perspective, while in reality, some items are missing. Below is a reproducible test case that demonstrates this behavior:

  1. Create 2 collections with a month partition
  2. Prepare two files with items: items1.njson and items2.njson. Each file contains 10 items with the same datetime, belonging to collection1 (items1.njson) and 10 to collection2 (items2.njson).
  3. Ingest them concurrently:
$ pypgstac load items items1.njson --dsn postgresql://username:password@pgstac:5432/postgis & \
  pypgstac load items items2.njson --dsn postgresql://username:password@pgstac:5432/postgis
  1. Check the count. There should be 20 items, but only 10 were ingested:
postgis=# SELECT COUNT(1) FROM items;
 count 
-------
    10
(1 row)

Logs indicate a deadlock during the update_partition_stats function execution:

pgstac      | 2024-07-01 20:44:12.047 UTC [141] ERROR:  deadlock detected
pgstac      | 2024-07-01 20:44:12.047 UTC [141] DETAIL:  Process 141 waits for AccessExclusiveLock on relation 18386 of database 16384; blocked by process 140.
pgstac      | 	Process 140 waits for AccessExclusiveLock on relation 18972 of database 16384; blocked by process 141.
pgstac      | 	Process 141: SELECT update_partition_stats_q($1);
pgstac      | 	Process 140: SELECT update_partition_stats_q($1);
pgstac      | 2024-07-01 20:44:12.047 UTC [141] HINT:  See server log for query details.
pgstac      | 2024-07-01 20:44:12.047 UTC [141] CONTEXT:  SQL statement "REFRESH MATERIALIZED VIEW partitions"
pgstac      | 	PL/pgSQL function update_partition_stats(text,boolean) line 35 at SQL statement
pgstac      | 	SQL statement "SELECT update_partition_stats('_items_3_202003', 'f');"
pgstac      | 	PL/pgSQL function run_or_queue(text) line 11 at EXECUTE
pgstac      | 	SQL statement "SELECT run_or_queue(
pgstac      | 	        format('SELECT update_partition_stats(%L, %L);', _partition, istrigger)
pgstac      | 	    )"
pgstac      | 	PL/pgSQL function update_partition_stats_q(text,boolean) line 4 at PERFORM
pgstac      | 2024-07-01 20:44:12.047 UTC [141] STATEMENT:  SELECT update_partition_stats_q($1);

I understand that concurrent data ingestion into the same partition is not recommended. However, it would be helpful to receive a clear exception when something goes wrong. Additionally, if I'm correct, in this example we are ingesting data into different partitions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant