Add config option `mongo_count_timeout` to skip the global count per request #1757

ml-evs · 2023-08-24T00:02:30Z

It seems that our mongo implementation is very slow for large collections, in part because of the global structure count required for each filter. This PR adds the ability to disable that (and thus disable data_returned).

cc @eimrek

codecov · 2023-08-24T00:13:17Z

Codecov Report

❗ No coverage uploaded for pull request base (master@22f51a1).
The diff coverage is 77.77%.

@@            Coverage Diff            @@
##             master    #1757   +/-   ##
=========================================
  Coverage          ?   90.77%           
=========================================
  Files             ?       74           
  Lines             ?     4627           
  Branches          ?        0           
=========================================
  Hits              ?     4200           
  Misses            ?      427           
  Partials          ?        0

| Flag | Coverage Δ |
|---|---|
| project | `90.77% <77.77%> (?)` |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| optimade/server/config.py | `93.75% <100.00%> (ø)` |
| ...made/server/entry_collections/entry_collections.py | `95.83% <100.00%> (ø)` |
| optimade/server/routers/utils.py | `95.12% <ø> (ø)` |
| optimade/server/entry_collections/mongo.py | `91.42% <71.42%> (ø)` |

JPBergsma · 2023-09-22T15:14:36Z

optimade/server/config.py

+    mongo_count_timeout: int = Field(
+        5,
+        description="""Number of seconds to allow MongoDB to perform a full database count before falling back to `null`.
+This operation can require a full COLLSCAN for empty queries which can be prohibitively slow if the database does not fit into the active set, hence a timeout can drastically speed-up response times.""",


Maybe I do not fully understand what you are writing here, but I think MongoDB should know how many entries there are in a collection, so for an empty filter the query should not be slow. For a more complex query for which MongoDB cannot drastically reduce the number of entries using one of the already existing indexes this would still be a useful feature though.

ml-evs · 2023-09-22

You are right, it shouldn't be slow, but previously for an empty filter we were just naively calling `count_documents`, which still does a full scan and can be very slow. Now I am using `estimated_document_count` for this case, which simply returns the number from the collection metadata. Non-empty filters still go through the regular count, so they can still make use of this timeout.
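A minimal sketch of the counting strategy described in this reply, not the code merged in this PR. The names `count_with_timeout` and `FakeCollection` are hypothetical and exist only for illustration. The collection methods used (`estimated_document_count`, `count_documents` with the `maxTimeMS` option) and the `ExecutionTimeout` exception are from PyMongo. The fallback exception class is an assumption that keeps the sketch runnable without PyMongo installed.

```python
from typing import Any, Mapping, Optional

try:
    from pymongo.errors import ExecutionTimeout
except ImportError:
    # Assumption: stand-in so the sketch runs without PyMongo installed.
    class ExecutionTimeout(Exception):
        """Placeholder for pymongo.errors.ExecutionTimeout."""


def count_with_timeout(
    collection: Any, query: Mapping[str, Any], timeout_s: int
) -> Optional[int]:
    """Return the number of matching documents, or None if counting times out."""
    if not query:
        # Empty filter: read the count from collection metadata, no scan.
        return collection.estimated_document_count()
    try:
        # Filtered count: let the server abort after the configured timeout.
        return collection.count_documents(query, maxTimeMS=1000 * timeout_s)
    except ExecutionTimeout:
        # Fall back to None, which serializes to `null` in a JSON response.
        return None


class FakeCollection:
    """Hypothetical in-memory stand-in for a PyMongo collection (demo only)."""

    def __init__(self, docs, slow=False):
        self.docs = list(docs)
        self.slow = slow

    def estimated_document_count(self):
        return len(self.docs)

    def count_documents(self, query, maxTimeMS=None):
        if self.slow:
            # Simulate the server giving up after maxTimeMS.
            raise ExecutionTimeout("operation exceeded time limit")
        return sum(
            all(doc.get(key) == value for key, value in query.items())
            for doc in self.docs
        )


docs = [{"nelements": 2}, {"nelements": 3}, {"nelements": 2}]
fast, slow = FakeCollection(docs), FakeCollection(docs, slow=True)
print(count_with_timeout(fast, {}, 5))                # metadata count
print(count_with_timeout(fast, {"nelements": 2}, 5))  # filtered count
print(count_with_timeout(slow, {"nelements": 2}, 5))  # timed out
```

With a real deployment, `collection` would be a `pymongo.collection.Collection`. The timeout applies only to the filtered branch, which matches the behaviour described above, where empty filters no longer need a scan at all.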

eimrek · 2023-09-27T16:00:56Z

Gave this a try on the MC server for the big database, and everything works well.

See https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures

Would be good to have this merged. Thanks!

However, the links:next is still broken for our APIs, but I realized it is also broken for releases 0.25.1 and 0.25.2, so it's not related to this change.

ml-evs mentioned this pull request Aug 24, 2023

MongoDB slow for large databases materialscloud-org/optimade-maker#24

Closed

ml-evs force-pushed the ml-evs/add_data_returned_skip branch 2 times, most recently from 9791fb8 to 4a1beeb Compare September 18, 2023 21:23

ml-evs changed the title ~~Add config option elide_data_returned to skip the global count per request~~ Add config option mongo_count_timeout to skip the global count per request Sep 18, 2023

ml-evs force-pushed the ml-evs/add_data_returned_skip branch from 4a1beeb to 06d41a6 Compare September 18, 2023 21:31

ml-evs marked this pull request as ready for review September 22, 2023 13:30

ml-evs requested review from CasperWA and JPBergsma as code owners September 22, 2023 13:30

JPBergsma reviewed Sep 22, 2023

ml-evs added the server Issues pertaining to the example server implementation label Sep 27, 2023

ml-evs added 5 commits September 27, 2023 20:22

Add config option elide_data_returned to skip the global count per …

ad83688

…query

Fix type hint

c709db4

Try latest MongoDB

b721b43

Use MongoDB estimated count in case of empty filter

e7ed485

Replace elide_data_returned with count timeout

13521da

ml-evs force-pushed the ml-evs/add_data_returned_skip branch from 06d41a6 to 13521da Compare September 27, 2023 18:22

ml-evs merged commit 7269566 into master Sep 27, 2023
11 checks passed

ml-evs deleted the ml-evs/add_data_returned_skip branch September 27, 2023 18:32
