Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MongoDB slow for large databases #24

Closed
eimrek opened this issue Aug 23, 2023 · 6 comments
Closed

MongoDB slow for large databases #24

eimrek opened this issue Aug 23, 2023 · 6 comments

Comments

@eimrek
Copy link
Member

eimrek commented Aug 23, 2023

I spent some time investigating the slowness of the li-ion-conductors optimade database.

Here's the info endpoint: https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/info

Accessing the /structures endpoint takes over 2 minutes.

I turned on performance profiling (db.setProfilingLevel(1)) and here's the part of the log for accessing the /structures endpoint (only the slow commands should show up here, meaning everything else was fast, i think):

/structures profiling (click to show)
> db.system.profile.find().pretty()
{
	"op" : "command",
	"ns" : "li-ion-conductors.structures",
	"command" : {
		"aggregate" : "structures",
		"pipeline" : [
			{
				"$match" : {
					
				}
			},
			{
				"$group" : {
					"_id" : 1,
					"n" : {
						"$sum" : 1
					}
				}
			}
		],
		"cursor" : {
			
		},
		"lsid" : {
			"id" : UUID("f822cd93-f176-4405-8c3a-062a5c7e79d8")
		},
		"$db" : "li-ion-conductors"
	},
	"keysExamined" : 0,
	"docsExamined" : 4396695,
	"cursorExhausted" : true,
	"numYield" : 7910,
	"nreturned" : 1,
	"locks" : {
		"FeatureCompatibilityVersion" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"ReplicationStateTransition" : {
			"acquireCount" : {
				"w" : NumberLong(7913)
			}
		},
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Mutex" : {
			"acquireCount" : {
				"r" : NumberLong(2)
			}
		}
	},
	"flowControl" : {
		
	},
	"storage" : {
		"data" : {
			"bytesRead" : NumberLong("17760670501"),
			"timeReadingMicros" : NumberLong(135074290)
		}
	},
	"responseLength" : 141,
	"protocol" : "op_msg",
	"millis" : 139328,
	"planSummary" : "COLLSCAN",
	"ts" : ISODate("2023-08-22T14:51:08.005Z"),
	"client" : "172.18.0.1",
	"allUsers" : [ ],
	"user" : ""
}

Some key points:

  • Basically, what seems to be the slow part, is just the counting of the total number of structures. ($match everything in the collection; group by $_id and then just sum the number).
  • it takes 139328 milliseconds to run
  • almost all of the time (135074 milliseconds) is spent on reading 17 GB of data from disk.
  • This command has to do a full scan of the documents (COLLSCAN). I don't think this command, in it's current state, could be sped up by using indexes.

I am wondering if this functionality could be implemented in a more efficient way. For example, db.structures.count() runs instantly.

Just for additional information, initially also accessing a single structure was as slow (2+ min), e.g. via

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures/5b5b4b01-5b7e-48ad-8e17-8077f9b0b5d2

But after I added the id index with db.structures.createIndex({ id: 1 }), it's fast now.

Pinging @ml-evs @unkcpz @superstar54 for comments/ideas regarding the "counting" speedup. i suspect this is probably something that should be adapted in https://github.com/Materials-Consortia/optimade-python-tools?

@ml-evs
Copy link
Collaborator

ml-evs commented Aug 23, 2023

Yeah, the counting is a pain but I don't see an easy way around it. Each query needs to know the total number of structures it returns in the response, according to the spec. The only speed-up we could do would be to implement some session-based cursor so that we don't have to repeat queries with a new skip every request (and thus don't have to repeat the count).

As the data_returned field is only a "SHOULD" in the specification, we could add a config option in optimade-python-tools that disables it (I can make a PR for this and we can test it).

@ml-evs
Copy link
Collaborator

ml-evs commented Aug 24, 2023

You can try installing from Materials-Consortia/optimade-python-tools#1757 and using the CONFIG.elide_data_returned = True setting to disable the count

@ml-evs
Copy link
Collaborator

ml-evs commented Aug 24, 2023

* Basically, what seems to be the slow part, is just the counting of the total number of structures. (`$match` everything in the collection; group by `$_id` and then just sum the number).
* This command has to do a full scan of the documents (`COLLSCAN`). I don't think this command, in it's current state, could be sped up by using indexes.

I am wondering if this functionality could be implemented in a more efficient way. For example, db.structures.count() runs instantly.

Reading this again, do you just mean that count() without a filter is very fast (and thus just listing all structures without a filter should be too), or do you mean that the count step itself is fast? It might instead be that the COLLSCAN comes from a sort on an unindexed field

@eimrek
Copy link
Member Author

eimrek commented Aug 24, 2023

Thanks for the comments and the modification Materials-Consortia/optimade-python-tools#1757. I'll test it when i have a moment.

Regarding your last question, just to explain it a bit more:

if I run

> db.structures.aggregate( [
...    { "$match": {} },
...    { "$group": { "_id": 1,"n": { $sum: 1 } } }
... ] )

{ "_id" : 1, "n" : 4396695 }
> 

You see that it just outputs the count. this takes 2+ minutes.

If i do

> db.structures.count()
4396695

I get the same result immediately.

Of course, for more complicated queries, this probably doesn't hold.

Regarding if this is the only slow part, i'm not sure. Nothing else was in the profiling log. I suspect that if count() was used, then the whole /structures endpoint would be fast, but I'm not 100% sure. I guess we'll find out when I test Materials-Consortia/optimade-python-tools#1757.

@eimrek
Copy link
Member Author

eimrek commented Sep 1, 2023

Hi @ml-evs, the changes in Materials-Consortia/optimade-python-tools#1757 indeed made the API fast, see here:

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures

However, there seems to be a small issue, the "links": ""next" data seems to be missing, and when doing

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures?page_offset=25

It shows the "links": ""next" as the first value, but to the current page, instead of the next one.

@eimrek
Copy link
Member Author

eimrek commented Oct 3, 2023

@eimrek eimrek closed this as completed Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants