REST Error on pulling a large amount of data via patterns (Legacy API) #3234

Closed
matthewcarbone opened this issue Aug 9, 2023 · 3 comments

Comments

@matthewcarbone
Contributor

Description

I am attempting to pull a "large" amount of data via the MPRester get_data() method, in this case only the Materials Project IDs consistent with a pattern, e.g.

from pymatgen.ext.matproj import MPRester

pattern = "Ti-O-*-*-*"  # All titanium oxides with 5 unique atom types
with MPRester(api_key) as mpr:  # api_key: your Materials Project API key
    result = mpr.get_data(pattern, prop="material_id")

This leads to a REST error, apparently because the query is too large:

MPRestError: BSON document too large (39439050 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.. Content: b'{"valid_response": false, "error": "BSON document too large (39439050 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.", "version": {"db": "2020_09_08", "pymatgen": "2022.0.8", "rest": "2.0"}, "created_at": "2023-08-09T08:25:12.741976", "traceback": "Traceback (most recent call last):\n File \"/var/www/python/matgen_prod/materials_django/rest/rest.py\", line 95, in wrapped\n d = func(*args, **kwargs)\n File \"/var/www/python/matgen_prod/materials_django/materials/rest.py\", line 121, in get_vasp_property\n entries = mdb.mat_qe.get_entries(crit, False, supported_properties)\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/matgendb/query_engine.py\", line 301, in get_entries\n for c in self.query(fields, criteria):\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/matgendb/query_engine.py\", line 654, in _result_generator\n for r in self._results:\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/pymongo/cursor.py\", line 1189, in next\n if len(self.__data) or self._refresh():\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/pymongo/cursor.py\", line 1104, in _refresh\n self.__send_message(q)\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/pymongo/cursor.py\", line 930, in __send_message\n response = client._send_message_with_response(\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/pymongo/mongo_client.py\", line 1138, in _send_message_with_response\n return self._reset_on_error(\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/pymongo/mongo_client.py\", line 1156, in _reset_on_error\n return func(*args, **kwargs)\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/pymongo/server.py\", line 105, in send_message_with_response\n sock_info.send_message(data, 
max_doc_size)\n File \"/opt/miniconda3/envs/mpprod3/lib/python3.8/site-packages/pymongo/pool.py\", line 593, in send_message\n raise DocumentTooLarge(\npymongo.errors.DocumentTooLarge: BSON document too large (39439050 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.\n"}'

Repro

See above

Expected behavior

I believe there should be some mechanism to split up the query. It seems a bit odd that I cannot pull data like this, and I am not sure what the alternative would be. Again, I am only attempting to get the Materials Project IDs themselves, not even structural data.

Is this fixed in the new API? Regardless, I'd think it should work everywhere.

Thanks!

Environment

MacOS M1 Ventura 13.4.1
pymatgen==2022.5.26

@shyuep
Member

shyuep commented Aug 10, 2023

  1. Please use the new API.
  2. This is actually not a good way to use this method. You can easily get all the mpids AND formulas/chemical systems in one shot and just post-process that data to get the mpids of the specific systems.
  3. Even if you prefer to use this method, having two wildcards makes it very difficult due to the sheer number of combinations. You can always use one wildcard with a loop over the other element.
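
The post-processing idea in point 2 can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout (a list of dicts with "material_id" and "chemsys" keys), the helper names matches_pattern and filter_ids, and the example IDs are all made up here, and the records are assumed to have been fetched already in a single bulk call.

```python
def matches_pattern(chemsys, required=("Ti", "O"), n_elements=5):
    """Return True if a chemical system such as "Ba-Ca-La-O-Ti" contains
    every required element and has exactly n_elements distinct elements."""
    elements = set(chemsys.split("-"))
    return len(elements) == n_elements and set(required) <= elements


def filter_ids(records, required=("Ti", "O"), n_elements=5):
    """Keep the IDs of the records whose chemical system matches."""
    return [
        rec["material_id"]
        for rec in records
        if matches_pattern(rec["chemsys"], required, n_elements)
    ]


# Made-up records standing in for one bulk download of IDs and chemical systems.
records = [
    {"material_id": "example-1", "chemsys": "Ba-Ca-La-O-Ti"},
    {"material_id": "example-2", "chemsys": "O-Ti"},
    {"material_id": "example-3", "chemsys": "Ba-Ca-Fe-La-O"},
    {"material_id": "example-4", "chemsys": "Li-Mn-Ni-O-Ti"},
]

selected = filter_ids(records)
print(selected)
```

Filtering locally like this replaces the wildcard expansion on the server with a single set comparison per record.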

@shyuep shyuep closed this as completed Aug 10, 2023
@matthewcarbone
Contributor Author

matthewcarbone commented Aug 10, 2023

@shyuep with due respect, none of these points answered my question. I would prefer it if you reopened the issue so we can discuss how to make this feature better!

  1. I would very much like to, but I can't for my use case. For instance, see "FEFFDictSet write_input appears to be bugged in some cases" (#3187).
  2. I'm aware. I set the pulled properties to just the Materials Project IDs to demonstrate that even pulling the minimum amount of data leads to this error. Even so, what method are you referring to here?
  3. This is a reasonable recommendation but IMO should be implemented under the hood in PMG. I don't think the user should have to deal with this type of subtlety. Do you agree?

@shyuep
Member

shyuep commented Aug 10, 2023

I just fixed the FEFFDictSet issue. That should allow you to use the new API.
As for implementing it under the hood, the premise is that you are asking for 92x92x92 = 778688 chemical systems (each * stands for approximately 92 elements of the periodic table), each with tens, if not hundreds, of structures. There is a lot of overlap in there too (the total number of chemical systems even exceeds the total number of structures in the Materials Project). So this is not a reasonable query, and it could not be handled even if we did a loop. In fact, the reasonable query in your case would be to find all materials containing Ti and O, and then set nelements=5 to fix the total number of elements.
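
A minimal sketch of that last suggestion, assuming the legacy MPRester.query() method and the "elements" and "nelements" fields of the legacy database. The helper names build_criteria and fetch_material_ids are hypothetical, and in recent pymatgen releases the legacy client may live elsewhere or behave differently, so treat this as an outline rather than a tested recipe.

```python
def build_criteria(required_elements, n_elements):
    """Build a MongoDB-style filter: the material must contain all of the
    required elements and have exactly n_elements distinct elements."""
    return {
        "elements": {"$all": list(required_elements)},
        "nelements": n_elements,
    }


def fetch_material_ids(api_key, required_elements=("Ti", "O"), n_elements=5):
    """Hypothetical helper: run the filter through the legacy client.
    Needs network access and a valid API key, so it is not called here."""
    # Imported lazily so the criteria helper works without pymatgen installed.
    from pymatgen.ext.matproj import MPRester  # assumed: legacy client

    criteria = build_criteria(required_elements, n_elements)
    with MPRester(api_key) as mpr:
        docs = mpr.query(criteria=criteria, properties=["material_id"])
    return [doc["material_id"] for doc in docs]


criteria = build_criteria(("Ti", "O"), 5)
print(criteria)
```

Compared with the "Ti-O-*-*-*" pattern, this sends one small filter to the server instead of asking it to expand every wildcard combination.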
