RMG Improvement Proposal [RIP]: Make RMG-database a SQL Database [COMMENTS WELCOME!]
#2708
Comments
This is a big proposal! There are certainly some details I'm interested in, but maybe this isn't the best forum for a nuts and bolts discussion. More broadly I'm interested to hear:
Excellent proposal, I fully support the aspects of converting the database from executing Python files to SQL. I'm also interested to hear thoughts about version controlling the new database.
Some initial thoughts about the kinetics section: It could be OK to keep only the training reactions, and automatically generate the trees every time the database is updated, storing the updated tree with the generated rules. Another point to consider is that our kinetics data has
Another point for discussion is gas phase vs. liquid phase: Many of our kinetic libraries have mixed reactions, in the sense that some of them could be used for liquid phase with an appropriate correction, while others are strictly gas phase reactions (e.g., with specific third-body colliders or with a PDep expression). We could consider adding an attribute to each library reaction indicating whether it is appropriate for gas phase, liquid phase, or both.
Thank you both for your speedy feedback! I will quote reply to individual points to make sure we cover everything.
Matt's Notes
For users who just want to contribute a couple of known reactions, thermochemical values, etc., they would be able to clone RMG-database, edit the YAML files to add/modify their results (made easier by the reduced formatting), and then open a PR. They would never be responsible for generating the database; GitHub Continuous Integration would handle that.
For users who want to add a new library for their own local installation, the procedure would be the same as it is now, with the exception that they would need to first add the data and then generate the database. We plan to make this generation code part of the
For developers, the workflow would be largely the same as it is now, but just editing in YAML format. We would thoroughly document the process for building the actual database file so that they can ensure their changes are correct and unit tests pass, though that will also be done through GitHub Actions anyway.
Reading from the database would take place via Python functions that @jonwzheng and I write as part of the
Modifying the database would happen by modifying YAML files, and then rebuilding the database from the YAML files. It is these same YAML files containing the data which would be version controlled. This was chosen so that git diffs would still be meaningful, though YAML is not the final choice (open to suggestions), and so that users could make edits without learning any 'style' (like the current Python formatting).
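As a rough illustration of the rebuild step described above, the sketch below loads a few records into a fresh SQLite database. It assumes the YAML files have already been parsed into a list of dictionaries (the shape a YAML loader would return). The table name entry_demo, its columns, the placeholder records, and the function build_database are all hypothetical and are not taken from the demo repository.

```python
import sqlite3

# Hypothetical records, shaped like the list of dicts a YAML loader would return.
RECORDS = [
    {"library": "demo_library", "label": "species_A", "symmetry_number": 1},
    {"library": "demo_library", "label": "species_B", "symmetry_number": 2},
]

# Hypothetical schema: constraints reject malformed records at build time.
SCHEMA = """
CREATE TABLE entry_demo (
    id INTEGER PRIMARY KEY,
    library TEXT NOT NULL,
    label TEXT NOT NULL,
    symmetry_number INTEGER NOT NULL CHECK (symmetry_number >= 1),
    UNIQUE (library, label)
)
"""


def build_database(records, path=":memory:"):
    """Create a fresh SQLite database and load every record into it."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    with conn:  # commit on success, roll back if any record violates a constraint
        conn.executemany(
            "INSERT INTO entry_demo (library, label, symmetry_number) "
            "VALUES (:library, :label, :symmetry_number)",
            records,
        )
    return conn


conn = build_database(RECORDS)
n_entries = conn.execute("SELECT COUNT(*) FROM entry_demo").fetchone()[0]
```

Because the plaintext files stay the source of truth, the database file can be thrown away and regenerated at any time; a duplicate or malformed record makes the build fail with sqlite3.IntegrityError instead of slipping in silently.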
Our goal is that users can interact with
I covered this partially above, but will restate - our goal is that many of the functions that RMG-Py uses for navigating the database and accessing records can be implemented in
As far as speed/scaling/benchmarks - we will keep these suggestions in mind and work something up!
For the time being, we have chosen the
I believe that for remote computing systems, the only
Alon's Notes
This is definitely something we can look to include. When we look to edit the tree fitting and training generation code, we can consider adding fields to the database and steps in the algorithm which achieve these goals.
When we get to the stage of integrating
Great suggestion! This actually lends itself very well to some SQL programming constructs. What I imagine is that we have one table that stores our known Reactions. We then have a separate table, which is linked to that Reactions table by the unique identifier for each reaction, that stores Correction terms. It is then trivial to write a SQL command which loads reactions with (or without) corrections, etc. that can then be wrapped in Python and made easily available to RMG.
Our Proposed Path Forward
After some offline discussion, @jonwzheng and I have come up with this proposed plan for following through on this RIP.
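Returning to the Reactions and Corrections tables described earlier in this reply, a minimal sketch of that linkage might look like the following. The table names, column names, placeholder labels, and the load_reactions helper are assumptions made for illustration; none of them come from the demo repository.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables: corrections point at reactions by integer id.
    CREATE TABLE reaction (
        id INTEGER PRIMARY KEY,
        label TEXT NOT NULL
    );
    CREATE TABLE correction (
        id INTEGER PRIMARY KEY,
        reaction_id INTEGER NOT NULL REFERENCES reaction(id),
        kind TEXT NOT NULL
    );
    -- Placeholder rows, not real kinetics data.
    INSERT INTO reaction (id, label) VALUES (1, 'rxn_1'), (2, 'rxn_2'), (3, 'rxn_3');
    INSERT INTO correction (reaction_id, kind)
        VALUES (1, 'liquid_phase'), (3, 'liquid_phase');
    """
)


def load_reactions(conn, with_corrections):
    """Return labels of reactions that do (or do not) have a correction row."""
    operator = "EXISTS" if with_corrections else "NOT EXISTS"
    query = f"""
        SELECT r.label
        FROM reaction AS r
        WHERE {operator} (
            SELECT 1 FROM correction AS c WHERE c.reaction_id = r.id
        )
        ORDER BY r.id
    """
    return [label for (label,) in conn.execute(query)]
```

Calling load_reactions(conn, True) selects only the reactions that carry a correction, and load_reactions(conn, False) selects the rest, so the gas/liquid distinction raised above becomes a filter instead of a separate library.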
Cool! I definitely appreciate moving away from the current database format! So my understanding is that:
If this is correct, particularly when running RMG, why not instead:
For a typical RMG user the second way seems to me to achieve the same result, but be much simpler and less error prone: fewer dependencies, and it avoids the trouble of maintaining an active database. This of course also doesn't preclude loading the YAML database into a SQL database for other purposes outside of running RMG.
Side note on tree generation: One could definitely build a workflow for retraining the rate coefficient SIDTs without too much work. The vast majority of the trees train virtually instantly or in a few minutes. However, a couple of the larger SIDTs can take hours to train, which may change some considerations.
Important
Please read and comment on this issue if you are a developer or user of RMG - we need lots of input!
The purpose of this issue is to centralize discussion around a significant change that @JacksonBurns and @jonwzheng are proposing for RMG-database.
Also see the first RIP here: #2684
This issue is styled after the Python Enhancement Proposal (see PEP 733 for an example), thus the name 'RMG Improvement Proposal', or RIP for short.
RMG-database Today
RMG-database is a collection of Python files organized in a layout reflecting their contents. This includes:
statmech, transport, thermo, and solvation - contains libraries of literature results for individual molecules and then groups for making estimates
kinetics - similar to the above, except that groups belong to families which also contain rules and training data for generating said groups and rules.
surface - limited library with some surface data.
reference_sets - actually not Python files, but yml files containing "vetted" values for RMG/Arkane to fit data to perform BACs
quantum_corrections - Python dictionaries containing frequency scale factors, BACs, and AECs
In order to interact with RMG-database, one must have a working installation of RMG-Py and use its associated data classes to access numbers stored here. RMG-Py itself runs the Python exec function on these files to load them into global memory, once per process.
Challenges with Today's RMG-database
This format introduces many 'hard' and 'soft' challenges, which are detailed below:
exec in this context (and in general) is bad coding practice
Interacting with RMG-database is only possible via RMG-Py, whose dependency requirements are so restrictive that it functionally forbids running RMG-database with anything else installed
RMG-database is difficult since it requires formatted Python files, does not interface with any common data munging software, and as mentioned before is difficult to debug
RMG-website
kinetics_check_siblings_for_parents checks that sibling nodes are not also parent nodes, which can easily be validated at creation time in a SQL database
check_surface_thermo_libraries_have_surface_attributes checks that certain entries have expected rows, which we can enforce by simple construction in a SQL database
Our Proposal and a Working Demo
@jonwzheng and I (@JacksonBurns) propose we overhaul RMG-database from the ground up as a SQL database.
To that end, a working demo has been built showing how statmech could be converted into a SQL database, see this repository: https://github.com/JacksonBurns/rmgdb
In short, we do the following for each of the sub-databases in RMG-database that follow the library + family/group setup:
Define a set of Tables which represent each of the RMG Classes that are called within RMG-database, such as LinearRotor, and a 'base' table to hold all the calls to entry in each of the sub-databases. The aforementioned validation can be implemented here using SQL constructs like Triggers, Constraints, Key Relationships, etc.
Example of libraries schema
Example of groups schema
exec the files in RMG-database, but trick them into generating our new Tables rather than RMG classes.
Dump the database into a plaintext format like .yml which would replace the Python source files we currently have.
Build the database from the .yml files, enabling users to contribute to RMG-database by just editing the .yml files (proper configuration and formatting can be enforced by GitHub Actions).
Critically, this would allow the users to run one-liner commands with only pandas installed to then interact with RMG-database.
This would make it trivially easy to access RMG-database and its wealth of chemical data.
There are further benefits on the RMG-Py side of things.
Navigating the decision tree structure is dramatically faster than the current setup because it uses the SQL adjacency list layout for storing hierarchical data, enabling tree navigation by simple matching of integers.
Accessing the data in this way will also have a massive positive impact on RMG-Py's memory consumption - the current setup requires each parallel process to load the entire database into memory, whereas this would be shared among all processes and allow easy loading of only the required data.
Next Steps, Drawbacks, and Open Questions
The value of the data in RMG-database (and how difficult it is to get to it) makes this step worth doing on its own.
Part of the reason that the linked demo already exists is that @jonwzheng and I will likely see it to its end even if just for our own usage, since we would like to be able to access RMG-database in other projects.
The purpose of this issue, then, is to discuss what problems this could bring up with RMG-Py and how we can mitigate them during the design process.
Difficulty of Integration with RMG-Py and RMG-website
The database is arguably the most important piece of the RMG-Py puzzle, and so it is used throughout the source code in many different functions.
This is not a critical issue, since re-implementing any of the needed functionality will just be a matter of effort, but it is worth mentioning.
Also promising is that most of this functionality never changes (i.e., we will always need to find all the ancestors for a given node, a function which would never change), so once we implement it in SQL and wrap it in Python it can just sit.
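As a sketch of what such a wrapped function could look like, the example below finds all ancestors of a node with a recursive common table expression over an adjacency list table. The table tree_node, the node labels, and the names build_demo_tree and get_ancestors are assumptions for illustration, not the interface of the demo repository.

```python
import sqlite3


def build_demo_tree():
    """Create a small hypothetical tree stored as an adjacency list."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tree_node ("
        "id INTEGER PRIMARY KEY, label TEXT NOT NULL, parent_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tree_node (id, label, parent_id) VALUES (?, ?, ?)",
        [(1, "Root", None), (2, "C_H", 1), (3, "C_pri_H", 2), (4, "C_methyl_H", 3)],
    )
    return conn


# Walk upward one parent at a time until a row with no parent is reached.
ANCESTORS_SQL = """
WITH RECURSIVE ancestors(id, label, parent_id, depth) AS (
    SELECT p.id, p.label, p.parent_id, 1
    FROM tree_node AS n
    JOIN tree_node AS p ON p.id = n.parent_id
    WHERE n.id = ?
    UNION ALL
    SELECT p.id, p.label, p.parent_id, a.depth + 1
    FROM ancestors AS a
    JOIN tree_node AS p ON p.id = a.parent_id
)
SELECT label FROM ancestors ORDER BY depth
"""


def get_ancestors(conn, node_id):
    """Return ancestor labels, nearest parent first, root last."""
    return [label for (label,) in conn.execute(ANCESTORS_SQL, (node_id,))]


demo = build_demo_tree()
```

The SQL never needs to change when new nodes are added; callers only see get_ancestors(demo, 4), which walks from the deepest node up to the root.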
We would need to update all of our new user documentation. Workshop materials from previous years will also become out-of-date.
More serious is the integration with the various notebooks and scripts people have assembled over the years to create RMG-database. While we will do our best to provide examples of changing the main RMG code to work with the new database, things like the group fitting notebooks will need serious overhauls both on the main branch and for people locally, as well as the RMG-website source code.
On that note...
Backwards Compatibility
This would be totally backwards incompatible with previous versions.
Outstanding PRs, as well, would need to be restarted completely in order to work.
...and...
What do we Keep?
This is perhaps the biggest open question.
From our understanding, the libraries (used as lookup tables during RMG simulations) in each of the sub-databases contain data scraped from literature/simulated by us for various chemicals and chemical reactions - we would definitely keep those.
The training directories, as well as the rules and groups files, are less obvious to handle.
We believe that the training reactions can be generated automatically from the libraries, though that has not been done for all of them, meaning that some are hand constructed.
Similarly, some of the rules and groups appear to be hand-built whereas others are machine generated.
We ask this question for two reasons - it will inform the design of the database, and it will determine the scope of the work.
If it turns out that we want a way to automatically refit all the trees whenever we push new data, thus replacing the rules, groups, and training (?), we could incorporate that into this larger effort.
Please let us know your thoughts and any suggestions you have about how to best approach this - especially those related to the kinetics database, specifically the workflow of library -> training -> rules/groups.
After this issue has been opened, we will schedule a board meeting to discuss the way forward.
Thank you!