This repository contains the code and data used in constructing the thermally-driven metal-insulator transition (MIT) classifiers, which are 3 binary classifiers: a Metal vs. non-Metal model, an Insulator vs. non-Insulator model and an MIT vs. non-MIT model.
Check out our paper in Chemistry of Materials:
Georgescu, A. B.; Ren, P.; Toland, A. R.; Zhang, S.; Miller, K. D.; Apley, D. W.; Olivetti, E. A.; Wagner, N.; Rondinelli, J. M. Database, Features, and Machine Learning Model to Identify Thermally Driven Metal−Insulator Transition Compounds. Chem. Mater. 2021. DOI: 10.1021/acs.chemmater.1c00905.
Note: The results in the Chem. Mater. paper are produced with the code and data sets in release v1.2.2.
- Model Description
- General Workflow
- Demo Notebooks
The research question of this project is whether a machine learning classification model can predict temperature-driven metal-insulator transition behavior based on a series of compositional and structural descriptors/features of a given compound.
The model type chosen for this task is an XGBoost tree classifier, implemented in Python. XGBoost models have helped win numerous Kaggle competitions and have been shown to perform well on classification tasks. If you wonder why we chose XGBoost over other model types, and binary classification over multi-class classification, you can refer to this section. The takeaway is that XGBoost is consistently among the best-performing model types and is faster to train than other models with comparable performance. All model types also perform better on binary classification than on multi-class classification.
Since the vast majority of the training data comes from oxides, and there are not many well-documented oxides that exhibit MIT behavior, the training dataset is quite small by machine learning standards (343 observations / rows). As a result, the models can easily overfit, especially with a high-dimensional feature set. There is an ongoing effort to find new MIT materials to add to the dataset, so the models trained on it are also subject to change over time.
We strongly encourage people to contribute temperature-driven MIT materials that aren't already included in our dataset. Please include your name, institution, the CIF file and reference publications in your email and send them to Professor James M. Rondinelli.
You can also suggest new MIT material(s) by opening an issue with the New MIT material template.
The CIF files are obtained from online databases such as the ICSD, Springer Materials and the Materials Project, in addition to a few hand-generated ones. The vast majority of CIF files are high-quality experimental structure files from the ICSD, with a few from Springer Materials and the Materials Project.
Note: Unfortunately, we cannot directly share the collected CIF files due to copyright concerns. However, you can find the material IDs of the compounds included in our dataset here (look at the struct_file_path column to find the IDs). Should you have access, you can use those IDs to download CIF files from ICSD, Springer Materials and the Materials Project.
You will find 4 suffixes in struct_file_path, which correspond to 4 sources as follows.
Suffix | Source |
---|---|
CollCode | ICSD |
SD | Springer Materials |
MP | Materials Project |
HandGenerated | Generated by hand based on publications |
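As an illustration of how the suffix table might be used, the sketch below maps a `struct_file_path` entry to its source. The exact file naming scheme is not documented here, so the assumption that the suffix appears as a token separated by `_`, `-` or `.` in the file name, as well as the function name `identify_source` and the example paths, are hypothetical.

```python
import re
from pathlib import PurePath

# Suffix-to-source mapping taken from the table above.
SOURCE_BY_SUFFIX = {
    "CollCode": "ICSD",
    "SD": "Springer Materials",
    "MP": "Materials Project",
    "HandGenerated": "Generated by hand based on publications",
}


def identify_source(struct_file_path):
    """Return the source of a structure file, or "Unknown" if no suffix matches.

    Assumption: the suffix is a token in the file name delimited by
    "_", "-" or ".", optionally followed by an ID (e.g. "CollCode12345").
    """
    tokens = re.split(r"[_\-.]", PurePath(struct_file_path).name)
    # Check later tokens first, since the suffix is expected near the end.
    for token in reversed(tokens):
        for suffix, source in SOURCE_BY_SUFFIX.items():
            if token.startswith(suffix):
                return source
    return "Unknown"


# Hypothetical example path, for illustration only.
print(identify_source("structures/LaNiO3_CollCode12345.cif"))
```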
This step creates an ionization lookup table that is used in the subsequent featurization process.
A total of 164 compositional and structural features are generated using a combination of matminer and our in-house handbuilt featurizers. These features then undergo further processing and selection down the pipeline.
A brief exploratory data analysis shows that the raw output from the featurizers contains features with missing values, features with zero variance (i.e. the feature value is the same for all compounds) and pairs of features with high linear correlation (greater than 0.95). Therefore, the data cleaning process is carried out in the following order:
- Drop rows / compounds with more than 10 missing features
- Impute missing values with KNNImputer
- For each row with missing values, find the 5 nearest neighbors using features that are not missing
- Impute missing values based on features in the 5 nearest neighbors weighted by their distance
- Remove features with zero variance
- Remove features with high linear correlation
- Find features with a linear correlation greater than 0.95
- Drop one of the two features in each pair of highly correlated features
After data cleaning, the dataset now has 106 (105 numeric & 1 one-hot-encoded categorical with 2 levels) features remaining and will be referred to as the full feature set from now on.
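The cleaning steps listed above can be sketched with pandas and scikit-learn as follows. This is a minimal illustration, not the repository's actual code: the function name `clean_features` is hypothetical, all columns are assumed to be numeric (the one-hot-encoded categorical feature would need to be encoded beforehand), and which feature of a correlated pair gets dropped is an arbitrary choice here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def clean_features(df, max_missing=10, n_neighbors=5, corr_threshold=0.95):
    """Apply the four cleaning steps to a numeric feature table."""
    # 1. Drop rows (compounds) with more than `max_missing` missing features.
    df = df.loc[df.isna().sum(axis=1) <= max_missing]
    # Guard (an addition of this sketch): fully empty columns cannot be imputed.
    df = df.dropna(axis=1, how="all")

    # 2. Impute missing values from the 5 nearest neighbors, weighted by distance.
    imputer = KNNImputer(n_neighbors=n_neighbors, weights="distance")
    df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

    # 3. Remove zero-variance features (same value for all compounds).
    df = df.loc[:, df.nunique() > 1]

    # 4. For each highly correlated pair, drop the feature that appears later.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)
```

Using only the upper triangle of the correlation matrix ensures that one feature from each correlated pair survives.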
The model building process follows an iterative approach. During the first iteration, the cleaned-up full feature set is fed into the classifiers, trained and then evaluated. Then with the help of SHAP values and domain knowledge, features with high importance are selected and used as input to the second iteration of model training and evaluation.
The training process starts with hyperparameter tuning with grid search cross validation. The default parameter search grid for the XGBClassifier is as follows.
Parameter | Search space |
---|---|
n_estimators | [10, 20, 30, 40, 80, 100, 150, 200] |
max_depth | [2, 3, 4, 5] |
learning_rate | np.logspace(-3, 2, num=6) |
subsample | [0.5, 0.6, 0.7, 0.8, 0.9, 1.0] |
scale_pos_weight | [num_of_negative_class / num_of_positive_class] |
base_score | [0.3, 0.5, 0.7] |
random_state | [seed] |
The scoring metric during tuning is f1_weighted. The best tuned parameters are then stored for model evaluation.
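As a rough sketch of the tuning step, the search grid from the table can be written as a dictionary and passed to scikit-learn's `GridSearchCV`. The function names `build_param_grid` and `tune`, the 0/1 label encoding, and the choice of 5 stratified folds for tuning are assumptions of this example and are not taken from the repository.

```python
from math import prod

import numpy as np


def build_param_grid(y, seed=0):
    """Build the default search grid; `y` holds binary labels (1 = positive class)."""
    y = np.asarray(y)
    n_pos = int(np.sum(y == 1))
    n_neg = int(np.sum(y == 0))
    return {
        "n_estimators": [10, 20, 30, 40, 80, 100, 150, 200],
        "max_depth": [2, 3, 4, 5],
        "learning_rate": np.logspace(-3, 2, num=6).tolist(),  # 0.001 ... 100
        "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        "scale_pos_weight": [n_neg / n_pos],
        "base_score": [0.3, 0.5, 0.7],
        "random_state": [seed],
    }


def tune(X, y, seed=0, n_splits=5):
    """Grid search with f1_weighted scoring; returns the best parameters."""
    # Imported lazily so the grid can be inspected without xgboost installed.
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from xgboost import XGBClassifier

    # Assumption: stratified folds are also used during tuning.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = GridSearchCV(
        XGBClassifier(),
        build_param_grid(y, seed),
        scoring="f1_weighted",
        cv=cv,
        n_jobs=-1,
    )
    search.fit(X, y)
    return search.best_params_


# Number of parameter combinations evaluated per cross-validation fold.
grid = build_param_grid([0, 0, 0, 1])
n_combinations = prod(len(values) for values in grid.values())
```

With this grid, each fold evaluates 8 × 4 × 6 × 6 × 1 × 3 × 1 = 3456 parameter combinations, which is one reason tuning is slow and memory intensive.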
Due to the scarcity of training examples, stratified 5-fold cross validation (cv) is used to evaluate model performance instead of a hold-out test set. There are 4 evaluation metrics used:
Since the cross validation splits depend on the random seed, a list of 10 seeds (integers from 0 to 9) is used to account for the variation in model performance due to different splits. For each seed, a stratified 5-fold cv is carried out, from which the median / mean value of each metric is obtained. With 10 seeds, there are 10 median / mean values for each metric, and a final median / mean is calculated from those 10 values, along with the interquartile range / standard deviation respectively. Essentially, the values reported are a median of medians by default, or an average of averages should you choose so.
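The aggregation described above can be sketched as follows. The function names `collect_scores` and `summarize` are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which variant is reported.

```python
import numpy as np


def collect_scores(model, X, y, scoring="f1_weighted", seeds=range(10), n_splits=5):
    """Run one stratified k-fold cv per seed; returns an (n_seeds, n_splits) array."""
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rows = []
    for seed in seeds:
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        rows.append(cross_val_score(model, X, y, scoring=scoring, cv=cv))
    return np.vstack(rows)


def summarize(scores, use_median=True):
    """Reduce an (n_seeds, n_folds) score array to (center, spread).

    use_median=True  -> median of per-seed medians, with interquartile range
    use_median=False -> mean of per-seed means, with standard deviation
    """
    scores = np.asarray(scores, dtype=float)
    if use_median:
        per_seed = np.median(scores, axis=1)
        q1, q3 = np.percentile(per_seed, [25, 75])
        return float(np.median(per_seed)), float(q3 - q1)
    per_seed = scores.mean(axis=1)
    # Assumption: population standard deviation (ddof=0).
    return float(per_seed.mean()), float(per_seed.std())
```

The spread is computed across the 10 per-seed values, so it reflects the sensitivity to the split rather than the fold-to-fold variation within one split.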
After model evaluation, the models are trained on the entire dataset (343 compounds with the full feature set) with the best parameters and then stored.
Using the stored models, a SHAP analysis is carried out to find the most important features. These important features are further screened using domain knowledge. Currently, 10 features are selected to create a reduced feature set. This feature selection step mainly serves to prevent overfitting.
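One common way to rank features from a SHAP analysis is by mean absolute SHAP value, sketched below. Whether the repository uses exactly this ranking is an assumption, and the function names and the example feature names are hypothetical. The final selection in this project also involves domain knowledge, which no code can replace.

```python
import numpy as np


def shap_values_for(model, X):
    """Compute SHAP values for a trained tree model (requires the shap package)."""
    import shap

    return shap.TreeExplainer(model).shap_values(X)


def rank_features(shap_values, feature_names, top_k=10):
    """Rank features by mean absolute SHAP value, largest first.

    shap_values: array of shape (n_samples, n_features).
    """
    shap_values = np.asarray(shap_values, dtype=float)
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:top_k]
    return [(feature_names[i], float(importance[i])) for i in order]


# Toy SHAP matrix with hypothetical feature names, for illustration only.
toy_values = [[1.0, -3.0, 0.5], [-1.0, 3.0, 0.5]]
ranking = rank_features(toy_values, ["feat_a", "feat_b", "feat_c"], top_k=2)
```

Taking the absolute value before averaging matters: a feature that pushes predictions strongly in both directions would otherwise average out to zero.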
With this reduced feature set, the entire model building process is repeated and the models are re-tuned, re-evaluated and re-trained on the reduced feature set.
The trained classifiers are made available to the larger materials science community through Jupyter notebooks hosted via the Binder service. One can immediately upload a CIF file and easily make a prediction using our classifiers directly in the web browser.
The models served on the Binder server are by default based on the reduced feature set.
There are several Jupyter notebooks available for easier result replication and demonstration purposes. You can immediately launch interactive versions of these notebooks in your web browser by clicking on the binder icon above or clicking on the subsection titles below.
Note: Any changes made on the server will not be saved unless you download a copy of the notebook onto your local machine.
You can replicate the workflow by using the notebooks in the following order.
This notebook generates the ionization energy lookup spreadsheet.
This notebook allows you to generate features for all the structures. As mentioned before, since we cannot share the structure files, running this notebook will not work due to the absence of CIF files.
This notebook presents an exploratory data analysis along with a data cleaning process on the output dataset from generate_compound_features.ipynb.
This notebook contains the code that tunes, trains and evaluates the models along with a SHAP analysis on models trained with the full feature set. It is NOT recommended to train the models directly on the Binder server since it is a very memory intensive process (it will also take a very long time to train!). The Binder container by default has 2GB of RAM and if the memory limit is exceeded, there is a possibility that the kernel will restart and you'll have to start over. That being said, you are welcome to download the repository onto your local machine and play around with the model parameters and selection.
This notebook demonstrates the prediction pipeline through which a prediction is made on a new structure that is not included in the original training set. You can even upload your own CIF structure and get a prediction! If you just want to play around with the trained models or make a prediction on a structure of your own choice, you can start here.
This notebook answers the question of "Why should one choose XGBoost over some other models?" by comparing the classification performance of 6 model types on the full feature set across 4 classification tasks. The model types are as follows.
Model type | Description |
---|---|
DummyClassifier | Naive models that are always random guessing (baseline performance) |
LogisticRegression | Linear classifiers with L2 regularization |
DecisionTreeClassifier | Generic decision tree classifiers |
RandomForestClassifier | Ensemble decision tree classifiers |
GradientBoostingClassifier | Gradient-boosting tree classifiers |
XGBClassifier | Extreme gradient-boosting tree classifiers |
The 4 classification tasks are:
- Metal vs. non-Metal (Insulators + MITs)
- Insulator vs. non-Insulator (Metals + MITs)
- MIT vs. non-MIT (Metals + Insulators)
- Multi-class classification
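The targets for these four tasks can all be derived from a single three-class label column, as in the sketch below. The label strings, the output column names and the function name `make_targets` are assumptions of this example; the dataset's actual label encoding may differ.

```python
import pandas as pd

# Assumed class labels; the dataset's actual encoding may differ.
CLASSES = ("Metal", "Insulator", "MIT")


def make_targets(labels):
    """Build three one-vs-rest binary targets plus one multi-class target."""
    labels = pd.Series(labels)
    targets = pd.DataFrame(
        {f"{name}_vs_rest": (labels == name).astype(int) for name in CLASSES}
    )
    # Multi-class target: integer code per class, in the order of CLASSES.
    targets["multiclass"] = labels.map({name: i for i, name in enumerate(CLASSES)})
    return targets


targets = make_targets(["Metal", "MIT", "Insulator", "MIT"])
```

Each compound is positive in exactly one of the three binary targets, so the three binary models together cover the same information as the multi-class target.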
The metrics and evaluation method are the same as the process mentioned earlier. The comparison results are summarized in this table. A summary plot is also provided for easier interpretation.
This notebook presents a brief SHAP analysis on models trained with the reduced feature set.
This is a brief tutorial notebook that explains some sub-functions in the compound_featurizer.py file.
This notebook benchmarks how "good" the handbuilt featurizer is against values from Tables 2 and 3 of Torrance et al.
This notebook contains visualization plots to be included in the paper.