ZeoSyn: A Comprehensive Zeolite Synthesis Dataset Enabling Machine-learning Rationalization of Hydrothermal Parameters
This is the official repository of
ZeoSyn: A Comprehensive Zeolite Synthesis Dataset Enabling Machine-learning Rationalization of Hydrothermal Parameters (Published in ACS Central Science, 2024)
Elton Pan,† Soonhyoung Kwon,‡ Zach Jensen,† Mingrou Xie,‡ Rafael Gomez-Bombarelli,† Manuel Moliner,¶ Yuriy Roman,‡ and Elsa Olivetti∗,†
† MIT Materials Science & Engineering, ‡ MIT Chemical Engineering, ¶ ITQ-UPV
ZeoSyn is a large zeolite synthesis dataset comprising 23,961 zeolite synthesis routes, 233 zeolite topologies and 921 organic structure-directing agents (OSDAs). Each unique synthesis route consists of a comprehensive set of key synthesis parameters:
- Gel compositions (molar ratios between heteroatoms, mineralizing agents, and water)
- Reaction conditions (crystallization/aging temperature and time)
- Organic structure-directing agent (SMILES)
- Resultant zeolite product (3-letter IZA code)
We highly encourage you to check out our Demo notebook for a gentle introduction (< 3 min 🎉) to the key components of the dataset and to SHAP analysis for frameworks and building units.
(a) Example of a zeolite synthesis route in the dataset, consisting of the gel composition, inorganic precursors, reaction conditions, organic structure-directing agent (OSDA), and the resultant zeolite framework. Metadata of the scientific paper containing the synthesis route is also provided. (b) Frequency of elements present in the dataset. The values correspond to the log of the number of synthesis routes containing a specific element. (c) Total number of synthesis routes of small, medium, large, and extra-large pore zeolites extracted from literature across time in the dataset. Distributions of key gel composition variables in the dataset, including (d) ratios between heteroatoms, and (e) mineralizing agent, metal cation, and OSDA ratios (T = ∑_i n_i, where n_i is the amount of the i-th heteroatom present in the synthesis).
Zeolite frameworks can be divided into different categories based on their maximum ring
size. ZeoSyn contains 5250, 5494, 5769, and 716 synthesis routes for small (8MR), medium
(10MR), large (12MR), and extra-large pore (>12MR) zeolites, respectively.
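As a minimal sketch of the ring-size categories above, the snippet below maps a maximum ring size to its pore class and tallies the route counts quoted in this README. The function name `pore_class` and the "unclassified" fallback for other ring sizes are illustrative assumptions, not part of the repository code.

```python
# Pore classes keyed by maximum ring size, as described in the text above.
PORE_CLASS_BY_RING = {8: "small", 10: "medium", 12: "large"}

def pore_class(max_ring_size: int) -> str:
    """Return the pore class for a framework's maximum ring size (hypothetical helper)."""
    if max_ring_size > 12:
        return "extra-large"
    # Ring sizes not named in the text (e.g. 9MR) fall back to an assumed label.
    return PORE_CLASS_BY_RING.get(max_ring_size, "unclassified")

# Route counts quoted in this README.
ROUTE_COUNTS = {"small": 5250, "medium": 5494, "large": 5769, "extra-large": 716}
TOTAL_ROUTES = 23961

classified_routes = sum(ROUTE_COUNTS.values())
share_of_dataset = {name: count / TOTAL_ROUTES for name, count in ROUTE_COUNTS.items()}

print(pore_class(10), classified_routes)
```

The four categories together account for 17,229 of the 23,961 routes in the dataset.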
(a) Hierarchical clustering of the top 50 most frequent OSDAs in the dataset, labeled with the main classes of molecular structures. Splits are obtained through agglomerative hierarchical clustering of OSDA Morgan fingerprints. Each OSDA is colored by its molecular volume (orange), and the median largest included sphere of the zeolites formed by the OSDA (purple). The concomitant intensities of the colors show a positive correlation between the two properties. (b) Positive correlation between zeolite largest included sphere and OSDA volume. Red points refer to OSDAs with high asphericity, which account for the outliers. (c) Positive correlation between zeolite ring size and OSDA volume.
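To illustrate the idea behind agglomerative clustering of fingerprints, here is a small pure-Python sketch. It assumes Tanimoto distance and average linkage, neither of which is stated in this README, and it uses made-up bit sets in place of real Morgan fingerprints (which would normally be computed with a cheminformatics toolkit). It is not the code used to produce the figure.

```python
from itertools import product

def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """1 - |a ∩ b| / |a ∪ b| for two sets of 'on' fingerprint bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union

def agglomerate(fingerprints, n_clusters):
    """Naive average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(fingerprints))]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                pairs = list(product(clusters[i], clusters[j]))
                dist = sum(
                    tanimoto_distance(fingerprints[p], fingerprints[q]) for p, q in pairs
                ) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Toy fingerprints (hypothetical bit indices, not real OSDA data).
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({10, 11}),
    frozenset({10, 11, 12}),
]
toy_clusters = agglomerate(toy_fps, n_clusters=2)
print(toy_clusters)
```

Recording the order of merges instead of stopping at a fixed number of clusters would give the tree structure that a dendrogram displays.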
(a) Framework-level SHAP analysis revealing the top 10 (out of 43) most important synthesis parameters favoring the formation of specific frameworks. Each framework belongs to one of three types of synthesis based on its top synthesis parameters: 1) gel-dominated synthesis (CAN, KFI), where most top parameters are inorganic-related, 2) OSDA-dominated synthesis (ISV, ITE), where most top parameters are OSDA-related, and 3) balanced synthesis (IWW, RUT), where attribution is split evenly between inorganic and OSDA parameters. Every point is an individual synthesis colored by the value of the synthesis parameter (orange and blue indicate high and low values, respectively). (b) CBU-level SHAP analysis of large CBUs showing OSDA parameters favoring their formation.
The code in this repo has been tested on a Linux machine running Python 3.8.8.
Run the following terminal commands:
- Clone repo to local directory
git clone https://github.com/eltonpan/zeosyn_dataset.git
- Set up and activate conda environment
cd zeosyn_dataset
conda env create -f env/env.yml
conda activate zeosyn
- Add conda environment to Jupyter notebook
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=zeosyn
- Open jupyter notebooks
jupyter notebook <notebook_name>.ipynb
Make sure zeosyn is the selected environment under the dropdown menu Kernel > Change kernel.
All data required to reproduce the results in the paper can be found in the datasets/ directory:
- ZEOSYN.xlsx: ZeoSyn dataset
- osda_descriptors.csv: Descriptors of organic structure-directing agents
- zeolite_descriptors.csv: Descriptors of zeolite frameworks
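The files listed above could be loaded along the following lines. This is a sketch rather than the loading code used in the notebooks: the helper names `dataset_paths` and `load_tables` are hypothetical, pandas is assumed to be installed in the environment, and reading `.xlsx` files with pandas additionally needs an Excel engine such as openpyxl. No column names are assumed.

```python
from pathlib import Path

# File names as listed in this README; the dictionary keys are illustrative.
DATASET_FILES = {
    "zeosyn": "ZEOSYN.xlsx",
    "osda_descriptors": "osda_descriptors.csv",
    "zeolite_descriptors": "zeolite_descriptors.csv",
}

def dataset_paths(root="datasets"):
    """Map each table name to its expected path under the datasets/ directory."""
    root = Path(root)
    return {name: root / filename for name, filename in DATASET_FILES.items()}

def load_tables(root="datasets"):
    """Load every dataset file that exists into a pandas DataFrame."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    tables = {}
    for name, path in dataset_paths(root).items():
        if not path.is_file():
            continue  # skip files that are not present locally
        if path.suffix == ".xlsx":
            tables[name] = pd.read_excel(path)
        else:
            tables[name] = pd.read_csv(path)
    return tables
```

From the repository root, `load_tables()` returns a dictionary of DataFrames whose shapes and columns can then be inspected, for example with `df.shape` and `df.columns`.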
All visualizations, model training, and SHAP analysis in the paper can be reproduced by running the following notebooks:
- visualization.ipynb (~10 min, in-depth visualization of dataset)
- classifier.ipynb (~15 min, deeper dive into zeolite classifier + SHAP for frameworks, building units, competing phases and intergrowths)
If you are not using Colab: computing the SHAP values takes a while (~2 hours on 32 CPU cores). To skip this computation, you can download and load the precomputed SHAP values:
- Download shap_values.pkl from here
- Place shap_values.pkl in the shap/ directory
- Make sure the following block in classifier.ipynb is uncommented:
with open('shap/shap_values.pkl', 'rb') as handle:
    shap_values = pickle.load(handle)
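If you want the notebook to fall back gracefully when the file has not been downloaded, a guarded loader along these lines may help. The function name `load_precomputed_shap` is hypothetical and not part of the repository; only the path `shap/shap_values.pkl` comes from the instructions above.

```python
import pickle
from pathlib import Path

def load_precomputed_shap(path="shap/shap_values.pkl"):
    """Return the unpickled SHAP values, or None if the file is missing."""
    pkl_path = Path(path)
    if not pkl_path.is_file():
        return None  # caller can then decide to compute SHAP values instead
    with pkl_path.open("rb") as handle:
        return pickle.load(handle)

shap_values = load_precomputed_shap()
if shap_values is None:
    print("shap/shap_values.pkl not found; SHAP values need to be computed.")
```

As with any pickle file, only load shap_values.pkl if it comes from a source you trust, since unpickling can execute arbitrary code.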
If you use this dataset or code, please cite this paper:
@article{pan2024zeosyn,
title={ZeoSyn: A Comprehensive Zeolite Synthesis Dataset Enabling Machine-Learning Rationalization of Hydrothermal Parameters},
author={Pan, Elton and Kwon, Soonhyoung and Jensen, Zach and Xie, Mingrou and G{\'o}mez-Bombarelli, Rafael and Moliner, Manuel and Rom{\'a}n-Leshkov, Yuriy and Olivetti, Elsa},
journal={ACS Central Science},
volume={10},
number={3},
pages={729},
year={2024},
publisher={American Chemical Society}
}
If you have any questions, please contact us at [email protected].
- Test conda installation on non-Linux systems
- Check OSDA and zeolite descriptors have any redundant data
- Add Bibtex
- Update Bibtex (after issue is out)
- Add Colab notebook option
- Add contact info