ZeoSyn: A Comprehensive Zeolite Synthesis Dataset Enabling Machine-learning Rationalization of Hydrothermal Parameters

This is the official repository of

ZeoSyn: A Comprehensive Zeolite Synthesis Dataset Enabling Machine-learning Rationalization of Hydrothermal Parameters (Published in ACS Central Science, 2024)

Elton Pan,† Soonhyoung Kwon,‡ Zach Jensen,† Mingrou Xie,‡ Rafael Gomez-Bombarelli,† Manuel Moliner,¶ Yuriy Roman,‡ and Elsa Olivetti∗,†

† MIT Materials Science & Engineering, ‡ MIT Chemical Engineering, ¶ ITQ-UPV

ZeoSyn is a large zeolite synthesis dataset comprising 23,961 zeolite synthesis routes, 233 zeolite topologies and 921 organic structure-directing agents (OSDAs). Each unique synthesis route consists of a comprehensive set of key synthesis parameters:

Gel compositions (molar ratios between heteroatoms, mineralizing agents, and water)
Reaction conditions (crystallization/aging temperature and time)
Organic structure-directing agent (SMILES)
Resultant zeolite product (3-letter IZA code)
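
As a rough mental model, one route can be pictured as a single record with these four groups of fields. The sketch below is illustrative only: the class name, field names, units, and example values are assumptions made for this README and do not mirror the actual column names in the dataset.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SynthesisRoute:
    """Hypothetical container for one route; not the dataset's real schema."""

    gel_ratios: Dict[str, float]   # molar ratios, keyed by an assumed label
    cryst_temp: Optional[float]    # crystallization temperature (unit assumed: deg C)
    cryst_time: Optional[float]    # crystallization time (unit assumed: hours)
    osda_smiles: Optional[str]     # OSDA as a SMILES string, None if OSDA-free
    product: str                   # IZA framework code of the resulting zeolite


# Made-up example values, chosen only to show the shape of a record.
# Tetrapropylammonium is a well-known OSDA for MFI, but the numbers are invented.
example = SynthesisRoute(
    gel_ratios={"Si/Al": 50.0, "H2O/T": 20.0, "OSDA/T": 0.2},
    cryst_temp=170.0,
    cryst_time=48.0,
    osda_smiles="CCC[N+](CCC)(CCC)CCC",
    product="MFI",
)
```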

1) Quick demo

We highly encourage you to check out our Demo notebook for a gentle introduction (< 3 min 🎉) to the key components of the dataset, plus SHAP analysis for frameworks and building units.

⚠️ Note: We strongly recommend using the Chrome browser for the Google Colab notebooks.

2) The ZeoSyn dataset

A) Overview

(a) Example of a zeolite synthesis route in the dataset, consisting of the gel composition, inorganic precursors, reaction conditions, organic structure-directing agent (OSDA), and the resultant zeolite framework. Metadata of the scientific paper reporting the synthesis route is also provided. (b) Frequency of elements present in the dataset. The values correspond to the logarithm of the number of synthesis routes containing a specific element. (c) Total number of synthesis routes of small, medium, large, and extra-large pore zeolites extracted from the literature over time. Distributions of key gel composition variables in the dataset, including ratios between (d) heteroatoms and (e) mineralizing agent, metal cation, and OSDA ratios (T = ∑_i n_i, where n_i is the amount of the i-th heteroatom present in the synthesis).

B) Zeolite frameworks

Zeolite frameworks can be divided into different categories based on their maximum ring size. ZeoSyn contains 5250, 5494, 5769, and 716 synthesis routes for small (8MR), medium (10MR), large (12MR), and extra-large pore (>12MR) zeolites, respectively.
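
The ring-size categories above can be written down as a small lookup. This is a minimal sketch: the function name is hypothetical, and returning "other" for ring sizes that the README does not name (for example below 8, or odd sizes) is an assumption, not a rule taken from the dataset.

```python
# Route counts per pore class, copied from the paragraph above.
ROUTE_COUNTS = {"small": 5250, "medium": 5494, "large": 5769, "extra-large": 716}
TOTAL_ROUTES = 23961  # total number of routes stated in the introduction


def pore_class(max_ring_size: int) -> str:
    """Map a framework's maximum ring size (in T-atoms) to a pore class."""
    if max_ring_size > 12:
        return "extra-large"
    named = {8: "small", 10: "medium", 12: "large"}
    # Assumption: anything not explicitly named in the README is "other".
    return named.get(max_ring_size, "other")


classified = sum(ROUTE_COUNTS.values())
share = classified / TOTAL_ROUTES
print(f"{classified} of {TOTAL_ROUTES} routes fall in the four classes ({share:.0%})")
```

The four classes together cover 17,229 routes, so part of the dataset lies outside them; the README does not say how those remaining routes are categorized.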

C) Organic structure-directing agents

(a) Hierarchical clustering of the top 50 most frequent OSDAs in the dataset, labeled with the main classes of molecular structures. Splits are obtained through agglomerative hierarchical clustering of OSDA Morgan fingerprints. Each OSDA is colored by its molecular volume (orange) and by the median largest included sphere of the zeolites it forms (purple). The concomitant intensities of the colors show a positive correlation between the two properties. (b) Positive correlation between zeolite largest included sphere and OSDA volume. Red points refer to OSDAs with high asphericity, which account for the outliers. (c) Positive correlation between zeolite ring size and OSDA volume.
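
To make the clustering step concrete, here is a small pure-Python sketch of agglomerative clustering over fingerprints. It is not the code used for the figure: fingerprints are represented as sets of "on" bit indices (in practice they would come from a cheminformatics toolkit), and both the Tanimoto distance and the average linkage are assumptions, since the caption does not state which metric or linkage was used. The toy fingerprints and OSDA names are invented.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Cluster = Tuple[str, ...]


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |a & b| / |a | b| for two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def agglomerate(fps: Dict[str, FrozenSet[int]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters: List[Cluster] = [(name,) for name in fps]
    merges: List[Tuple[Cluster, Cluster, float]] = []

    def linkage(c1: Cluster, c2: Cluster) -> float:
        # Mean pairwise distance between members of the two clusters.
        total = sum(tanimoto_distance(fps[x], fps[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the closest pair of clusters at each step.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Invented toy fingerprints: osda_a and osda_b share most bits, osda_c shares none.
toy = {
    "osda_a": frozenset({1, 2, 3, 4}),
    "osda_b": frozenset({1, 2, 3, 5}),
    "osda_c": frozenset({10, 11, 12}),
}
history = agglomerate(toy)
```

The merge history is the information a dendrogram is drawn from: similar OSDAs merge early at small distances, dissimilar ones merge last.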

D) SHAP analysis reveals key zeolite structure-synthesis relationships

(a) Framework-level SHAP analysis revealing the top 10 (out of 43) most important synthesis parameters favoring the formation of specific frameworks. Each framework belongs to one of three types of synthesis based on its top synthesis parameters: 1) gel-dominated synthesis (CAN, KFI), where most top parameters are inorganic-related, 2) OSDA-dominated synthesis (ISV, ITE), where most top parameters are OSDA-related, and 3) balanced synthesis (IWW, RUT), where attribution is split evenly between inorganic and OSDA parameters. Every point is an individual synthesis colored by the value of the synthesis parameter (orange and blue indicate high and low values, respectively). (b) CBU-level SHAP analysis of large CBUs showing OSDA parameters favoring their formation.
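
One common way to pick "top" parameters from SHAP output is to rank features by their mean absolute SHAP value across syntheses. The sketch below shows that ranking in plain Python. It is an assumption that the figure uses this exact criterion, and the function name, parameter names, and numbers are invented for illustration.

```python
from typing import List, Sequence


def top_parameters(
    shap_matrix: Sequence[Sequence[float]],
    names: Sequence[str],
    k: int = 10,
) -> List[str]:
    """Rank parameters by mean |SHAP| over all rows and return the top k names."""
    if not shap_matrix:
        raise ValueError("shap_matrix must contain at least one row")
    n_rows = len(shap_matrix)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(names)
    }
    return sorted(mean_abs, key=mean_abs.get, reverse=True)[:k]


# Invented toy values: one row per synthesis, one column per parameter.
toy_names = ["Si/Al", "osda_volume", "cryst_temp"]
toy_shap = [
    [0.1, -0.5, 0.0],
    [-0.3, 0.4, 0.1],
]
ranked = top_parameters(toy_shap, toy_names, k=2)
```

Taking the absolute value matters here: a parameter that pushes strongly in either direction counts as important, even if its positive and negative contributions would cancel in a plain mean.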

3) Setup and installation

The code in this repo has been tested on a Linux machine running Python 3.8.8.

Run the following terminal commands:

Clone repo to local directory

  git clone https://github.com/eltonpan/zeosyn_dataset.git

Set up and activate conda environment

  cd zeosyn_dataset

  conda env create -f env/env.yml

  conda activate zeosyn

Add conda environment to Jupyter notebook

  conda install -c anaconda ipykernel

  python -m ipykernel install --user --name=zeosyn

Open jupyter notebooks

  jupyter notebook <notebook_name>.ipynb

Make sure zeosyn is the selected environment under the dropdown menu Kernel > Change kernel.

4) Code reproducibility

All data required to reproduce the results in the paper can be found in the dataset/ directory:

ZEOSYN.xlsx: ZeoSyn dataset
osda_descriptors.csv: Descriptors of organic structure-directing agents
zeolite_descriptors.csv: Descriptors of zeolite frameworks
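
For a quick look at one of the CSV descriptor files without any third-party packages, something like the following sketch works. The helper name is hypothetical and no column names are assumed; the path should be adjusted to wherever the file lives in your checkout.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_descriptor_table(path: Path) -> List[Dict[str, str]]:
    """Read a CSV file into a list of rows, each a dict keyed by the header."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def describe(rows: List[Dict[str, str]]) -> str:
    """Summarize row count and column names without assuming any schema."""
    columns = list(rows[0].keys()) if rows else []
    return f"{len(rows)} rows, columns: {columns}"
```

Values are returned as strings, so numeric descriptors need converting before use. ZEOSYN.xlsx is an Excel workbook and needs a spreadsheet reader instead, for example pandas.read_excel.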

All visualizations, model training, and SHAP analysis in the paper can be reproduced by running the following notebooks:

visualization.ipynb (~10 min, in-depth visualization of dataset)
classifier.ipynb (~15 min, deeper dive into zeolite classifier + SHAP for frameworks, building units, competing phases and intergrowths)

If you are not using Colab, note that computing SHAP values is slow (~2 hours on 32 CPU cores). To skip this step, you can download and load the precomputed SHAP values instead:

Download shap_values.pkl from here
Place shap_values.pkl in shap/ directory
Make sure the following block in classifier.ipynb is uncommented

  with open('shap/shap_values.pkl', 'rb') as handle:
      shap_values = pickle.load(handle)
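
If you prefer not to comment and uncomment cells, the same idea can be wrapped in a small load-or-compute helper. This is a sketch, not code from classifier.ipynb: the function name and the compute callback are hypothetical stand-ins for whatever produces the SHAP values in your session.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_compute(path: Path, compute: Callable[[], Any]) -> Any:
    """Return the pickled object at `path`, or compute and cache it if absent."""
    if path.exists():
        # Only unpickle files from sources you trust; pickle can run arbitrary code.
        with path.open("rb") as handle:
            return pickle.load(handle)
    values = compute()  # the expensive step, e.g. the SHAP computation
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(values, handle)
    return values
```

With this pattern the expensive computation runs at most once per machine: the first call writes the pickle, and later calls read it back.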

5) Cite

If you use this dataset or code, please cite this paper:

@article{pan2024zeosyn,
  title={ZeoSyn: A Comprehensive Zeolite Synthesis Dataset Enabling Machine-Learning Rationalization of Hydrothermal Parameters},
  author={Pan, Elton and Kwon, Soonhyoung and Jensen, Zach and Xie, Mingrou and G{\'o}mez-Bombarelli, Rafael and Moliner, Manuel and Rom{\'a}n-Leshkov, Yuriy and Olivetti, Elsa},
  journal={ACS Central Science},
  volume={10},
  number={3},
  pages={729},
  year={2024},
  publisher={American Chemical Society}
}

6) Contact

If you have any questions, please contact us at [email protected].

To-do:

Test conda installation on non-Linux systems
Check whether the OSDA and zeolite descriptors contain any redundant data
Add Bibtex
Update Bibtex (after issue is out)
Add Colab notebook option
Add contact info
