Title - Persistent Mayer Homology-Based Machine Learning Models for Protein-Ligand Binding Affinity Prediction.
Authors - Hongsong Feng, Li Shen, Jian Liu, and Guo-Wei Wei
- Introduction
- Model Architecture
- Prerequisites
- Datasets
- Modeling with PMH-Based Features
- Generation of PMH-Based Features for Protein-Ligand Complex
- Results
- License
- Citation
Artificial intelligence-assisted drug design is revolutionizing the pharmaceutical industry. Effective molecular features are crucial for accurate machine learning predictions, and advanced mathematics plays a key role in designing these features. Persistent homology theory, which equips topological invariants with persistence, provides valuable insights into molecular structures. The calculation of Betti numbers is based on a differential that typically satisfies $d^2 = 0$. Our recent work has extended this concept by employing Mayer homology with a generalized differential that satisfies $d^N = 0$ for $N \geq 2$, leading to the development of Persistent Mayer Homology (PMH) theory. This theory offers richer Betti number information across various scales. In this study, we utilize PMH to create a novel multiscale topological featurization approach for molecular representation. These PMH-based molecular features serve as valuable tools for descriptive and predictive analysis in molecular data and machine learning. By integrating these features with machine learning algorithms, we build highly accurate predictive models. Benchmark tests on established protein-ligand datasets, including PDBbind-2007, PDBbind-2013, and PDBbind-2016, demonstrate the superior performance of our models in predicting protein-ligand binding affinities.
Keywords: Persistent homology, Persistent Mayer homology, Protein-ligand binding affinity.
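To make the generalized differential concrete, the following minimal sketch checks $d^N = 0$ numerically. It assumes a Mayer-type differential $d[v_0,\dots,v_n] = \sum_i \xi^i [v_0,\dots,\hat{v}_i,\dots,v_n]$, where $\xi = e^{2\pi i/N}$ is a primitive $N$-th root of unity. This construction is an assumption for illustration and is not taken from the code in this repository; the helper names `mayer_boundary` and `differentials` are hypothetical.

```python
from itertools import combinations

import numpy as np


def mayer_boundary(simplices_n, simplices_prev, xi):
    """Matrix of d: C_n -> C_{n-1}, where the i-th face gets coefficient xi**i."""
    row = {s: k for k, s in enumerate(simplices_prev)}
    D = np.zeros((len(simplices_prev), len(simplices_n)), dtype=complex)
    for col, simplex in enumerate(simplices_n):
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # drop the i-th vertex
            D[row[face], col] += xi ** i
    return D


def differentials(N, n_vertices=5, max_dim=3):
    """Boundary matrices D[1..max_dim] for all simplices up to max_dim on n_vertices points."""
    xi = np.exp(2j * np.pi / N)  # primitive N-th root of unity (assumed coefficient choice)
    simplices = {n: list(combinations(range(n_vertices), n + 1))
                 for n in range(max_dim + 1)}
    return {n: mayer_boundary(simplices[n], simplices[n - 1], xi)
            for n in range(1, max_dim + 1)}


D2 = differentials(N=2)  # xi = -1 recovers the usual alternating-sign boundary
D3 = differentials(N=3)

classical_d2_vanishes = np.allclose(D2[1] @ D2[2], 0)
mayer_d2_vanishes = np.allclose(D3[1] @ D3[2], 0)
mayer_d3_vanishes = np.allclose(D3[1] @ D3[2] @ D3[3], 0)
print(classical_d2_vanishes, mayer_d2_vanishes, mayer_d3_vanishes)
# prints: True False True
```

For $N = 3$ two applications of the differential do not vanish but three do, which is why an $N$-chain complex admits several homology groups per dimension rather than one.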
A schematic illustration of the overall PMH-based machine learning platform for protein-ligand binding affinity prediction is shown below.
Further details on the architecture and its components are provided in the paper.
- numpy 1.21.0
- scipy 1.7.3
- scikit-learn 1.0.2
- python 3.10.12
- biopandas 0.4.1
- Biopython 1.75
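As a convenience, the listed versions can be compared against the current environment with a short script. This is a minimal sketch, not part of the repository: the distribution names used as keys are assumed to be the PyPI names of the packages above, and `check_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above, keyed by PyPI distribution name (assumed names).
EXPECTED = {
    "numpy": "1.21.0",
    "scipy": "1.7.3",
    "scikit-learn": "1.0.2",
    "biopandas": "0.4.1",
    "biopython": "1.75",
}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, wanted, found == wanted)
    return report


for name, (found, wanted, ok) in check_versions().items():
    status = "ok" if ok else "differs"
    print(f"{name}: installed={found} expected={wanted} ({status})")
```

A mismatch does not necessarily mean the code will fail; the script only reports where the environment differs from the versions listed here.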
Datasets | Total | Training Set | Test Set |
---|---|---|---|
PDBbind-v2007 | 1300 | 1105 Label | 195 Label |
PDBbind-v2013 | 2959 | 2764 Label | 195 Label |
PDBbind-v2016 | 4057 | 3767 Label | 290 Label |
- PDBbind Raw Data: Protein-ligand complex structures. Download from the PDBbind database.
- Label: The .csv file containing the protein ID and corresponding binding affinity for PDBbind data.
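A label file of this kind can be read into a dictionary with the standard library. The sketch below is illustrative only: the column names `pdb_id` and `affinity` are hypothetical and should be replaced with the actual header of the downloaded .csv file, and the rows shown are made up.

```python
import csv
import io


def load_labels(handle, id_col="pdb_id", affinity_col="affinity"):
    """Read {protein ID: binding affinity}; the column names are hypothetical."""
    reader = csv.DictReader(handle)
    return {row[id_col].strip(): float(row[affinity_col]) for row in reader}


# Made-up rows standing in for a real label file; these are not PDBbind values.
demo_csv = io.StringIO("pdb_id,affinity\nid_a,6.50\nid_b,4.25\n")
labels = load_labels(demo_csv)
print(labels)
# prints: {'id_a': 6.5, 'id_b': 4.25}
```

For a file on disk, the same function can be called as `load_labels(open(path, newline=""))`, ideally inside a `with` block.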
Datasets | Training Set | Test Set | PCC | RMSE (kcal/mol) |
---|---|---|---|---|
PDBbind-v2007 result | 1105 | 195 | 0.824 | 1.95 |
PDBbind-v2013 result | 2764 | 195 | 0.787 | 2.036 |
PDBbind-v2016 result | 3767 | 290 | 0.834 | 1.755 |
Note: For each dataset, twenty gradient boosting regression tree (GBRT) models were trained with distinct random seeds to reduce the effect of random initialization. The PMH-based features were paired with GBRT. Transformer-based sequence features were also generated and paired with GBRT to build machine learning models. All predictions can be found in the Results folder.
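The multi-seed procedure in the note can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' training script: the data are synthetic, the hyperparameters are placeholders, averaging the per-seed predictions is one possible way to combine the models, and no unit conversion to kcal/mol is applied. The helper names `seed_averaged_gbrt` and `pcc_and_rmse` are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor


def seed_averaged_gbrt(X_train, y_train, X_test, seeds, **gbrt_params):
    """Train one GBRT per seed and return the mean test prediction."""
    preds = []
    for seed in seeds:
        model = GradientBoostingRegressor(random_state=seed, **gbrt_params)
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)


def pcc_and_rmse(y_true, y_pred):
    """Pearson correlation coefficient and root-mean-square error."""
    pcc, _ = pearsonr(y_true, y_pred)
    rmse = np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return float(pcc), float(rmse)


# Synthetic stand-in for a PMH feature matrix and affinity labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

y_pred = seed_averaged_gbrt(
    X_train, y_train, X_test,
    seeds=range(5),  # the note above uses twenty seeds
    # Placeholder hyperparameters; subsample < 1 makes the seed matter.
    n_estimators=100, max_depth=3, learning_rate=0.1, subsample=0.7,
)
pcc, rmse = pcc_and_rmse(y_test, y_pred)
print(f"PCC={pcc:.3f}  RMSE={rmse:.3f}")
```

With `subsample=1.0` and all features considered at every split, a GBRT is close to deterministic, so varying the seed only has a visible effect when some form of random subsampling is enabled.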
# Example: generate the PMH features for PDB 2p7z. The PDB file is located in the PDB/2p7z folder, and the generated features are saved in features/2p7z.
python codes/PMH.py
This project is licensed under the MIT License - see the LICENSE file for details.
- Hongsong Feng, Li Shen, Jian Liu, and Guo-Wei Wei, "Persistent Mayer homology-based machine learning models for protein-ligand binding affinity prediction"