TencentAI4S/tfold

This package provides an implementation of the inference pipeline of tFold, including tFold-Ab, tFold-Ag and tFold-TCR.

We also provide:

  1. A pre-trained language model named ESM-PPI, which extracts both intra-chain and inter-chain information from a protein complex to generate features for downstream tasks.
  2. The test set we construct in our paper.
  3. A human germline antibody frameworks library to guide antibody generation using tFold-Ag.

Any publication that discloses findings arising from using this source code or the model parameters should cite the tFold paper.

Please also refer to the Supplementary Information for a detailed description of the method.

If you have any questions, please contact the tFold team at [email protected].

For business partnership opportunities, please contact [email protected].

Main models

| Shorthand | Dataset | Description |
|---|---|---|
| ESM-PPI | UniRef50, PDB, PPI, Antibody | General-purpose protein language model, further pre-trained using ESM2 with 650M parameters. Can be used to predict multimer structure directly from individual sequences |
| ESM-PPI-tcr | UniRef50, PDB, PPI, Antibody, TCR, peptide | General-purpose protein language model, further pre-trained using ESM2 with 650M parameters. Can be used to predict multimer structure directly from individual sequences |
| tFold-Ab | SAbDab (before 31 December 2021) | SOTA antibody structure prediction model. MSA-free prediction with ESM-PPI |
| tFold-Ag | SAbDab (before 31 December 2021) | SOTA antibody-antigen complex structure prediction model. Can be used for virtual screening of binding antibodies and antibody design |
| tFold-TCR | STCRDab (before 31 December 2021) | SOTA TCR-complex structure prediction model. MSA-free prediction with ESM-PPI. Can be used for TCR design |

Main Results

Unbound Antibody Prediction (SAbDab-22H2-Ab)

| Model | RMSD-CDR-H3 (Å) | DockQ |
|---|---|---|
| AlphaFold-Multimer | 3.07 | 0.773 |
| Chai-1 | 3.25 | 0.772 |
| IgFold | 3.37 | 0.715 |
| DeepAb | 3.73 | 0.721 |
| ImmuneBuilder | 3.46 | 0.749 |
| tFold-Ab | 3.01 | 0.770 |

Unbound Nanobody Prediction (SAbDab-22H2-Nano)

| Model | RMSD-CDR-H3 (Å) |
|---|---|
| AlphaFold | 3.96 |
| Chai-1 | 3.57 |
| IgFold | 4.64 |
| ImmuneBuilder | 3.79 |
| ESMFold | 3.80 |
| OmegaFold | 3.63 |
| tFold-Ab | 3.57 |

Antibody-Antigen Complex Prediction (SAbDab-22H2-AbAg)

| Model | DockQ | Success Rate (%) |
|---|---|---|
| AlphaFold-Multimer | 0.158 | 18.2 |
| AlphaFold-3 | 0.257 | 32.3 |
| tFold-Ag | 0.217 | 28.3 |

Unliganded TCR Prediction (STCRDab-22-TCR)

| Model | RMSD-CDR-A3 (Å) | RMSD-CDR-B3 (Å) | DockQ |
|---|---|---|---|
| AlphaFold-Multimer | 1.89 | 1.62 | 0.785 |
| AlphaFold-3 | 1.80 | 1.50 | 0.769 |
| TCRModel2 | 1.77 | 1.52 | 0.795 |
| tFold-TCR | 1.66 | 1.35 | 0.795 |

Unbound pMHC Prediction (STCRDab-22-pMHC)

| Model | DockQ |
|---|---|
| AlphaFold-Multimer | 0.927 |
| AlphaFold-3 | 0.926 |
| tFold-TCR | 0.908 |

TCR-pMHC Complex Prediction (STCRDab-22-TCR_pMHC)

| Model | DockQ | RMSD (Å) | Success Rate (%) |
|---|---|---|---|
| AlphaFold-Multimer | 0.490 | 3.601 | 83.3 |
| AlphaFold-3 | 0.496 | 3.094 | 72.2 |
| tFold-TCR | 0.496 | 2.413 | 94.4 |
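
The Success Rate columns report the percentage of test cases whose predicted complex counts as a success. This README does not state the cutoff, so the sketch below assumes the conventional DockQ "acceptable" threshold of 0.23. The function name and the scores are placeholders for illustration, not values from the paper.

```python
from typing import Sequence

# Assumption: a prediction is "successful" when DockQ >= 0.23, the
# conventional cutoff for "acceptable" quality. The README does not
# say which threshold the reported numbers use.
ACCEPTABLE_DOCKQ = 0.23


def success_rate(dockq_scores: Sequence[float],
                 threshold: float = ACCEPTABLE_DOCKQ) -> float:
    """Return the percentage of scores at or above the threshold."""
    if not dockq_scores:
        raise ValueError("need at least one DockQ score")
    hits = sum(1 for score in dockq_scores if score >= threshold)
    return 100.0 * hits / len(dockq_scores)


# Hypothetical per-target scores, for illustration only.
example_scores = [0.05, 0.31, 0.62, 0.10, 0.23]
example_rate = success_rate(example_scores)
print(f"{example_rate:.1f}%")
```

Passing a different `threshold` makes it easy to check how sensitive a reported rate is to the chosen cutoff.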

Installation

  1. Clone the package
git clone https://github.com/TencentAI4S/tfold.git
cd tfold
  2. Prepare the environment
  • Please follow the instructions in INSTALL.md to set up the environment
  3. Download pre-trained weights under params directory (Optional)

Note:

If you download the weights in the folder ./checkpoints, you can proceed directly with the following steps.

If you don't download the weights, the weights will be downloaded automatically when you run the code.

  4. Download sequence databases for MSA searching (only needed for tFold-Ag)

sh scripts/setup_database.sh

Dataset

  1. Test set we construct in our paper
  2. Human germline antibody frameworks library to guide antibody generation

This repository can be used in two ways: run the scripts directly from a clone, or install the package with pip.

Quick Start

You can use a fasta file (--fasta) or a json file (--json) as input.

tFold-Ab

Example 1: predicting the structure of an antibody and a nanobody using tFold-Ab

# antibody
python projects/tfold_ab/predict.py --fasta examples/fasta.files/7ox3_A_B.fasta --output examples/predictions/7ox3_A_B.pdb

# nanobody
python projects/tfold_ab/predict.py --fasta examples/fasta.files/7ocj_B.fasta --output examples/predictions/7ocj_B.pdb
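
To predict many antibodies in one go, the command above can be generated per FASTA file. The following is a minimal sketch that uses only the `--fasta` and `--output` flags shown in this section. The helper name `build_commands` and the directory `my_antibodies/` are hypothetical, and the sketch assumes it is run from the repository root.

```python
import sys
from pathlib import Path
from typing import List


def build_commands(fasta_dir: str, output_dir: str,
                   script: str = "projects/tfold_ab/predict.py") -> List[List[str]]:
    """Build one predict.py argument list per *.fasta file in fasta_dir."""
    commands = []
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        # Name each output PDB after its input FASTA file.
        pdb_path = Path(output_dir) / f"{fasta.stem}.pdb"
        commands.append([
            sys.executable, script,
            "--fasta", str(fasta),
            "--output", str(pdb_path),
        ])
    return commands


# Dry run: print the commands instead of executing them.
# "my_antibodies" is a hypothetical directory of antibody FASTA files.
for cmd in build_commands("my_antibodies", "examples/predictions"):
    print(" ".join(cmd))
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` once the printed commands look right.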

tFold-Ag

Example 1: predicting the structure of an antibody-antigen complex and a nanobody-antigen complex with a pre-computed MSA

# antibody-antigen complex
python projects/tfold_ag/predict.py --fasta examples/fasta.files/8df5_A_B_R.fasta --msa examples/msa.files/8df5_R.a3m --output examples/predictions/8df5_A_B_R.pdb

# nanobody-antigen complex
python projects/tfold_ag/predict.py --fasta examples/fasta.files/7sai_C_NA_A.fasta --msa examples/msa.files/7sai_A.a3m --output examples/predictions/7sai_C_NA_A.pdb

Example 2: Generate MSA for structure predictions using MMseqs2

python projects/tfold_ag/gen_msa.py --fasta_file=examples/fasta.files/PD-1.fasta --output_dir=examples/PD-1

Example 3: predicting the structure of an antibody-antigen complex and a nanobody-antigen complex with inter-chain features

# generate inter-chain feature (ppi)
python projects/tfold_ag/gen_icf_feat.py --pid_fpath=examples/fasta.files/8df5_A_B_R.fasta --fas_dpath=examples/fasta.files/ --pdb_dpath=examples/pdb.files.native/ --icf_dpath=examples/icf.files.ppi --icf_type=ppi

# antibody-antigen complex prediction with inter-chain feature
python projects/tfold_ag/predict.py --fasta examples/fasta.files/8df5_A_B_R.fasta --msa examples/msa.files/8df5_R.a3m --icf examples/icf.files.ppi/8df5_A_B_R.pt --output examples/predictions/8df5_A_B_R.pdb --model_version ppi

Example 4: CDR loop design with tFold-Ag using a pre-computed MSA

python projects/tfold_ag/predict.py --fasta examples/fasta.files/7urf_O_P_A.cdrh3.fasta --msa examples/msa.files/7urf_A.a3m --output examples/predictions/7urf_O_P_A.pdb

tFold-TCR

Example 1: predicting the structure of a TCR complex

# TCR
python projects/tfold_tcr/predict.py --json examples/tcr_example.json --output examples/predictions/ --model_version TCR

# pMHC complex
python projects/tfold_tcr/predict.py --json examples/pmhc_example.json --output examples/predictions/ --model_version pMHC

# Complex
python projects/tfold_tcr/predict.py --json examples/tcr_pmhc_example.json --output examples/predictions/ --model_version Complex

Quick Start with Pip Installation

Install directly from PyPI:

  pip install tfold

or install from source code:

  cd tfold
  pip install .

After pip install, you can load and use a pretrained model as follows:

Extract cross-chain information using ESM-PPI

import torch
import tfold

# Download the pre-trained model
model_path = tfold.model.esm_ppi_650m_ab()

# Load the model
model = tfold.model.PPIModel.restore(model_path)

# Prepare antibody sequences (can be single or multiple sequences)
data = [
        'QVQLVQSGAEVKKPGASVKVSCKASGYPFTSYGISWVRQAPGQGLEWMGWISTYNGNTNYAQKFQGRVTMTTDTSTTTGYMELRRLRSDDTAVYYCARDYTRGAWFGESLIGGFDNWGQGTLVTVSS', # Heavy chain
        'EIVLTQSPGTLSLSPGERATLSCRASQTVSSTSLAWYQQKPGQAPRLLIYGASSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQHDTSLTFGGGTKVEIK' # Light chain
]
ppi_output = model(data)

The output is a dictionary with the keys 'labl', 'mask', 'pred', 'sfea' and 'pfea'.

Each key in the outputs dictionary represents a different component of the model's output:

  • 'labl': Original token indices from tokn_mat_orig that represent the unmasked (original) amino acid sequences. These serve as ground truth labels for training.
  • 'mask': Binary mask tensor indicating which positions were masked during inference. It's used to identify which positions should be predicted in the masked language modeling task.
  • 'pred': The model's logits (raw prediction scores before softmax) for each position in the sequence. These are the actual predictions made by the model.
  • 'sfea': Single-residue features/embeddings extracted from the final layer representations. These are residue-level embeddings with dimension self.c_s for each amino acid position.
  • 'pfea': Pair features representing interactions between residues, derived from attention weights. These capture the relationships between each pair of residues in the sequence with dimension self.c_z.
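
Because 'pred' holds logits rather than probabilities, a softmax is needed before the scores can be read as per-residue probabilities. The sketch below shows that step in plain Python on a toy example with two positions and a three-token vocabulary. The vocabulary, the values and the mask are made up for illustration and are not the model's tokenizer or real output.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins (assumed values, not real model output):
# two sequence positions, three-token vocabulary.
toy_vocab = ["A", "C", "D"]
toy_pred = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
toy_mask = [1, 0]  # 1 marks a position the model was asked to predict

for position, (logits, masked) in enumerate(zip(toy_pred, toy_mask)):
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    tag = "masked" if masked else "visible"
    print(position, tag, toy_vocab[best], round(probs[best], 3))
```

On the real output, `torch.softmax(ppi_output['pred'], dim=-1)` should do the same job, assuming the vocabulary is the last axis of the tensor.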

Predict antibody structures with tFold-Ab

import torch
import tfold

# Download the pre-trained model
ppi_model_path = tfold.model.esm_ppi_650m_ab()
tfold_model_path = tfold.model.tfold_ab_trunk()

# Load the model
model = tfold.deploy.PLMComplexPredictor.restore_from_module(ppi_model_path, tfold_model_path)

# Prepare antibody sequences (can be single or multiple sequences)
data =[
        {
          "sequence": 'QVQLVQSGAEVKKPGASVKVSCKASGYPFTSYGISWVRQAPGQGLEWMGWISTYNGNTNYAQKFQGRVTMTTDTSTTTGYMELRRLRSDDTAVYYCARDYTRGAWFGESLIGGFDNWGQGTLVTVSS', # Heavy chain
          "id": 'H'
          },
        {
          "sequence": 'EIVLTQSPGTLSLSPGERATLSCRASQTVSSTSLAWYQQKPGQAPRLLIYGASSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQHDTSLTFGGGTKVEIK', # Light chain
          "id": 'L'
          }]
output_path = '8df5_A_B_R.pdb'

model.infer_pdb(data, output_path)
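
When the sequences already live in a FASTA file, the list of `{"id", "sequence"}` dictionaries used above can be built with a small parser. This is a sketch, not part of tfold: the helper name is hypothetical, it assumes the first word of each FASTA header is the chain id, and the sequences in the example are truncated for brevity.

```python
from typing import Dict, List


def fasta_to_chains(text: str) -> List[Dict[str, str]]:
    """Parse FASTA text into [{"id": ..., "sequence": ...}, ...]."""
    chains: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first word of the header is the chain id.
            header = line[1:].split()
            chains.append({"id": header[0] if header else "", "sequence": ""})
        elif chains:
            # Sequence lines may wrap, so append to the current chain.
            chains[-1]["sequence"] += line.upper()
        else:
            raise ValueError("sequence data found before any '>' header")
    return chains


# Truncated heavy and light chain fragments, for illustration only.
example = ">H\nQVQLVQSG\nAEVKKPGA\n>L\nEIVLTQSP\n"
data = fasta_to_chains(example)
print(data)
```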

Predict the structure of an antibody-antigen complex with tFold-Ag

import torch
import tfold

# Download the pre-trained model of ESM-PPI
ppi_model_path = tfold.model.esm_ppi_650m_ab()
# Download the pre-trained model of AlphaFold
alphafold_path  = tfold.model.alpha_fold_4_ptm()
# Download base model for tFold-Ag
tfold_model_path = tfold.model.tfold_ag_base()

# Download the ppi model for tFold-Ag
# tfold_model_path = tfold.model.tfold_ag_ppi()

# Load the model
model = tfold.deploy.AgPredictor(ppi_model_path, alphafold_path, tfold_model_path)

# Prepare antibody-antigen sequences
msa_path = 'examples/msa.files/8df5_R.a3m'
with open(msa_path) as f:
   msa, deletion_matrix = tfold.protein.parser.parse_a3m(f.read())

# if you don't have msa, you can use the following code to generate msa
#from projects.tfold_ag.gen_msa import generate_msa
#with open('8df5_R.fasta', 'w') as f:
#    f.write('>8df5_R\nMGILPSPGMPALLSLVSLLSVLLMGCVAETGTRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGNIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTHHHHHHHHGGSSGLNDIFEAQKIEWHE')
#generate_msa('8df5_R.fasta', output_dir='examples/msa.files/')
#with open('examples/msa.files/8df5_R.a3m') as f:
#   msa, deletion_matrix = tfold.protein.parser.parse_a3m(f.read())


data = [
         {
             "id": "H",
             "sequence": "QVQLVQSGAEVKKPGASVKVSCKASGYPFTSYGISWVRQAPGQGLEWMGWISTYNGNTNYAQKFQGRVTMTTDTSTTTGYMELRRLRSDDTAVYYCARDYTRGAWFGESLIGGFDNWGQGTLVTVSS"
         },
         {
             "id": "L",
             "sequence": "EIVLTQSPGTLSLSPGERATLSCRASQTVSSTSLAWYQQKPGQAPRLLIYGASSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQHDTSLTFGGGTKVEIK"
         },
         {
             "id": "A",
             "sequence": "MGILPSPGMPALLSLVSLLSVLLMGCVAETGTRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGNIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVKGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTHHHHHHHHGGSSGLNDIFEAQKIEWHE",
             "msa": msa,
             "deletion_matrix": deletion_matrix
         }
        ]

output_path = '8df5_A_B_R.pdb'

model.infer_pdb(data, output_path)
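
The antigen entry above carries an `msa` and a `deletion_matrix` parsed from an A3M file. The sketch below illustrates what those two values usually contain under the A3M convention, where lowercase letters mark insertions relative to the query. It is a standalone re-implementation for illustration. Whether `tfold.protein.parser.parse_a3m` behaves identically has not been verified here, and the name `parse_a3m_sketch` is hypothetical.

```python
import string
from typing import List, Tuple


def parse_a3m_sketch(a3m_text: str) -> Tuple[List[str], List[List[int]]]:
    """Split A3M text into aligned sequences and per-column insertion counts."""
    sequences: List[str] = []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            sequences.append("")
        elif sequences:
            sequences[-1] += line

    # Lowercase letters are insertions relative to the query: count how many
    # precede each aligned column, then drop them from the sequence.
    deletion_matrix: List[List[int]] = []
    for seq in sequences:
        counts, pending = [], 0
        for char in seq:
            if char.islower():
                pending += 1
            else:
                counts.append(pending)
                pending = 0
        deletion_matrix.append(counts)

    drop_lowercase = str.maketrans("", "", string.ascii_lowercase)
    msa = [seq.translate(drop_lowercase) for seq in sequences]
    return msa, deletion_matrix


# Toy alignment: the hit has one inserted residue ("a") and one gap ("-").
toy_a3m = ">query\nMKV\n>hit1\nMaK-\n"
toy_msa, toy_deletions = parse_a3m_sketch(toy_a3m)
print(toy_msa, toy_deletions)
```

After parsing, every row of the MSA has the same length as the query, which is what allows the rows to be stacked into a matrix.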

Predict TCR structures using tFold-TCR

import torch
import tfold

# Download the pre-trained model
ppi_model_path = tfold.model.esm_ppi_650m_tcr()
tfold_model_path = tfold.model.tfold_tcr_trunk()

# Load the model
model = tfold.deploy.TCRPredictor.restore_from_module(ppi_model_path, tfold_model_path)

# Prepare TCR sequences
data =[
            {
                "id": "B",
                "sequence": "NAGVTQTPKFQVLKTGQSMTLQCSQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASSYSIRGSRGEQFFGPGTRLTVL"
            },
            {
                "id": "A",
                "sequence": "AQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVTNQAGTALIFGKGTTLSVSS"
            }
        ]

output_path = '6zkw_E_D_A_B_C.pdb'

model.infer_pdb(data, output_path)

Predict TCR-pMHC structures using tFold-TCR

import torch
import tfold

# Download the pre-trained model
ppi_model_path = tfold.model.esm_ppi_650m_tcr()
tfold_model_path = tfold.model.tfold_tcr_pmhc_trunk()

# Load the model
model = tfold.deploy.TCRpMHCPredictor(ppi_model_path, tfold_model_path)

# Prepare TCR-pMHC sequences
data =[
            {
                "id": "B",
                "sequence": "NAGVTQTPKFQVLKTGQSMTLQCSQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASSYSIRGSRGEQFFGPGTRLTVL"
            },
            {
                "id": "A",
                "sequence": "AQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVTNQAGTALIFGKGTTLSVSS"
            },
            {
                "id": "M",
                "sequence": "GSHSLKYFHTSVSRPGRGEPRFISVGYVDDTQFVRFDNDAASPRMVPRAPWMEQEGSEYWDRETRSARDTAQIFRVNLRTLRGYYNQSEAGSHTLQWMHGCELGPDGRFLRGYEQFAYDGKDYLTLNEDLRSWTAVDTAAQISEQKSNDASEAEHQRAYLEDTCVEWLHKYLEKGKETLLHLEPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQQDGEGHTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPVTLRWKP"
            },
            {
                "id": "N",
                "sequence": "MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM"
            },
            {
                "id": "P",
                "sequence": "RLPAKAPLL"
            }
        ]

output_path = '6zkw_E_D_A_B_C.pdb'

model.infer_pdb(data, output_path)
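
With five chains in one input it is easy to repeat an id or paste a truncated sequence, so a quick check before calling `infer_pdb` can save a failed run. The helper below is a hypothetical pre-flight check and not part of tfold. It assumes that only the 20 standard amino-acid letters are expected, which may be stricter than what the model accepts.

```python
from typing import Dict, List

# Assumption: only the 20 standard amino acids are expected in the input.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_chains(chains: List[Dict[str, str]]) -> None:
    """Raise ValueError if ids repeat or a sequence looks malformed."""
    seen = set()
    for chain in chains:
        chain_id, sequence = chain["id"], chain["sequence"]
        if chain_id in seen:
            raise ValueError(f"duplicate chain id: {chain_id!r}")
        seen.add(chain_id)
        if not sequence:
            raise ValueError(f"chain {chain_id!r} has an empty sequence")
        unknown = sorted(set(sequence) - STANDARD_RESIDUES)
        if unknown:
            raise ValueError(
                f"chain {chain_id!r} has unexpected residues: {unknown}")


# The peptide chain from the example above passes silently.
check_chains([{"id": "P", "sequence": "RLPAKAPLL"}])
```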

Citing tFold

If you use tFold in your research, please cite our paper:

@article{wu2024fast,
  title={Fast and accurate modeling and design of antibody-antigen complex using tFold},
  author={Wu, Fandi and Zhao, Yu and Wu, Jiaxiang and Jiang, Biaobin and He, Bing and Huang, Longkai and Qin, Chenchen and Yang, Fan and Huang, Ningqiao and Xiao, Yang and others},
  journal={bioRxiv},
  pages={2024--02},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

and the earlier version of tFold-Ab:

@article{wu2022tfold,
  title={tFold-ab: fast and accurate antibody structure prediction without sequence homologs},
  author={Wu, Jiaxiang and Wu, Fandi and Jiang, Biaobin and Liu, Wei and Zhao, Peilin},
  journal={bioRxiv},
  pages={2022--11},
  year={2022},
  publisher={Cold Spring Harbor Laboratory}
}

A new preprint on tFold-TCR is coming soon.
