This repo contains 6 DNA tasks and scripts for quick experiments. See the manuscript for full details.
Note: We strongly recommend browsing the overall structure of our code first. If you have any questions, feel free to contact us.
In this framework, we support transformer-based and convolution-based models. You can change model hyper-parameters simply by modifying the model config. Examples are listed in the `./config` folder.
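For illustration only, a transformer config might contain entries like the following. Every key and value here is a hypothetical placeholder; check `./config` for the actual field names:

```
{
    "model_args": {
        "model_class": "transformer",  # hypothetical; a convolution-based model would swap this
        "hidden_dim": 512,             # hypothetical embedding size
        "num_layers": 8,               # hypothetical number of transformer blocks
        "num_heads": 8                 # hypothetical number of attention heads
    }
}
```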
| Name | Objective | Input length | Output | Main Metric |
|---|---|---|---|---|
| promoter | Promoter | 500 | Probability, 1 | AUROC |
| methyl96 | Methyl Probability | 1+500 (flank) | Probability, 96 | SpearmanR |
| track7878 | TF/DNase/Histone | 200+800 (flank) | Probability, 7878 | AUPR |
| expression218 | Gene Expression | 200+40800 (flank) | log RPKM, 218 | SpearmanR |
| snp49 | Causal SNP | 200+800 (flank) | Probability | AUPR |
| mpra10 | SNP Effect | 600 | log variant expression | SpearmanR |
More detailed descriptions are in `./data/$dataset/metadata.json`. The preprocessing pipelines are in `./preprocess`.
```bash
pip install torch
pip install -r requirements.txt
```
Download pre-trained models from the following links.
- Unsupervised Pretrained Model on HS1
- Pretrained Model on Track7878
All the datasets are processed from open resources. The download and preprocessing scripts are listed in `./preprocess`. Run the scripts to generate the data in your local environment. You can also download the data from this link.
The promoter dataset is an example; it requires `hg38.fa` in `./data/genome`.
Run an experiment with the default settings:

```bash
python run_task.py --dataset-dir [directory_in_data_root] \
    --save-dir [directory_to_save_experiment]
```
For example, you can train a promoter model from scratch with:

```bash
python run_task.py --dataset-dir promoter --save-dir experiment/promoter_default
```
Run `python run_task.py -h` to check all the arguments. More examples are in the `./scripts` folder.
We provide an easy-to-use customized data pipeline. If you want to run experiments on your own dataset, organize your files as follows:
- data_root
  - customized_data
    - metadata.json
    - train.json
    - ...
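A minimal sketch for creating this layout with Python (the names `data_root` and `customized_data` follow the tree above; adjust them to your setup):

```python
from pathlib import Path

# Create the expected dataset directory under the data root.
dataset_dir = Path("data_root/customized_data")
dataset_dir.mkdir(parents=True, exist_ok=True)
# metadata.json and train.json are then written into dataset_dir
# (see the sketches below).
```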
In `metadata.json`, you must specify fields like the following (check more examples in the `./data` folder):
```
{
    "dataset_name": ...,
    "dataset_args": {
        "dataset_class": "DNATaskDataset",
        "ref_file": "genome/hg19.fa",  # if not needed, use null
        "train_file": "customized_data/train.json",  # pass train_file to make the trainer train the model
        "valid_file": ...,  # pass valid_file to make the trainer evaluate the model after each training epoch
        ...
    },
    "model_args": {
        "task": "ModelForSequenceTask",
        "final_dim": ...,
        "loss_fn": {
            "name": ...,
        }
    },
    "metrics": ...,  # the first metric is the main score used to save the best model
}
```
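As a concrete illustration, the following sketch writes a minimal `metadata.json` for a hypothetical binary-classification dataset. Only the field names come from the template above; every concrete value (loss name, metric name, paths) is an assumption, so adapt them to your task:

```python
import json

# Hypothetical metadata for a binary-classification dataset.
metadata = {
    "dataset_name": "customized_data",
    "dataset_args": {
        "dataset_class": "DNATaskDataset",
        "ref_file": None,  # serialized as JSON null; set e.g. "genome/hg38.fa" for indexed samples
        "train_file": "customized_data/train.json",
        "valid_file": "customized_data/valid.json",
    },
    "model_args": {
        "task": "ModelForSequenceTask",
        "final_dim": 1,               # one output probability (assumption)
        "loss_fn": {"name": "bce"},   # loss name is an assumption
    },
    "metrics": ["auroc"],  # first metric selects the best checkpoint
}

with open("data_root/customized_data/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```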
We support data files in JSON or HDF5 format. In JSON, a sample is structured like:

```json
{"sequence": "ATGGCTC", "label": [1, 0]}
```

or

```json
{"index": ["chr1", 0, 7, "+"], "label": [1, 0]}
```
In HDF5 (for large-scale dataset storage), a sample is structured with two fields, `index` and `label`:
```
index: np.array([1, 0, 7, 1])  # (chr_num, start_pos, end_pos(exclusive), is_forward)
label: np.array([1, 0])
```
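A minimal sketch of building such a file with h5py; the dataset names `index` and `label` follow the fields above, while the shapes, dtypes, and file path are assumptions:

```python
import h5py
import numpy as np

# Each row of index is (chr_num, start_pos, end_pos(exclusive), is_forward).
index = np.array([
    [1, 0, 7, 1],
    [2, 100, 300, 0],
], dtype=np.int64)
# One label vector per sample.
label = np.array([
    [1, 0],
    [0, 1],
], dtype=np.int64)

with h5py.File("data_root/customized_data/train.h5", "w") as f:
    f.create_dataset("index", data=index)
    f.create_dataset("label", data=label)
```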
We support two modes of visualization. You can get attention-rollout scores without any modification to the model; however, this method sometimes does not perform well (see `./visual_result/without_tscam`).
We use TS-CAM to enhance visualization in transformer-based models. To take advantage of it, pass `--tscam` when training the model on a given dataset; the model will then provide informative class-specific visualization results. See the example code in `visualize.py`:

```bash
python visualize.py --load-dir experiment/promoter --save-dir visual_result/promoter
```
Use the `model.predict` method to run inference for short-sequence tasks (<1024 bp) on your own data. For example:
```python
# Example for promoter detection.
# Run by 'python $file --load-dir experiment/promoter'.
from tools import get_config
from models import get_model

config = get_config()
model = get_model(config)
sequences = ['ATTCATCCAACTCTCCGTGAGCTCCCCTGGGTAGGAGTACAGTGGCAGCCAGTGTCCCCAGAAAACTGGCGCCTCCCCCCTCGCCGTGCGGGGCTAATTAACTCTTAGCCGGCGGGACCCTCCTCCTCCTCGGAGGTTGGCCAGGAGCAGCGCGGCATCCCAGGCGTTCCTGTCTGATGTCATAGGCTGCCGGCGATTGCGGAGAATCGCCACCACGCCTTTATGAAGGTCCCAACTTTGCCATCTGATACCCTTTACTACTGACAGGCGCTCAGCCAATCAGGAGCGGCGAGCGGGGTCTGGGGACCCGGAGCCGCCGAAGCCGTCTCGGGAACCGGCTCTTAACTCTTTGCGGCGGGCCCCGCAGCCGCCGAGGCACAGAGGGCGGGAGCAGGGCCAGGGGTCGGGAATCTGGGAGAGGGGCGCGAGCTAAAGAGCGGATGCCCGGAGGAAAGAAGGAAGGGCTGCGACGCCGCGGGGCTTGCAGGTGGTTCGCGGGG',
             'ATGAAATACACATAAAAAACACACACATTAAATATTAATATATGCTTATTATTGTATTATGAATGAGGAAATAAAATATAACTTGGAATTTTTTTAAAACTTAAAAAAATACAATGGACTGAGCACTGAAATCAGAATATGCAGCTTATTTAGAACAAAATTCTACTTTTTCCCCTAAACTGTCCCTTAACATTGTCATCTCTCCTGCTAATCCTGCATTACCCTGGATCCTTCCTTTTTGTCTCTGCCTCCACTCACTGCTGCCTCTGCCATAAGCCTTCATACTCCAGCTGCTACACACTGCTGCTTCTATCCCTGAGGATTCCACGAGCATCCTTATTCTTCTGTCACTGATATGGTTCCTATTGGCATATCAAAAGTTATAGCCATATGAAGAAAAATCTAGGGATGCAGCAGCAGCAGCAGCAGTAGCAGTAGCAGCAACAGTCTATCAAGATGTTTTAATCTGGAATAAATTTCAGAATAGATCAATTCAGCAT'
             ]  # 2 positive samples
model_output = model.predict(sequences)
print(model_output.logits_or_values)
```
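If `logits_or_values` holds raw logits for the promoter task (an assumption; depending on the model head it may already contain probabilities), the class probability can be recovered with a sigmoid:

```python
import torch

# Assumes model_output.logits_or_values is a torch.Tensor of raw logits.
probs = torch.sigmoid(model_output.logits_or_values)
print(probs)  # values near 1 are expected for the two positive samples
```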
For the expression task, see the example in `predict_expression.py`.