exRNA Biomarker Discovery for Liquid Biopsy
Note:
- The exSEEK program starts from a data matrix of gene expression (read counts of each gene in each sample) and performs normalization, feature selection and evaluation.
- Meanwhile, we provide some pipelines and QC steps for the pre-process of exRNA-seq (including long and short cfRNA-seq/exoRNA-seq) raw data.
- We also recommend other alternatives for the pre-process, such as exceRpt, that is specifically developed for the process of exRNA-seq raw reads.
Table of Contents:
For easy installation, you can use the exSEEK image of docker with all dependencies installed:
docker pull ltbyshi/exseek
Alternatively, you can use use singularity or udocker to run the container for Linux kernel < 3 or if you don't have permission to use docker.
Run the main program exseek.py
from docker:
docker run --rm -it -v $PWD:/workspace -w /workspace ltbyshi/exseek exseek.py
The exSEEK directory was cloned to /apps/exseek in the docker.
You can create a bash script named exseek
and set the script executable:
#! /bin/bash
docker run --rm -it -v $PWD:/workspace -w /workspace ltbyshi/exseek exseek.py "$@"
After adding the file to one of the directory in the $PATH
variable, you can simply run: exseek
.
The basic usage of exSEEK is:
exseek ${step_name} -d ${dataset}
Note:
- Other arguments are passed to snakemake
- Specify number of processes to run in parallel with -j
${step_name}
is one ofnormalization
andcross_validation
.${dataset}
is the name of your dataset that should match the prefix of your configuration file described in the following section.
An example can be found in example_data
directory with the following structure:
example_data/
├── config
│ └── example.yaml
├── data
│ └── example
│ ├── batch_info.txt
│ ├── compare_groups.yaml
│ ├── sample_classes.txt
│ └── sample_ids.txt
└── output
└── example
└── count_matrix
└── mirna_and_domains_rna.txt
Note:
config/example.yaml
: configuration filedata/example/batch_info.txt
: table of batch informationdata/example/compare_groups.yaml
: configuration file for definition of positive and negative samplesdata/example/sample_classes.txt
: table of sample labelsoutput/example/count_matrix/mirna_and_domains_rna.txt
: input matrix of read counts
You can create your own data directory with the above directory structure. Multiple datasets can be put in the same directory by replacing "example" with your own dataset names.
More information about input and output files can be found on File Format page.
Run:
exseek normalization -d ${dataset}
This will generate normalized expression matrix for every combination of methods with the following file name pattern:
output/${dataset}/matrix_processing/filter.${imputation_method}.Norm_${normalization_method}.Batch_${batch_removal_method}_${batch_index}.${count_method}.txt
You can specify normalization methods by setting the value of normalization_method
and the batch removal methods
by setting the value of batch_removal_method
in in config/${dataset}.yaml
.
Supported normalization methods: TMM, RLE, CPM, CPM_top, UQ, null
Supported batch removal methods: limma, ComBat, RUV, null
When the method name is set to "null", the step is skipped.
${batch_index}
is the column number (start from 1) in config/${dataset}/batch_info.txt
to be used to remove batch effects.
Run:
exseek feature_selection -d ${dataset}
This will evaluate all combinations of feature selection methods and classifiers by cross-validation.
Three summary files will be generated:
output/${dataset}/summary/cross_validation/metrics.test.txt
output/${dataset}/summary/cross_validation/metrics.train.txt
output/${dataset}/summary/cross_validation/feature_stability.txt
Cross-validation results and trained models for individual combinations are in this directory:
output/${dataset}/feature_selection/filter.${imputation_method}.Norm_${normalization_method}.Batch_${batch_removal_method}_${batch_index}.${count_method}/${compare_group}/${classifier}.${n_select}.${selector}.${fold_change_filter_direction}
Selected list of features are in features.txt
.
Note: More information about output files can be found on File format page. Detailed parameters of feature selection and classifiers can be found in config/machine_learning.yaml.
- Click here to see details
Copyright (C) 2019 Tsinghua University, Beijing, China
This program is licensed with commercial restriction use license. Please see the LICENSE file for details.
Binbin Shi, Jingyi Cao, Xupeng Chen and Zhi John Lu (2019) exSEEK: an integrative computational framework for identifying extracellular RNA biomarkers in liquid biopsy