👥 Authors
- Wend Yam Donald Davy Ouedraogo & Aida Ouangraoua, CoBIUS LAB, Department of Computer Science, Faculty of Science, Université de Sherbrooke, Sherbrooke, Canada
💡 If you are using our algorithm in your research, please cite our recent paper: Ouedraogo, W. Y. D. D., & Ouangraoua, A. (2023, April). Inferring Clusters of Orthologous and Paralogous Transcripts. In RECOMB International Workshop on Comparative Genomics (pp. 19-34).
📧 Contact: [email protected]
We present an algorithm for inferring clusters of orthologous and paralogous transcripts.
python3 (at least Python 3.6)
ETE toolkit
install the package
pip3 install transcriptorthology
import the package and call the main function
from transcriptorthology.transcriptOrthology import inferring_transcripts_isoorthology

if __name__ == '__main__':
    gtot_path = './execution/mapping_gene_to_transcripts/ENSGT00390000000080.fasta'
    gt_path = './execution/NHX_trees/ENSGT00390000000080.nwk'
    lower_bound = 0.7
    transcripts_msa_path = './execution/transcripts_alignments/ENSGT00390000000080.alg'
    tsm_conditions = 2
    constraint = 1
    output_folder = './execution/output_folder'
    inferring_transcripts_isoorthology(transcripts_msa_path, gtot_path, gt_path, tsm_conditions, lower_bound, constraint, output_folder)
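Before calling the main function, it can help to check the arguments against the value formats documented for the command-line parameters: an integer from 1 to 6 for the similarity measure, a float between 0 and 1 for the lower bound, and 0 or 1 for the constraint. The helper below is only a minimal sketch; `check_parameters` is a hypothetical name and is not part of the `transcriptorthology` package.

```python
def check_parameters(tsm_conditions, lower_bound, constraint):
    """Validate argument values (hypothetical helper, not part of the package)."""
    # The similarity measure is documented as an integer from 1 to 6.
    if tsm_conditions not in (1, 2, 3, 4, 5, 6):
        raise ValueError(f"tsm_conditions must be an integer in 1..6, got {tsm_conditions!r}")
    # The lower bound is documented as a float between 0 and 1.
    if not 0.0 <= float(lower_bound) <= 1.0:
        raise ValueError(f"lower_bound must lie between 0 and 1, got {lower_bound!r}")
    # The constraint is documented as 0 (not reciprocal) or 1 (reciprocal).
    if constraint not in (0, 1):
        raise ValueError(f"constraint must be 0 or 1, got {constraint!r}")
    return True


if __name__ == '__main__':
    # Same values as in the example above.
    check_parameters(tsm_conditions=2, lower_bound=0.7, constraint=1)
```

Calling such a check first turns an out-of-range value into an immediate `ValueError` instead of a failure later in the run.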
usage: transcriptOrthology.py [-h] -talg TRALIGNMENT
program parameters
-h, --help show this help message and exit
-talg TRALIGNMENT, --tralignment TRALIGNMENT
    Multiple Sequences Alignment of transcripts in FASTA
-gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
    mappings transcripts to corresponding genes
-nhxt NHXGENETREE, --nhxgenetree NHXGENETREE
    NHX gene tree
-lowb LOWERBOUND, --lowerbound LOWERBOUND
    a threshold for the selection of transcripts RBHs
-tsm TSMVALUE, --tsmvalue TSMVALUE
    an integer(1|2|3|4|5|6) that refers to the transcript similarity measure
-const CONSTRAINT, --constraint CONSTRAINT
    an integer(0|1), constraint for the selection of recent paralogs
-outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
    the output folder to store the results
| parameter | definition | value format |
|---|---|---|
| -talg, --tralignment | MSA of transcripts | FASTA format: `>{id_transcript}\n{sequence}` |
| -gtot, --genetotranscripts | mappings g(t) | FASTA format: `>{id_transcript}:{id_gene}\n` |
| -nhxt, --nhxgenetree | gene tree | NHX format |
| -lowb, --lowerbound | lower bound used to select transcript RBHs (default: 0.5) | float between 0 and 1 |
| -tsm, --tsmvalue | the similarity measure (mean, length, unitary) | integer: 1 (tsm+unitary), 2 (tsm+length), 3 (tsm+mean), 4 (tsm++unitary), 5 (tsm++length), 6 (tsm++mean) |
| -const, --constraint | constraint for the selection of recent paralogs | integer: 0 (not reciprocal) or 1 (reciprocal) |
| -outf, --outputfolder | folder in which the results are saved (default: the current program folder) | string |
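
The gene-to-transcripts mapping file uses FASTA-style header lines of the form `>{id_transcript}:{id_gene}`. As an illustration of that format, the sketch below reads such headers into a dictionary. `parse_gene_to_transcripts` is a hypothetical helper, not a function of the package, and the identifiers in the example are made up.

```python
def parse_gene_to_transcripts(lines):
    """Map each transcript id to its gene id (hypothetical helper)."""
    transcript_to_gene = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith('>'):
            continue  # ignore blank lines and anything that is not a header
        # Split at the first ':' into transcript id and gene id.
        transcript_id, _, gene_id = line[1:].partition(':')
        if not transcript_id or not gene_id:
            raise ValueError(f"malformed header: {line!r}")
        transcript_to_gene[transcript_id] = gene_id
    return transcript_to_gene


if __name__ == '__main__':
    # Made-up identifiers, for illustration only.
    example = [">t1:g1\n", ">t2:g1\n", ">t3:g2\n"]
    print(parse_gene_to_transcripts(example))
    # prints {'t1': 'g1', 't2': 'g1', 't3': 'g2'}
```

To read a real file, pass an open file handle, for example `with open(path) as handle: parse_gene_to_transcripts(handle)`.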
Usage example
python3 ./scripts/transcriptOrthology.py -talg ./execution/inputs/transcripts_alignments/ENSGT00390000003967.alg -gtot ./execution/inputs/mapping_gene_to_transcripts/ENSGT00390000003967.fasta -nhxt ./execution/inputs/NHX_trees/ENSGT00390000003967.nhx -lowb 0.7 -outf ./execution/outputs/ -tsm 1 -const 1
sh ./execution_inferring_clusters.sh
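To run the same command for several gene families from Python, one option is to build the argument list programmatically. The sketch below assumes that every family follows the directory layout and file extensions of the usage example above; `build_command` and `run_family` are hypothetical names, and nothing here describes what `execution_inferring_clusters.sh` does.

```python
import subprocess


def build_command(family_id, lower_bound=0.7, tsm=1, constraint=1,
                  inputs='./execution/inputs', outputs='./execution/outputs/'):
    """Build the argument list for one gene family (assumed directory layout)."""
    return [
        'python3', './scripts/transcriptOrthology.py',
        '-talg', f'{inputs}/transcripts_alignments/{family_id}.alg',
        '-gtot', f'{inputs}/mapping_gene_to_transcripts/{family_id}.fasta',
        '-nhxt', f'{inputs}/NHX_trees/{family_id}.nhx',
        '-lowb', str(lower_bound),
        '-outf', outputs,
        '-tsm', str(tsm),
        '-const', str(constraint),
    ]


def run_family(family_id, **options):
    """Run the script for one family; raises CalledProcessError on failure."""
    return subprocess.run(build_command(family_id, **options), check=True)


if __name__ == '__main__':
    # Print the command instead of running it, so the sketch has no side effects.
    print(' '.join(build_command('ENSGT00390000003967')))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems if a path contains spaces.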
Expected output
++++++++++++++++Starting ....
+++++++ All data were retrieved & the representation of subtranscribed sequences of genes into blocks are available.
+++++ Computing matrix ... in progress
+++++ Computing matrix ... status: Finished without errors in 0.42296433448791504 seconds
+++++ Searching for recent-paralogs ... status: processing
+++++ Searching for recent-paralogs ... status: finished in 0.11350250244140625 seconds
+++++ Searching for RBHs ... status: processing
+++++ Searching for RBHs ... status: finished in 0.09129834175109863 seconds
+++++ Construction of the orthology graph (Adding nodes ...) ... status: processing
+++++ Construction of the orthology graph (Adding nodes ...) ... status: finished in 0.524106502532959 seconds
+++++ Searching for connected components ... status: processing
+++++ Searching for connected components ... status: finished in 0.06076645851135254 seconds
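The log mentions a search for RBHs (reciprocal best hits) that is controlled by the lower bound. As a generic illustration of that idea, and not the package's own implementation, the sketch below selects pairs of transcripts from two genes that are each other's best-scoring partner and whose score reaches the threshold. The function name, the layout of the `scores` dictionary, the inclusive comparison, and the tie-breaking by list order are all assumptions.

```python
def reciprocal_best_hits(scores, transcripts_a, transcripts_b, lower_bound=0.5):
    """Return (a, b) pairs that are mutual best hits with score >= lower_bound.

    scores maps (a, b) tuples to a similarity value (assumed layout).
    """
    if not transcripts_a or not transcripts_b:
        return []
    pairs = []
    for a in transcripts_a:
        # Best partner of a among the transcripts of the second gene.
        b = max(transcripts_b, key=lambda y: scores[(a, y)])
        # Best partner of b among the transcripts of the first gene.
        a_back = max(transcripts_a, key=lambda x: scores[(x, b)])
        if a_back == a and scores[(a, b)] >= lower_bound:
            pairs.append((a, b))
    return pairs


if __name__ == '__main__':
    # Made-up scores, for illustration only.
    scores = {('a1', 'b1'): 0.9, ('a1', 'b2'): 0.4,
              ('a2', 'b1'): 0.6, ('a2', 'b2'): 0.8}
    print(reciprocal_best_hits(scores, ['a1', 'a2'], ['b1', 'b2'], lower_bound=0.7))
    # prints [('a1', 'b1'), ('a2', 'b2')]
```

Raising the threshold to 0.85 in this toy example would keep only the pair `('a1', 'b1')`.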
Input files
- 1️⃣ tsmcomputing() ➡️ returns the similarity matrix of scores (tsm+ or tsm, depending on `tsmvalue`) for all pairs of homologous transcripts.
usage: tsmComputing.py [-h] [-talg TRALIGNMENT] [-gtot GENETOTRANSCRIPTS] [-tsm TSMVALUE] [-outf OUTPUTFOLDER]
parser program parameters
optional arguments:
-h, --help show this help message and exit
-talg TRALIGNMENT, --tralignment TRALIGNMENT
-gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
-tsm TSMVALUE, --tsmvalue TSMVALUE
-outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
- 2️⃣ Tclustering() ➡️ returns the orthology graph of transcripts.
usage: Tclustering.py [-h] [-m MATRIX] [-gtot GENETOTRANSCRIPTS] [-nhxt NHXGENETREE] [-lowb LOWERBOUND] [-outf OUTPUTFOLDER]
parser program parameters
optional arguments:
-h, --help show this help message and exit
-m MATRIX, --matrix MATRIX
-gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
-nhxt NHXGENETREE, --nhxgenetree NHXGENETREE
-lowb LOWERBOUND, --lowerbound LOWERBOUND
-const CONSTRAINT, --constraint CONSTRAINT
-outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
- 3️⃣ transcriptOrthology() ➡️ returns, for each pair of homologous transcripts, the type of their homology relationship (recent-paralogs, ortho-paralogs or ortho-orthologs).
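The expected output log lists a "Searching for connected components" step on the orthology graph. As a generic illustration of that graph operation, and not the package's own code, the sketch below groups transcripts into clusters by traversing an undirected graph given as a list of edges. The function name and the input layout are assumptions, and the identifiers are made up.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters linked by at least one path of edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    clusters = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node collects one cluster.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


if __name__ == '__main__':
    nodes = ['a1', 'a2', 'b1', 'b2', 'c1']
    edges = [('a1', 'b1'), ('b1', 'c1'), ('a2', 'b2')]
    print(connected_components(nodes, edges))
    # prints [['a1', 'b1', 'c1'], ['a2', 'b2']]
```

A transcript with no edge ends up in a cluster of its own, since every node in `nodes` is visited.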
Output files
- 1️⃣ matrix.csv : similarity matrix giving the tsm+ score for each pair of homologous transcripts.
- 2️⃣ blocks_transcripts.csv|blocks_genes : CSV files describing the block representation of each transcript (resp. gene).
- 3️⃣ start_orthology_graph.pdf|end_orthology_graph.pdf : orthology graph at the start (resp. end) of the algorithm, showing only the pairwise relationships between recent-paralogs (resp. all the orthologous clusters). ⚠️ Only generated if the number of transcripts does not exceed 20.
- 4️⃣ orthologs.csv : CSV file summarizing the results of the isoorthology clustering.
The data folder contains the datasets used in the study, along with the results obtained.
Copyright © 2023 CoBIUS LAB