SQANTI SIM: eval

jmestret edited this page Sep 13, 2022 · 1 revision

Introduction

sqanti-sim.py eval is the final step of the SQANTI-SIM pipeline. In this step, SQANTI3 is run and SQANTI-SIM generates a report evaluating the performance of the reconstruction pipeline used to identify the transcript models. This mode takes your long-read-defined transcriptome, the modified reference annotation and the reference genome used in the previous steps.

In this step, you can also provide orthogonal data, such as CAGE peaks and short-read support (which you may have used in your reconstruction pipeline), to add related metrics to the report. To evaluate quantification, you can additionally provide a file with the number of reads per reconstructed transcript.

Usage

SQANTI-SIM eval mode usage:

sqanti-sim.py eval [-h] --transcriptome TRANSCRIPTOME --gtf GTF --genome
                         GENOME -i TRANS_INDEX [-e EXPRESSION] [-o OUTPUT]
                         [-d DIR] [-c COVERAGE] [--SR_bam SR_BAM]
                         [--short_reads SHORT_READS] [--CAGE_peak CAGE_PEAK]
                         [--fasta]
                         [--aligner_choice {minimap2,deSALT,gmap,uLTRA}]
                         [--min_support MIN_SUPPORT] [-k CORES]

With the --help option you can display a complete description of the arguments:

sqanti-sim.py eval parse options

optional arguments:
  -h, --help            show this help message and exit
  --transcriptome TRANSCRIPTOME
                        Long-read-defined transcriptome reconstructed with your
                        pipeline in GTF, FASTA or FASTQ format
  --gtf GTF             Reduced reference annotation in GTF format
  --genome GENOME       Reference genome FASTA
  -i TRANS_INDEX, --trans_index TRANS_INDEX
                        File with transcript information generated with
                        SQANTI-SIM (*_index.tsv)
  -e EXPRESSION, --expression EXPRESSION
                        Expression of transcript models (file without header
                        with two columns tab-separated: first with id and
                        second with quantified number of reads, no header)
  -o OUTPUT, --output OUTPUT
                        Prefix for output files
  -d DIR, --dir DIR     Directory for output files (default: .)
  -c COVERAGE, --coverage COVERAGE
                        Junction coverage files (provide a single file, comma-
                        delimited filenames, or a file pattern, ex:
                        "mydir/*.junctions")
  --SR_bam SR_BAM       Directory or fofn file with the sorted bam files of
                        Short Reads RNA-Seq mapped against the genome
  --short_reads SHORT_READS
                        File Of File Names (fofn, space separated) with paths
                        to FASTA or FASTQ from Short-Read RNA-Seq. If
                        expression or coverage files are not provided,
                        Kallisto (just for pair-end data) and STAR,
                        respectively, will be run to calculate them.
  --CAGE_peak CAGE_PEAK
                        CAGE Peak file in BED format (example FANTOM5)
  --fasta               Use when running SQANTI-SIM by using as input a
                        FASTA/FASTQ with the sequences of isoforms
  --aligner_choice {minimap2,deSALT,gmap,uLTRA}
                        If --fasta used, choose the aligner to map your
                        isoforms
  --min_support MIN_SUPPORT
                        Minimum number of supporting reads for an isoform
  -k CORES, --cores CORES
                        Number of cores to run in parallel

Running this mode with the minimum input will look as follows:

(SQANTI-SIM.env)$ python sqanti-sim.py eval \
			--transcriptome long_read_transcriptome.gtf \
			--trans_index prefix_index.tsv \
			--gtf modified_reference_annotation.gtf \
			--genome reference_genome.fasta

Arguments detailed explanation

Required input

These are the minimum parameters you will need to run sqanti-sim.py eval:

  • Long-read transcriptome (--transcriptome): The isoforms identified with your transcript reconstruction pipeline. The transcript models can be in GTF, FASTA or FASTQ format. We recommend using GTF format. If you provide FASTA or FASTQ, you must add the --fasta argument, and you can choose the aligner with --aligner_choice.
  • Transcript index file (-i): This file is the prefix_index.tsv file generated in the previous sim step.
  • Reduced reference annotation in GTF format (--gtf): This file is the modified GTF reference annotation generated in the design step that you should have provided to your transcript reconstruction pipeline.
  • Reference genome in FASTA format (--genome): This is the reference genome in FASTA format.

Optional input

  • Expression file (-e/--expression): Two-column tab-separated file with the transcript model IDs in the first column and the number of quantified reads in the second column. Do not add a header.
  • Short reads support (--short_reads): File Of File Names (fofn, space-separated) with paths to FASTA or FASTQ from Short-Read RNA-Seq.
  • CAGE Peak data (--CAGE_peak): CAGE Peak file in BED format (for example, FANTOM5).
  • Output prefix (-o): The prefix for the output files. SQANTI-SIM will use "sqanti-sim" as the default prefix.
  • Output directory(-d): Output directory for output files. SQANTI-SIM will use the directory where the script was run as the default output directory.
  • Supporting reads (--min_support): Minimum number of supporting reads for a transcript model to be identified as a new isoform by your transcript reconstruction pipeline. This parameter doesn't affect the results; it just generates some extra metrics in the report according to your pipeline limitations/thresholds.
  • Parallelization (-k): Number of cores to run in parallel. Most SQANTI-SIM modes have code chunks that can be run in parallel. However, by default SQANTI-SIM runs in a single thread.
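
The expression file layout described above (two tab-separated columns, no header) can be produced and checked with a few lines of Python. This is only a minimal sketch: the helper names write_expression_file and read_expression_file, the transcript IDs and the counts are hypothetical and are not part of SQANTI-SIM.

```python
import csv
import os
import tempfile


def write_expression_file(counts, path):
    """Write {transcript_id: read_count} as a headerless two-column TSV."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for transcript_id, n_reads in counts.items():
            writer.writerow([transcript_id, n_reads])


def read_expression_file(path):
    """Read the file back, checking that every row has exactly two columns."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2:
                raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
            counts[row[0]] = float(row[1])
    return counts


if __name__ == "__main__":
    # Hypothetical transcript model IDs and quantified read counts.
    example = {"transcript_1": 12, "transcript_2": 3}
    path = os.path.join(tempfile.mkdtemp(), "expression.tsv")
    write_expression_file(example, path)
    print(read_expression_file(path))
```

The resulting file is what you would pass to -e/--expression. Reading it back before running eval is a cheap way to catch an accidental header line or a wrong delimiter.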

Output explanation

prefix_SQANTI-SIM_report.html

This is the main output file of the SQANTI-SIM pipeline. This HTML report gathers all the performance metrics of your transcript identification pipeline.

  • Total transcripts: All simulated transcripts.
  • True Positives (TP): Reconstructed transcript models that match all the splice junctions of the true reference transcript and whose TSS and TTS differ from it by less than 50 bp.
  • Partial True Positives (PTP): Reconstructed transcript models that match all the splice junctions of the true reference transcript, but whose TSS or TTS is too far from the annotated reference to count as a TP.
  • False Positives (FP): Transcripts that were detected but weren't simulated.
  • False Negatives (FN): Transcripts that were simulated but not detected.
  • Precision: TP / (TP + FP)
  • Sensitivity: TP / (TP + FN)
  • F-score: 2 * ((Precision * Sensitivity) / (Precision + Sensitivity))
  • Positive Detection Rate: (TP + PTP) / (TP + FN)
  • False Discovery Rate: (FP + PTP) / (TP + FP)
  • False Detection Rate: FP / (TP + FP)
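
To make the formulas above concrete, the following sketch computes them from the four counts. It is an illustration of the definitions as listed here, not SQANTI-SIM's own code; the function name performance_metrics and the choice to return NaN for an empty denominator are assumptions.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def performance_metrics(tp, ptp, fp, fn):
    """Compute the report metrics from TP, PTP, FP and FN counts."""
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "f_score": 2 * _ratio(precision * sensitivity, precision + sensitivity),
        "positive_detection_rate": _ratio(tp + ptp, tp + fn),
        "false_discovery_rate": _ratio(fp + ptp, tp + fp),
        "false_detection_rate": _ratio(fp, tp + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in performance_metrics(tp=80, ptp=10, fp=20, fn=20).items():
        print(f"{name}\t{value:.3f}")
```

Note that PTP appears only in the numerators of the Positive Detection Rate and the False Discovery Rate, so those two metrics are the ones that change when imprecise transcript ends are counted.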

prefix_index.tsv

A new column called "pipeline_performance" is added to the transcript index file. Its value is "FN" if the transcript was simulated but not detected, the transcript id given by the reconstruction pipeline if the transcript was simulated and detected, and "absent" if the transcript was not simulated (sim_counts <= 0).
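
As a rough illustration of how this column can be used, the sketch below tallies the three cases. It assumes the index file is tab-separated with a header row that names the "pipeline_performance" column; the function name summarize_pipeline_performance, the "transcript_id" column and the example rows are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_pipeline_performance(handle):
    """Count simulated-and-detected, undetected and non-simulated transcripts."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        status = row["pipeline_performance"]
        if status == "FN":
            counts["simulated_not_detected"] += 1
        elif status == "absent":
            counts["not_simulated"] += 1
        else:
            # Any other value is the id assigned by the reconstruction pipeline.
            counts["simulated_detected"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical excerpt of an index file; real files contain more columns.
    example = io.StringIO(
        "transcript_id\tsim_counts\tpipeline_performance\n"
        "tx_A\t25\tPB.1.1\n"
        "tx_B\t4\tFN\n"
        "tx_C\t0\tabsent\n"
    )
    print(dict(summarize_pipeline_performance(example)))
```

With a real run, you would open prefix_index.tsv and pass the file handle instead of the in-memory example.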

SQANTI-SIM_metrics{_min_supp}.tsv

Tab-separated file with a table containing the performance metrics computed by SQANTI-SIM, both for all the transcripts and for those with more simulated reads than --min_support.

SQANTI3 output files

This is a directory with all the SQANTI3 output (/sqanti3/). A detailed explanation of its output can be found in the SQANTI3 wiki.