DE-kupl annotation is part of the DE-kupl package, and performs annotations of DE contigs identified by DE-kupl.
- bash (version >= 4.3.46)
- awk (version >= 4.1.3)
- R (version >= version 3.2.3) with libraries foreach, doParallel, GenomicFeatures & DESeq2
- bedtools (versions 2.24, 2.26 ; don't use version 2.25 !!)
- Blast (version >= 2.5.0+)
- GSNAP (version >= 2016-11-07)
- samtools (version >= 1.3)
Note: needs sufficient memory to index the genome (as much as gmap_build needs, a few tens of GB's).
Usage: ./getContigsAnnotation.sh < Required arguments > [Optional arguments]
Required arguments :
-a <merged-diff-counts.tsv.gz (contigs in "{A}_vs_{B}_kmer_counts" directory from Dekupl-run result)>
-g <genome in fasta (uncompressed). Use the same version as the annotation to avoid chromosome name issues>
-d <{A}vs{B}-DEGs.tsv (diff. genes in "gene_expression" directory from Dekupl-run result)>
-r <reference annotation (gff3 format, uncompressed). Use the same version as the genome file to avoid chromosome name issues>
-t <are the reads stranded ? (choose "yes" or "no")>
-e <normalized_counts.tsv (normalized gene counts in "gene_expression" directory from Dekupl-run result)>
-f <sample_conditions.tsv (design file in "metadata" directory from Dekupl-run result)>
-o <path to output directory>
-i <illumina adapters (you can use the file "adapters.fa" supplied with the program)>
Optional arguments :
-b <path to bin/ of blast scripts (default : in $PATH environment variable)>
-c <path to bedtools, preferentially 2.24 (default : in $PATH environment variable)>
-j <GSNAP genome index name (if the index of the genome has already been created, supply its name. Otherwise, the index is re-created with the name "genome_index")>
-k <path to directory of GSNAP genome index (if you have a former run, you can supply the full path of the \"temp_dir/mapping_output/\" inside your former output directory, in order to re-use the same genome index and save time)>
-m <path to bin/ of GSNAP (default : in $PATH environment variable)>
-p <padj diff. gene threshold (default : 0.05)>
-q <contig color (choose 1 or 2 ; default : 1) >
1 : contigs on forward strand are in red (contigs on reverse strand are in blue)
2 : contigs on forward strand are in blue (contigs on reverse strand are in red)
-s <path to samtools (default : in $PATH environment variable)>
-n <thread number (default : 1)>
-
After the downloading of
dekupl-annotation
, makechmod 755 *.{sh,R}
into the directory to avoid rights issues. -
Use the same genome & annotation files version as the one of the file
transcripts.fa
used by Kallisto in Dekupl-run (e.g by default it's gencode 24, if it has changed, you should take this into account here, to avoid incoherences in the results because of non-maching IDs).
-
Table
DiffContigsInfos.tsv
, summarizing for each contig, its location on the genome (if it's aligned), the neighborhood, the sequence alignment informations, and the differential expression informations. -
BED
file diff_contigs.bed
for the visualization ; it contains useful informations from the summarization table. -
Table
ContigsPerLoci.tsv
(only for stranded data for now) containing loci with differentially expressed contigs.
-
DiffContigsInfos.tsv
: summary of all supplied contigs, no matter that they are aligned or not. The filecontigs_ontology.ods
supplied the used ontology. The column "TERM" in this last file, gives all the columns added by the annotation in the summarizing tableDiffContigsInfos.tsv
. -
diff_contigs.bed
: BED12 file of aligned contigs with GSNAP/Blast, with strand-specific color (red : strand + ; blue : strand -), with intensity scaled on the abs(log2FC). -
ContigsPerLoci.tsv
: contigs grouped by loci (genic, antisense, intergenic, unmapped). The locus ID for a genic/antisense locus is the Ensembl ID followed by the strand (separated by "&"). For an intergenic locus, we have the concatenation of : chromosome, strand, 5'-gene, 3'-gene (separated by "&").
-
Contigs from
merged-diff-counts.tsv.gz
are converted in fasta. -
Contigs which align on the adapters are filtered out.
-
The clean contigs are mapped on the genome with GSNAP.
-
The BAM file is parsed to obtain a primary BED12 file (
diff_contigs.bed
), and a table summarizing the alignment (DiffContigsInfos.tsv
). This step is parallelized. -
Unmapped contigs with GSNAP are aligned on the genome with Blast. Those with full length alignment are pushed in the BED12 and the summarizing table.
-
The BED12 file is enriched with infos from merged-diff-counts.tsv.
-
Unaligned contigs are pushed in the summarizing table, with the appropriates values "NA", "F" or "none".
-
The summarizing table is enriched with neighborhood of the contigs (antisense, closest genes, etc).
-
From the summarizing table, a table of the contigs grouped by loci (genic, antisense, intergenic, unmapped) is built (
ContigsPerLoci.tsv
).
-
If Blast data bases for the human genome and the adapters are already created, these steps will be avoided (by the script blast.sh).
-
The Blast of unmapped contigs on the genome (after the mapping with GSNAP) can take long time even though it is parallelized ; in script blast.sh, you can split the input file in many chuncks, and run blast on each sub-file (with xargs) to speed up.