10. Annotation
The annotation part of GENERA permits protein prediction for prokaryotes and eukaryotes.
Three different tools can be used for annotation: Prodigal for prokaryotes, AMAW and Braker for eukaryotes.
These tools can be run directly from their containers, without Nextflow, with the exception of Braker.
$ singularity exec /scratch/ulg/GENERA/prodigal-2.6.3.sif prodigal -h
prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type]
[-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file]
[-p mode] [-q] [-s start_file] [-t training_file] [-v]
-a: Write protein translations to the selected file.
-c: Closed ends. Do not allow genes to run off edges.
-d: Write nucleotide sequences of genes to the selected file.
-f: Select output format (gbk, gff, or sco). Default is gbk.
-g: Specify a translation table to use (default 11).
-h: Print help menu and exit.
-i: Specify FASTA/Genbank input file (default reads from stdin).
-m: Treat runs of N as masked sequence; don't build genes across them.
-n: Bypass Shine-Dalgarno trainer and force a full motif scan.
-o: Specify output file (default writes to stdout).
-p: Select procedure (single or meta). Default is single.
-q: Run quietly (suppress normal stderr output).
-s: Write all potential genes (with scores) to the selected file.
-t: Write a training file (if none exists); otherwise, read and use
the specified training file.
-v: Print version number and exit.
$ singularity exec /scratch/ulg/GENERA/amaw.sif amaw.pl --help
Usage:
amaw.pl --genome <file> --organism <Genus_species> --taxdir <dir> [options]
amaw.pl --help
amaw.pl --man
amaw.pl --usage
amaw.pl --version
Required arguments:
--genome [=] <file>
Path to your genomic data.
--organism [=] <Genus_species>
Name of your organism with the format Genus_species or
Genus_species_strain (for the augustus gene model name).
--taxdir [=] <dir>
Path to local mirror of the NCBI Taxonomy database,IF YOU USE
--proteins, THIS OPTION IS MANDATORY.
Options:
--sge
Use SGE job templating
--version3
Enables additional MAKER3 options.
--mpi
Uses MPI parallelization.
--queue [=] <queue>
bignode.q, smallnodes.q or gpunode.q. [default: smallnodes.q]
--email [=] <email>
Type your email address to get informations with the progress of the
different steps.
--jobname [=] <jobname>
basename for your jobs, by default it will be your organism namewith
a unique id.
--singularity
This option handles the Singularity container way to use the folder
paths when launching the different steps of AMAW.
--org-type [=] <org>
Eukaryotic or prokaryotic [Default: eukaryotic].
--est [=] <n>
Use of EST/RNA-seq evidence (boolean switch: 0/1) [default: 0].
--est-file [=] <file>
Use of a selected EST/RNA-seq file instead of downloading and
assembling SRA data from NCBI.
--max-storage [=] <int>
Maximal experiment size (in Gb) to download [default: 10].
--max-experiments [=] <int>
Maximal number of experiments to be downloaded (precedes
--max-storage). By default, this option is not applied.
--sra-list [=] <string>
Comma-separated string containing all RNA-Seq SRA accession numbers.
Currently, only Illumina paired-end read datasets are allowed. This
option disables the search of available organism-specific SRAs.
--transcript-db [=] <path>
Alternative way to use automatically transcript assemblies: select
the path to a folder where you put all your transcripts with this
nomenclature:
Genus_species*.fasta.
If no transcript file of your database matched your organism, SRA
search will be executed.
--proteins [=] <n>
Use of protein evidence (boolean switch: 0/1), if activated,
--taxdir option is needed. [default: 1]
--protein-file [=] <file>
Use of a selected protein file instead of using the default protein
database.
--outdir [=] <dirname>
Name of the output directory containing the results. [default:
amaw_output]
--maker-cpus [=] <n>
Number of cpus to use for maker runs. [default: 25]
--trinity-cpus [=] <n>
Number of cpus to use for Trinity. By default, it will calculate the
number of cpus to use in function of the number of reads.
--trinity-memory [=] <n>
Number of GB RAM of memory to use for Trinity. By default, this
value is calculated in function of number of reads (1GB of memory by
million of paired-end reads).
By default, this calculated value is restricted to 50 GB, provide a
value with this option to override this limit.
--rsem-cpus [=] <n>
Number of cpus to use for rsem. By default, it will calculate the
number of CPUs to use in function of the number of reads.
--augustus-prefix [=] <name>
Prefix to recognize Augustus gene models create for a user or a
project. By default, none.
--augustus-gm [=] <string>
Name of an already existing augustus gene model (see the list on the
augustus config/species/ folder).
This option only launches one MAKER run directly using the augustus
gene model.
--gm-db
Enables the use of previous gene models avaible for snap and
augustus (needs --snap-db and --augustus-db options).
This option will only launch one MAKER run directly using the SNAP
and augustus gene models (named as the --organism option in the
folders given with --snap-db and augustus-db).
--snap-db [=] <dir>
Path to snap gene model database (e.g. /home/snap-db/).
--augustus-db [=] <dir>
Path to augustus gene model database (e.g. /media/#
/apps/augustus-3.2.2/config/species/).
--augustus-new
This option removes the previous augustus gene model. For this give,
the programs need the directory with gene models (Needs
--augustus-db option).
--masking [=] <n>
Mask repetitive elements in the genome (strongly recommanded if the
genome sequences are not masked) (boolean switch: 0/1) [default: 1].
--soft-mask [=] <n>
Use soft-masking rather than hard-masking in BLAST (boolean switch:
0/1) [default: 1].
--split-length [=] <length>
Length for dividing up contigs into chunks (increases/decreases
memory usage) [default: 100000].
--pred-flank [=] <n>
Flank for extending evidence clusters sent to gene predictors
--min-intron [=] <n>
Minimum intron length (used for alignment polishing) [default: 20].
--pred-stats [=] <n>
Report AED and QI statistics for all predictions as well as models
(boolean switch: 0/1) [default: 1].
--blast-type [=] <type>
Set to 'ncbi+', 'ncbi' or 'wublast'.
--blastn-cov [=] <percent>
Blastn Percent Coverage Threshold EST-Genome Alignments [default:
0.8].
--blastn-id [=] <percent>
Blastn Percent Identity Threshold EST-Genome Alignments [default:
0.85].
--blastn-eval [=] <eval>
Blastn e-value cutoff [default: 1e-10].
--blastn-depth [=] <n>
Number of BLAST alignments, per query, to be used for annotation.
Low values decreases the memory use and the runtime for large
evidence datasets [default: 0].
--blastx-cov [=] <percent>
Blastx Percent Coverage Threhold Protein-Genome Alignments [default:
0.5].
--blastx-id [=] <percent>
Blastx Percent Identity Threhold Protein-Genome Alignments [default:
0.4].
--blastx-eval [=] <eval>
Blastx e-value cutoff [default: 1e-06].
--blastx-depth [=] <n>
Number of BLAST alignments, per query, to be used for annotation.
Low values decreases the memory use and the runtime for large
evidence datasets [default: 0].
--prot-dbs [=] <directory_path>
Path to the protein databases for the different eukaryotic clades.
--version
--usage
--help
--man
print the usual manual
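The default Trinity memory rule stated in the help text above (1 GB per million paired-end reads, capped at 50 GB unless --trinity-memory is given) can be worked through with a few lines of shell. This is only an illustration of the arithmetic, not AMAW's own code: the function name trinity_default_mem_gb is hypothetical, and rounding up to the next whole GB is an assumption.

```shell
# Hypothetical helper (not part of AMAW): estimate the default Trinity memory
# in GB from a number of paired-end reads.
trinity_default_mem_gb() {
    reads=$1
    # 1 GB per million reads, rounded up (the rounding is an assumption).
    gb=$(( (reads + 999999) / 1000000 ))
    # Default cap of 50 GB, as described in the help text.
    if [ "$gb" -gt 50 ]; then
        gb=50
    fi
    echo "$gb"
}

trinity_default_mem_gb 20000000
```

For a dataset above 50 million reads, the estimate stays at the cap, which is the situation where passing --trinity-memory explicitly may be worth considering.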
The typical command for running the pipeline is as follows:
nextflow run Braker.nf --genome=genome.fna --prot=fungi --SRA=none --brakermode=prot --cpu=20 --currentpath=<PWD>
Mandatory arguments:
--genome Specify genome
--currentpath Specify your current full path (as obtenied by pwd), for TMPDIR
Optional arguments:
--brakermode Specify the mode of braker, with RNAseq + proteins (default = rnaseq) or with protein file only (= prot)
--SRA Specify rnaseq SRA list file, default = none
--prot Specify which prot file to use: fungi, protozoa, plants or test, default = fungi
--cpu number of cpus to use, default = 1
The user should provide a prokaryote genome; Prodigal will predict its proteins and coding sequences.
$ singularity exec /scratch/ulg/GENERA/prodigal-2.6.3.sif prodigal -i GCF_000007145.1.fna -o GCF_000007145.1.out \
-a GCF_000007145.1.faa -d GCF_000007145.1.genes.fna
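When several prokaryote genomes have to be annotated, the Prodigal command above can be generated once per file. The following is a minimal sketch, not part of GENERA: the function name prodigal_cmd is hypothetical, and the loop only prints the commands (a dry run) so that they can be inspected before being executed.

```shell
# Hypothetical helper (not part of GENERA): print the Prodigal command for one
# genome, deriving the output names from the input file name.
prodigal_cmd() {
    genome=$1
    base=${genome%.fna}    # strip the .fna extension
    printf '%s\n' "singularity exec /scratch/ulg/GENERA/prodigal-2.6.3.sif prodigal -i $genome -o $base.out -a $base.faa -d $base.genes.fna"
}

# Dry run: print one command per .fna file found in the current directory.
for f in *.fna; do
    [ -e "$f" ] || continue    # nothing to do when no .fna file is present
    prodigal_cmd "$f"
done
```

Once the printed commands look right, the same output can be redirected to a file and run as a script.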
The user should provide a eukaryote genome, an organism name (in the format Genus_species), which will be used to search for RNAseq data on the NCBI, and optionally a gene model for Augustus. Please see points 3.2.1 and 3.2.2 for information on the organism name and the Augustus gene model.
$ singularity exec /scratch/ulg/GENERA/amaw.sif amaw.pl --genome=/mnt/<FIELD2> --organism=<FIELD3> \
--proteins=1 --est=1 --taxdir=/temp/taxdump/ --maker-cpus=20 --trinity-cpus=20 --rsem-cpus=20 \
--outdir=/mnt/GENERA-annotation --prot-dbs=/temp/prot_dbs/
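Because the organism name drives both the SRA search and the Augustus gene model name, it can be worth checking its format before launching a long run. The sketch below is an assumption-laden example: the function name is_valid_organism is hypothetical, and the pattern (capitalised genus, lowercase species, optional strain suffix) is deliberately strict and may need relaxing for unusual names.

```shell
# Hypothetical check (not part of AMAW): succeed when the name looks like
# Genus_species or Genus_species_strain.
is_valid_organism() {
    # The pattern is an assumption about what a well-formed name looks like.
    printf '%s\n' "$1" | grep -Eq '^[A-Z][a-z]+_[a-z]+(_[A-Za-z0-9.-]+)?$'
}

if is_valid_organism "Uroleptopsis_citrina"; then
    echo "organism name looks well formed"
fi
```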
AMAW automatically downloads RNAseq data from the SRA portal of the NCBI. To check that RNAseq experiments are available for your organism, a simple query can be run on the NCBI:
https://www.ncbi.nlm.nih.gov/sra/?term=Uroleptopsis_citrina+AND+RNA-Seq
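The query URL follows a fixed pattern, so it can be generated for any organism name. A minimal sketch, in which the function name sra_query_url is hypothetical:

```shell
# Hypothetical helper: print the NCBI SRA query URL for an organism name
# given in the Genus_species format.
sra_query_url() {
    printf 'https://www.ncbi.nlm.nih.gov/sra/?term=%s+AND+RNA-Seq\n' "$1"
}

sra_query_url Uroleptopsis_citrina
```

The printed URL can be opened in a browser to see whether any RNA-Seq experiments are listed.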
A gene model can be provided to help the Augustus prediction, although it is not mandatory. Choose the gene model closest to your genome in the Augustus species list:
https://github.com/Gaius-Augustus/Augustus/tree/master/config/species.
Two options can limit the number of SRA experiments used by AMAW.
The SRA accessions can be listed explicitly with --sra-list (--sra-list=SRR14871474), or a maximum storage size can be set with --max-storage (--max-storage=10 for 10 Gb).
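Since --sra-list expects a comma-separated string, accessions kept one per line in a text file have to be joined first. A minimal sketch, assuming such a file exists; the function name sra_list_arg is hypothetical:

```shell
# Hypothetical helper: join accessions (one per line) into the
# comma-separated string expected by --sra-list.
sra_list_arg() {
    # Drop blank lines, then join the remaining lines with commas.
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}
```

The result can then be passed on the command line, for example as --sra-list=$(sra_list_arg my_accessions.txt), where my_accessions.txt is a placeholder file name.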
Braker2 uses GeneMark-ET and Augustus to annotate eukaryotic proteins. Unlike AMAW, Braker2 is not designed for non-model organisms and does not search the NCBI SRA automatically: the SRA accession numbers must be provided by the user. Braker needs spliced alignments, which are produced here by Hisat2. Using RNAseq data from an organism that is phylogenetically too distant will produce an error because too few hints are generated. The protein evidence used by Braker comes from OrthoDB databases (https://github.com/gatech-genemark/ProtHint#protein-database-preparation). Currently, only fungi are provided in the GENERA tools (if more files are needed, please open an issue on the Git repository).
$ cp /scratch/ulg/GENERA/Nextflow-scripts/Braker.nf .
$ nextflow run Braker.nf --genome=genome.fna --prot=fungi --SRA=none --brakermode=prot --cpu=20 --currentpath=<PWD>
Name of the genome file, located in the current directory, with the .fna extension.
genome.fna
Path of the current directory, as printed by the pwd command.
This path must end with a '/'.
/scratch/ulg/bioec/lcornet/annot/
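Because pwd prints the path without a trailing '/', the slash has to be added before the value is passed to --currentpath. A minimal sketch, in which the function name with_trailing_slash is hypothetical:

```shell
# Hypothetical helper: print a path with exactly one trailing '/' appended
# when it is missing.
with_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;     # already ends with '/'
        *)  printf '%s/\n' "$1" ;;    # append the missing '/'
    esac
}

with_trailing_slash "$(pwd)"
```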
The user can specify the mode: with RNAseq + proteins (default = rnaseq) or with protein file only (= prot).
If the mode used is rnaseq, the user should provide a list of SRA numbers in a file.
The SRA numbers can be found with a query on NCBI:
https://www.ncbi.nlm.nih.gov/sra/?term=Uroleptopsis_citrina+AND+RNA-Seq
SRA.list
SRR7662950
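A malformed list file is easy to catch before the workflow starts. The sketch below assumes one run accession per line with the usual SRR, ERR or DRR prefix; whether Braker.nf accepts anything other than SRR accessions is not stated here, and the function name check_sra_list is hypothetical.

```shell
# Hypothetical check: succeed only if the file is non-empty and every line
# is a run accession (SRR, ERR or DRR followed by digits).
check_sra_list() {
    # grep -v selects lines that do NOT match the pattern; the check passes
    # when no such line exists.
    [ -s "$1" ] && ! grep -Evq '^[SED]RR[0-9]+$' "$1"
}
```

It can be used as a guard, for example: check_sra_list SRA.list || echo "SRA.list is empty or malformed".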
The user should specify which OrthoDB database will be used by the workflow.
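The constraints between these options (rnaseq mode needs an SRA list file, and --prot takes one of the names listed in the workflow help) can be checked in one place before submitting a job. This is a sketch under those assumptions only: check_braker_args is a hypothetical name, and Braker.nf may apply additional checks of its own.

```shell
# Hypothetical pre-flight check for the Braker.nf options.
# Usage: check_braker_args <brakermode> <SRA list file or 'none'> <prot>
check_braker_args() {
    mode=$1
    sra=$2
    prot=$3
    # Accepted --prot values, taken from the workflow help text.
    case $prot in
        fungi|protozoa|plants|test) ;;
        *) echo "unknown --prot value: $prot" >&2; return 1 ;;
    esac
    case $mode in
        prot) ;;    # protein-only mode needs no SRA list
        rnaseq)
            if [ "$sra" = none ] || [ ! -s "$sra" ]; then
                echo "--brakermode=rnaseq needs a non-empty SRA list file" >&2
                return 1
            fi ;;
        *) echo "unknown --brakermode value: $mode" >&2; return 1 ;;
    esac
    return 0
}
```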
Create a new directory:
$ mkdir annot
$ cd annot
Copy the Annotation suite from the shared directory:
$ cp /scratch/ulg/GENERA/Nextflow-scripts/Annotation-* .
Get the path of your current directory (the symlink path only):
$ pwd # this command will produce a path like: /scratch/ulg/bioec/lcornet/annot
Edit the job file to suit your needs, using either the prokaryote or the eukaryote job file:
$ nano -w Annotation-proka.job
The file should look like this:
#!/bin/bash
# Submission script for Nic5
#SBATCH --time=5-01:00:00 # days-hh:mm:ss
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=2625 # megabytes
#SBATCH --partition=bio
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
mkdir GENERA-annotation
singularity exec --bind <FIELD1>:/mnt /scratch/ulg/GENERA/prodigal-2.6.3.sif prodigal -i /mnt/<FIELD2>.fna
-o /mnt/GENERA.out -a /mnt/<FIELD2>.faa -d /mnt/<FI$
mv GENERA.out *.faa *.genes.fna GENERA-annotation/
#Notes : FIELD1 corresponds to the symlink path
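Instead of editing the placeholders by hand, they can be substituted with sed. The sketch below is an example only: fill_job_fields is a hypothetical name, and treating <FIELD2> as the genome file name without its .fna extension is inferred from the command shown in the job file rather than stated by it.

```shell
# Hypothetical helper: print a job template with <FIELD1> and <FIELD2>
# replaced. Values must not contain '|' or '&', which are special to sed here.
fill_job_fields() {
    template=$1
    field1=$2    # symlink path of the working directory
    field2=$3    # genome name (assumption: without the .fna extension)
    # '|' is used as the sed delimiter because paths contain '/'.
    sed -e "s|<FIELD1>|$field1|g" -e "s|<FIELD2>|$field2|g" "$template"
}
```

For example, fill_job_fields Annotation-proka.job "$(pwd)" GCF_000007145.1 > my.job writes a filled-in copy to my.job (a placeholder file name) and leaves the template untouched.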
Submit your job:
$ sbatch Annotation-proka.job
$ nano -w Annotation-euka.job
The file should look like this:
#!/bin/bash
# Submission script for Nic5
#SBATCH --time=5-01:00:00 # days-hh:mm:ss
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=2625 # megabytes
#SBATCH --partition=bio
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
rm -rf augustus-config/
cp -r /scratch/ulg/GENERA/Databases/AMAW/augustus-config .
mkdir NCBI
cd NCBI
mkdir .ncbi
cp /scratch/ulg/GENERA/user-settings.mkfg .ncbi/
cd ../
singularity exec --bind /scratch/ulg/GENERA/Databases/AMAW:/temp,<FIELD1>/NCBI:${HOME},<FIELD1>:/mnt \
--contain --workdir <FIELD1> /scratch/ulg/GENERA/amaw.sif amaw.pl --genome=/mnt/<FIELD2> \
--organism=<FIELD3> --proteins=1 --est=1 --taxdir=/temp/taxdump/ --maker-cpus=20 --trinity-cpus=20 \
--rsem-cpus=20 --augustus-db=/mnt/augustus-config/ --outdir=/mnt/GENERA-annotation --prot-dbs=/temp/prot_dbs/
Submit your job:
$ sbatch Annotation-euka.job
Copy the Braker suite from the shared directory:
$ cp /scratch/ulg/GENERA/Nextflow-scripts/Braker.* .
Get both the symlink path and the real path of your current directory:
$ pwd # this command will produce a path like: /scratch/ulg/bioec/lcornet/annot
$ readlink -f . # this command will produce a path like: /scratch/users/l/c/lcornet/annot
Use the two paths to complete the Nextflow config file:
$ mv Braker.config nextflow.config
$ nano -w nextflow.config # Ctrl-X to quit and save
The file should look like this:
process {
withName:augustusCongig {
container = '/scratch/ulg/GENERA/braker-2.sif'
}
withName:getprot {
container = '/scratch/ulg/GENERA/braker-2.sif'
}
withName:abbr {
container = '/scratch/ulg/GENERA/Genome-downloader.sif'
}
withName:hisat2 {
container = '/scratch/ulg/GENERA/braker-2.sif'
}
withName:braker {
container = '/scratch/ulg/GENERA/braker-2.sif'
}
withName:results {
container = '/scratch/ulg/GENERA/braker-2.sif'
}
}
singularity.enabled = true
singularity.cacheDir = "$PWD"
singularity.autoMounts = false
singularity.runOptions = '-B /scratch/ulg/bioec/lcornet/annot -B /scratch/users/l/c/lcornet/annot -B /scratch/ulg/GENERA/Databases/BRAKER/'
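The two user-specific bind paths in singularity.runOptions are the ones returned by pwd and readlink -f. As a minimal sketch, the line can be printed from those two values and pasted into nextflow.config; the function name braker_run_options is hypothetical.

```shell
# Hypothetical helper: print the singularity.runOptions line for
# nextflow.config from the symlink path and the real path.
braker_run_options() {
    symlink_path=$1
    real_path=$2
    printf "singularity.runOptions = '-B %s -B %s -B /scratch/ulg/GENERA/Databases/BRAKER/'\n" "$symlink_path" "$real_path"
}

# Run from the working directory that contains Braker.nf.
braker_run_options "$(pwd)" "$(readlink -f .)"
```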
Edit the job file to suit your needs:
#!/bin/bash
# Submission script for Nic5
#SBATCH --time=20-01:00:00 # days-hh:mm:ss
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=7500 # megabytes
#SBATCH --partition=bio
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
module --ignore-cache load Nextflow/21.08.0
nextflow run Braker.nf --genome=genome.fna --prot=fungi --brakermode=prot --SRA=none --cpu=20 \
--currentpath=/scratch/ulg/bioec/lcornet/annot/
Submit your job:
$ sbatch Braker.job
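The job files request memory per CPU, so the total memory asked of the scheduler is --cpus-per-task multiplied by --mem-per-cpu. The sketch below reads both values from a job file and prints the product in megabytes; it assumes each directive appears once, at the start of a line, and job_total_mem_mb is a hypothetical name.

```shell
# Hypothetical helper: print the total memory (in MB) requested by a Slurm
# job file, computed as cpus-per-task * mem-per-cpu.
job_total_mem_mb() {
    cpus=$(sed -n 's/^#SBATCH --cpus-per-task=\([0-9][0-9]*\).*/\1/p' "$1")
    mem=$(sed -n 's/^#SBATCH --mem-per-cpu=\([0-9][0-9]*\).*/\1/p' "$1")
    # Give up if either directive is missing from the file.
    if [ -z "$cpus" ] || [ -z "$mem" ]; then
        echo "could not read both directives from $1" >&2
        return 1
    fi
    echo $(( cpus * mem ))
}
```

With the values shown in the Braker job file (20 CPUs and 7500 MB per CPU), job_total_mem_mb Braker.job reports 150000 MB.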
Annotation produces an output directory called GENERA-annotation, whose content differs between prokaryotes and eukaryotes.
GENERA-annotation:
GENERA.out #Log of prodigal
file.faa #protein sequences
file.genes.fna #CDS sequences
GENERA-annotation:
efetch_result #fetch results for RNAseq
fastq-dump.sh #command for RNAseq download
maker_bopts.ctl #maker run files
maker_exe.ctl #maker run files
maker_opts.ctl #maker run files
maker_run.sh #maker run files
RSEM.genes.results #RSEM run files
RSEM.isoforms.results #RSEM run files
rsem.sh #RSEM run files
SRR7662950_1.fastq #Fastq downloaded
SRR7662950_2.fastq #Fastq downloaded
transcript_filter.sh #Trinity run files
Trinity_filtered.fasta #Trinity results
trinity_out_dir/ #Trinity run files
trinity.sh #Trinity run files
final_output/ #AMAW results
run.all.maker.augustus.proteins.fasta #Final file with proteins
run.all.maker.non_overlapping_ab_initio.transcripts.fasta
run.all.maker.augustus.transcripts.fasta
run_fasta.gff
run.all.maker.non_overlapping_ab_initio.proteins.fasta
run.gff
A directory GENERA-braker is created by the workflow.
GENERA-braker/
GENERA-braker.log #Log of the workflow
BRAKER/ #Results of Braker2
augustus.hints.aa #Final file with proteins
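A quick sanity check on any of the final protein files listed above is to count their FASTA records. A minimal sketch, in which the function name count_fasta_records is hypothetical:

```shell
# Hypothetical helper: print the number of sequences in a FASTA file by
# counting header lines (those starting with '>').
count_fasta_records() {
    # grep exits with status 1 when the count is zero; '|| true' keeps that
    # from being treated as an error.
    grep -c '^>' "$1" || true
}
```

For example, count_fasta_records GENERA-braker/BRAKER/augustus.hints.aa, assuming the file sits at that location in the output tree.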
It is important to clean your directory after the script has run.
No cleaning is necessary for Prodigal, but the AMAW directory should be cleaned.
$ cd GENERA-annotation; rm -rf efetch_result fastq-dump.sh maker_bopts.ctl maker_exe.ctl maker_opts.ctl \
maker_run.sh rsem.sh *.fastq transcript_filter.sh trinity_out_dir/ trinity.sh
$ rm -rf work
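Since rm -rf is irreversible, the AMAW cleanup can be guarded so that intermediate files are only removed once the final proteins file exists. This is a sketch, not part of GENERA: safe_clean_amaw is a hypothetical name, and it assumes that run.all.maker.augustus.proteins.fasta sits inside final_output/.

```shell
# Hypothetical helper: remove the AMAW intermediate files from an output
# directory, but only if the final proteins file is present and non-empty.
safe_clean_amaw() {
    dir=$1
    # Assumption: the final proteins file is located in final_output/.
    if [ ! -s "$dir/final_output/run.all.maker.augustus.proteins.fasta" ]; then
        echo "final proteins file missing or empty in $dir; nothing removed" >&2
        return 1
    fi
    # Same list of files as in the manual cleanup command; the subshell keeps
    # the caller's working directory unchanged.
    ( cd "$dir" && rm -rf efetch_result fastq-dump.sh maker_bopts.ctl \
        maker_exe.ctl maker_opts.ctl maker_run.sh rsem.sh ./*.fastq \
        transcript_filter.sh trinity_out_dir/ trinity.sh )
}
```

It would be called as safe_clean_amaw GENERA-annotation from the directory in which the job was run.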
Bio-MUST-Core: https://metacpan.org/pod/Bio::MUST::Core
Prodigal: https://github.com/hyattpd/Prodigal
Trinity: https://github.com/trinityrnaseq/trinityrnaseq/wiki
Bowtie 2: https://github.com/BenLangmead/bowtie2
Jellyfish: https://github.com/gmarcais/Jellyfish
Salmon: https://github.com/COMBINE-lab/salmon
Blast: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Tandem Repeats Finder: https://tandem.bu.edu/trf/trf409.linux64.download.html
RMBlast: http://www.repeatmasker.org/RMBlast.html
Repeatmasker: https://www.repeatmasker.org/RepeatMasker/
Exonerate: https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
MAKER: https://www.yandell-lab.org/software/maker.html
Augustus: https://github.com/Gaius-Augustus/Augustus
Braker2: https://github.com/Gaius-Augustus/BRAKER
If you use this tool for prokaryotes, please cite the GENERA GitHub (https://github.com/Lcornet/GENERA) and the Prodigal paper (https://github.com/hyattpd/Prodigal). If you use this tool for eukaryotes, please cite the AMAW paper (https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1) and the MAKER2 paper (https://www.yandell-lab.org/software/maker.html).
Nothing reported.