
10. Annotation

Lcornet edited this page Nov 10, 2022 · 14 revisions

Annotation


1. Goal

The annotation part of GENERA predicts proteins for prokaryotes and eukaryotes.
Three tools are available: Prodigal for prokaryotes, and AMAW and Braker for eukaryotes.
Prodigal and AMAW can be run directly from their containers, without Nextflow. Braker is the exception and is run through a Nextflow workflow.

2. HELP

2.1 Prodigal

$ singularity exec /scratch/ulg/GENERA/prodigal-2.6.3.sif prodigal -h
prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type]
         [-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file]
         [-p mode] [-q] [-s start_file] [-t training_file] [-v]

     -a:  Write protein translations to the selected file.
     -c:  Closed ends.  Do not allow genes to run off edges.
     -d:  Write nucleotide sequences of genes to the selected file.
     -f:  Select output format (gbk, gff, or sco).  Default is gbk.
     -g:  Specify a translation table to use (default 11).
     -h:  Print help menu and exit.
     -i:  Specify FASTA/Genbank input file (default reads from stdin).
     -m:  Treat runs of N as masked sequence; don't build genes across them.
     -n:  Bypass Shine-Dalgarno trainer and force a full motif scan.
     -o:  Specify output file (default writes to stdout).
     -p:  Select procedure (single or meta).  Default is single.
     -q:  Run quietly (suppress normal stderr output).
     -s:  Write all potential genes (with scores) to the selected file.
     -t:  Write a training file (if none exists); otherwise, read and use
          the specified training file.
     -v:  Print version number and exit.

2.2 AMAW

$ singularity exec /scratch/ulg/GENERA/amaw.sif amaw.pl --help

Usage:
       amaw.pl --genome <file> --organism <Genus_species> --taxdir <dir> [options]
       amaw.pl --help
       amaw.pl --man
       amaw.pl --usage
       amaw.pl --version

Required arguments:
--genome [=] <file>
    Path to your genomic data.

--organism [=] <Genus_species>
    Name of your organism with the format Genus_species or
    Genus_species_strain (for the augustus gene model name).

--taxdir [=] <dir>
    Path to local mirror of the NCBI Taxonomy database,IF YOU USE
    --proteins, THIS OPTION IS MANDATORY.

Options:
--sge
    Use SGE job templating

--version3
    Enables additional MAKER3 options.

--mpi
    Uses MPI parallelization.

--queue [=] <queue>
    bignode.q, smallnodes.q or gpunode.q. [default: smallnodes.q]

--email [=] <email>
    Type your email address to get informations with the progress of the
    different steps.

--jobname [=] <jobname>
    basename for your jobs, by default it will be your organism namewith
    a unique id.

--singularity
    This option handles the Singularity container way to use the folder
    paths when launching the different steps of AMAW.

--org-type [=] <org>
    Eukaryotic or prokaryotic [Default: eukaryotic].

--est [=] <n>
    Use of EST/RNA-seq evidence (boolean switch: 0/1) [default: 0].

--est-file [=] <file>
    Use of a selected EST/RNA-seq file instead of downloading and
    assembling SRA data from NCBI.

--max-storage [=] <int>
    Maximal experiment size (in Gb) to download [default: 10].

--max-experiments [=] <int>
    Maximal number of experiments to be downloaded (precedes
    --max-storage). By default, this option is not applied.

--sra-list [=] <string>
    Comma-separated string containing all RNA-Seq SRA accession numbers.
    Currently, only Illumina paired-end read datasets are allowed. This
    option disables the search of available organism-specific SRAs.

--transcript-db [=] <path>
    Alternative way to use automatically transcript assemblies: select
    the path to a folder where you put all your transcripts with this
    nomenclature:

    Genus_species*.fasta.

    If no transcript file of your database matched your organism, SRA
    search will be executed.

--proteins [=] <n>
    Use of protein evidence (boolean switch: 0/1), if activated,
    --taxdir option is needed. [default: 1]

--protein-file [=] <file>
    Use of a selected protein file instead of using the default protein
    database.

--outdir [=] <dirname>
    Name of the output directory containing the results. [default:
    amaw_output]

--maker-cpus [=] <n>
    Number of cpus to use for maker runs. [default: 25]

--trinity-cpus [=] <n>
    Number of cpus to use for Trinity. By default, it will calculate the
    number of cpus to use in function of the number of reads.

--trinity-memory [=] <n>
    Number of GB RAM of memory to use for Trinity. By default, this
    value is calculated in function of number of reads (1GB of memory by
    million of paired-end reads).

    By default, this calculated value is restricted to 50 GB, provide a
    value with this option to override this limit.

--rsem-cpus [=] <n>
    Number of cpus to use for rsem. By default, it will calculate the
    number of CPUs to use in function of the number of reads.

--augustus-prefix [=] <name>
    Prefix to recognize Augustus gene models create for a user or a
    project. By default, none.

--augustus-gm [=] <string>
    Name of an already existing augustus gene model (see the list on the
    augustus config/species/ folder).

    This option only launches one MAKER run directly using the augustus
    gene model.

--gm-db
    Enables the use of previous gene models avaible for snap and
    augustus (needs --snap-db and --augustus-db options).

    This option will only launch one MAKER run directly using the SNAP
    and augustus gene models (named as the --organism option in the
    folders given with --snap-db and augustus-db).

--snap-db [=] <dir>
    Path to snap gene model database (e.g. /home/snap-db/).

--augustus-db [=] <dir>
    Path to augustus gene model database (e.g. /media/#
    /apps/augustus-3.2.2/config/species/).

--augustus-new
    This option removes the previous augustus gene model. For this give,
    the programs need the directory with gene models (Needs
    --augustus-db option).

--masking [=] <n>
    Mask repetitive elements in the genome (strongly recommanded if the
    genome sequences are not masked) (boolean switch: 0/1) [default: 1].

--soft-mask [=] <n>
    Use soft-masking rather than hard-masking in BLAST (boolean switch:
    0/1) [default: 1].

--split-length [=] <length>
    Length for dividing up contigs into chunks (increases/decreases
    memory usage) [default: 100000].

--pred-flank [=] <n>
    Flank for extending evidence clusters sent to gene predictors

--min-intron [=] <n>
    Minimum intron length (used for alignment polishing) [default: 20].

--pred-stats [=] <n>
    Report AED and QI statistics for all predictions as well as models
    (boolean switch: 0/1) [default: 1].

--blast-type [=] <type>
    Set to 'ncbi+', 'ncbi' or 'wublast'.

--blastn-cov [=] <percent>
    Blastn Percent Coverage Threshold EST-Genome Alignments [default:
    0.8].

--blastn-id [=] <percent>
    Blastn Percent Identity Threshold EST-Genome Alignments [default:
    0.85].

--blastn-eval [=] <eval>
    Blastn e-value cutoff [default: 1e-10].

--blastn-depth [=] <n>
    Number of BLAST alignments, per query, to be used for annotation.
    Low values decreases the memory use and the runtime for large
    evidence datasets [default: 0].

--blastx-cov [=] <percent>
    Blastx Percent Coverage Threhold Protein-Genome Alignments [default:
    0.5].

--blastx-id [=] <percent>
    Blastx Percent Identity Threhold Protein-Genome Alignments [default:
    0.4].

--blastx-eval [=] <eval>
    Blastx e-value cutoff [default: 1e-06].

--blastx-depth [=] <n>
    Number of BLAST alignments, per query, to be used for annotation.
    Low values decreases the memory use and the runtime for large
    evidence datasets [default: 0].

--prot-dbs [=] <directory_path>
    Path to the protein databases for the different eukaryotic clades.

--version
--usage
--help
--man
    print the usual manual

2.3 Braker

The typical command for running the pipeline is as follows:

nextflow run Braker.nf --genome=genome.fna --prot=fungi --SRA=none --brakermode=prot --cpu=20 --currentpath=<PWD>

Mandatory arguments:
--genome                 Specify genome
--currentpath            Specify your current full path (as obtenied by pwd), for TMPDIR

Optional arguments:
--brakermode             Specify the mode of braker, with RNAseq + proteins (default = rnaseq) or with protein file only (= prot)
--SRA                    Specify rnaseq SRA list file, default = none 
--prot                   Specify which prot file to use: fungi, protozoa, plants or test, default = fungi 
--cpu                    number of cpus to use, default = 1

3. Usage

3.1 Prodigal

The user should provide a prokaryote genome. Prodigal will return the predicted proteins and coding sequences.

singularity exec /scratch/ulg/GENERA/prodigal-2.6.3.sif prodigal -i GCF_000007145.1.fna -o GCF_000007145.1.out \
      -a GCF_000007145.1.faa -d GCF_000007145.1.genes.fna
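
Once Prodigal has finished, a quick sanity check is to count the predicted sequences. The sketch below is only an illustration: the helper name count_fasta_records is hypothetical (it is not part of GENERA), and a tiny stand-in FASTA file is generated so that the snippet runs without Prodigal. On real data, the file would be the .faa written with -a.

```shell
# Hypothetical helper (not part of GENERA): count FASTA records in a file.
count_fasta_records() {
    # grep -c prints the number of header lines (lines starting with '>');
    # "|| true" keeps the function from failing when the count is 0.
    grep -c '^>' "$1" || true
}

# Stand-in for GCF_000007145.1.faa so that the example is self-contained.
demo_dir=$(mktemp -d)
printf '>gene_1\nMKT*\n>gene_2\nMSL*\n' > "$demo_dir/demo.faa"

echo "Predicted proteins: $(count_fasta_records "$demo_dir/demo.faa")"
```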

3.2 AMAW

The user should provide a eukaryote genome, an organism name (in the Genus_species format), which will be used to search for RNAseq data on the NCBI, and optionally a gene model for Augustus. Please see points 3.2.1 and 3.2.2 for information on the organism name and the Augustus gene model.

$ singularity exec /scratch/ulg/GENERA/amaw.sif amaw.pl --genome=/mnt/<FIELD2> --organism=<FIELD3> \
  --proteins=1 --est=1 --taxdir=/temp/taxdump/ --maker-cpus=20 --trinity-cpus=20 --rsem-cpus=20 \
  --outdir=/mnt/GENERA-annotation --prot-dbs=/temp/prot_dbs/

3.2.1 organism name option

AMAW automatically downloads RNAseq data from the SRA portal of the NCBI. To check that RNAseq experiments are available for your organism, a simple query on the NCBI can be run:

https://www.ncbi.nlm.nih.gov/sra/?term=Uroleptopsis_citrina+AND+RNA-Seq
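
The query URL follows a simple pattern, so it can be generated for any organism name. The following is a minimal sketch, assuming the URL pattern shown above stays valid; the function name sra_query_url is hypothetical and only prints the URL to open in a browser.

```shell
# Hypothetical helper: build the SRA search URL for an organism name
# written in the Genus_species format expected by --organism.
sra_query_url() {
    local organism="$1"
    printf 'https://www.ncbi.nlm.nih.gov/sra/?term=%s+AND+RNA-Seq\n' "$organism"
}

sra_query_url Uroleptopsis_citrina
```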

3.2.2 Augustus gene model option

A gene model can optionally be provided to help the Augustus prediction. Choose the gene model closest to your genome in the Augustus species list:

https://github.com/Gaius-Augustus/Augustus/tree/master/config/species

3.2.3 limit on SRA download

Two options can limit the amount of SRA data used by AMAW.
The SRA runs can be listed explicitly with --sra-list (e.g. --sra-list=SRR14871474), or a maximum download size can be set with --max-storage (e.g. --max-storage=10 for 10 GB).
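
Since --sra-list expects a comma-separated string, a file with one accession per line has to be joined first. The sketch below shows one way to do this with paste; the file is a throwaway stand-in, and the two accessions are simply the ones quoted on this page, used for illustration only.

```shell
# Stand-in accession file, one SRA run per line (accessions taken from this page).
list_file=$(mktemp)
printf 'SRR14871474\nSRR7662950\n' > "$list_file"

# paste -s joins all lines of the file; -d, sets the comma as separator.
sra_list=$(paste -sd, "$list_file")

echo "--sra-list=$sra_list"
```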

3.3 Braker

Braker2 uses GeneMark-ET and Augustus to annotate eukaryotic proteins. Unlike AMAW, Braker2 is not designed for non-model organisms and does not search the NCBI for SRA data automatically: the SRA numbers used by Braker must be provided by the user. Braker needs spliced alignments, which are produced here by Hisat2. Using SRA data from organisms that are too distant phylogenetically will produce an error due to a low number of hints. The protein evidence used by Braker comes from OrthoDB databases (https://github.com/gatech-genemark/ProtHint#protein-database-preparation). Currently, only fungi are provided in the GENERA tools (if more files are needed, please open an issue on the git repository).

 $ cp /scratch/ulg/GENERA/Nextflow-scripts/Braker.nf .
 $ nextflow run Braker.nf --genome=genome.fna --prot=fungi --SRA=none --brakermode=prot --cpu=20 --currentpath=<PWD>

3.3.1 Input files

3.3.1.1 genome

The genome name, present in the current directory, with .fna extension.

genome.fna

3.3.1.2 current path

Path of the current directory, as returned by the pwd command (or the $PWD variable).
This path should end with a '/'.

/scratch/ulg/bioec/lcornet/annot/
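
Because pwd does not print the trailing '/', it has to be added before the value is passed to --currentpath. A minimal sketch follows; the helper name with_trailing_slash is hypothetical.

```shell
# Hypothetical helper: return a path that ends with exactly one '/'.
with_trailing_slash() {
    # ${1%/} strips one trailing '/' if present, then one is added back.
    printf '%s/\n' "${1%/}"
}

echo "--currentpath=$(with_trailing_slash "$PWD")"
```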

3.3.1.3 brakermode

The user can specify the mode: with RNAseq + proteins (default = rnaseq) or with protein file only (= prot).

3.3.1.4 SRA

If the mode used is rnaseq, the user should provide a list of SRA numbers in a file. The SRA numbers can be found with a query on the NCBI:
https://www.ncbi.nlm.nih.gov/sra/?term=Uroleptopsis_citrina+AND+RNA-Seq

SRA.list
SRR7662950
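
Before launching the workflow, it can help to check that the list contains only well-formed accessions. The sketch below assumes one run accession per line (SRR, ERR or DRR followed by digits), as in the example above; the function name invalid_sra_lines is hypothetical and the check is not performed by the workflow itself.

```shell
# Hypothetical check: report lines of an SRA list that do not look like
# run accessions (SRR, ERR or DRR followed by digits).
invalid_sra_lines() {
    grep -vE '^[SED]RR[0-9]+$' "$1" || true
}

# Stand-in list with one deliberately wrong line.
list_file=$(mktemp)
printf 'SRR7662950\nnot_an_accession\n' > "$list_file"

invalid_sra_lines "$list_file"
```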

3.3.1.5 prot

The user should specify which OrthoDB database will be used by the workflow.

3.4 HPC usage

Create a new directory:

$ mkdir annot
$ cd annot

Copy the Annotation suite from the shared directory:

$ cp /scratch/ulg/GENERA/Nextflow-scripts/Annotation-* .

Get the path of your current directory (the symlink path only):

$ pwd            # this command will produce a path like: /scratch/ulg/bioec/lcornet/annot 

Edit the job file according to your needs. Edit either the prokaryote or the eukaryote job file:

3.4.1 Prokaryote

$ nano -w Annotation-proka.job
The file should look like this:
     #!/bin/bash
     # Submission script for Nic5
     #SBATCH --time=5-01:00:00 # days-hh:mm:ss
     #
     #SBATCH --ntasks=1
     #SBATCH --cpus-per-task=20
     #SBATCH --mem-per-cpu=2625 # megabytes
     #SBATCH --partition=bio

     export OMP_NUM_THREADS=20
     export MKL_NUM_THREADS=20

     mkdir GENERA-annotation
     singularity exec --bind <FIELD1>:/mnt /scratch/ulg/GENERA/prodigal-2.6.3.sif prodigal -i /mnt/<FIELD2>.fna 
     -o /mnt/GENERA.out -a /mnt/<FIELD2>.faa -d /mnt/<FI$
     mv GENERA.out *.faa *.genes.fna GENERA-annotation/

     #Notes : FIELD1 corresponds to the symlink path        
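
As an alternative to editing the placeholders by hand in nano, they can be substituted with sed. This is only a sketch under stated assumptions: the one-line template is a shortened stand-in for the job file, the path and genome name are example values, and FIELD2 is assumed to be the genome file name without its .fna extension, as suggested by the command above.

```shell
# Stand-in template containing the same placeholders as the job file.
job=$(mktemp)
echo 'singularity exec --bind <FIELD1>:/mnt prodigal -i /mnt/<FIELD2>.fna' > "$job"

# Example values: FIELD1 is the symlink path, FIELD2 the genome basename.
field1=/scratch/ulg/bioec/lcornet/annot
field2=GCF_000007145.1

# '|' is used as the sed delimiter because the path contains '/'.
filled=$(sed -e "s|<FIELD1>|$field1|g" -e "s|<FIELD2>|$field2|g" "$job")
echo "$filled"
```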

Submit your job:

$ sbatch Annotation-proka.job

3.4.2 Eukaryote

3.4.2.1 AMAW

$ nano -w Annotation-euka.job
The file should look like this:
     #!/bin/bash
     # Submission script for Nic5
     #SBATCH --time=5-01:00:00 # days-hh:mm:ss
     #
     #SBATCH --ntasks=1
     #SBATCH --cpus-per-task=20
     #SBATCH --mem-per-cpu=2625 # megabytes
     #SBATCH --partition=bio

     export OMP_NUM_THREADS=20
     export MKL_NUM_THREADS=20

     rm -rf augustus-config/
     cp -r /scratch/ulg/GENERA/Databases/AMAW/augustus-config .
     mkdir NCBI
     cd NCBI
     mkdir .ncbi
     cp /scratch/ulg/GENERA/user-settings.mkfg .ncbi/
     cd ../
     singularity exec --bind /scratch/ulg/GENERA/Databases/AMAW:/temp,<FIELD1>/NCBI:${HOME},<FIELD1>:/mnt \
     --contain --workdir <FIELD1> /scratch/ulg/GENERA/amaw.sif amaw.pl --genome=/mnt/<FIELD2> \
     --organism=<FIELD3> --proteins=1 --est=1 --taxdir=/temp/taxdump/ --maker-cpus=20 --trinity-cpus=20 \
     --rsem-cpus=20 --augustus-db=/mnt/augustus-config/ --outdir=/mnt/GENERA-annotation --prot-dbs=/temp/prot_dbs/

Submit your job:

$ sbatch Annotation-euka.job

3.4.2.2 Braker

Copy the Braker suite from the shared directory:

$ cp /scratch/ulg/GENERA/Nextflow-scripts/Braker.* .

Get the path of your current directory, symlink and original path:

$ pwd            # this command will produce a path like: /scratch/ulg/bioec/lcornet/annot
$ readlink -f .  # this command will produce a path like: /scratch/users/l/c/lcornet/annot
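
The difference between the two paths can be reproduced on any machine. The sketch below builds a throwaway directory and a symlink to it, then prints the logical path (symlink kept) and the physical path (symlink resolved); pwd -P is used here as a portable counterpart to readlink -f . and all directory names are made up for the demonstration.

```shell
# Build a throwaway directory and a symlink pointing to it.
base=$(mktemp -d)
mkdir "$base/real"
ln -s "$base/real" "$base/link"

cd "$base/link" || exit 1
logical=$(pwd -L)    # symlink kept, like the first path above
physical=$(pwd -P)   # symlink resolved, comparable to readlink -f .

echo "symlink path:  $logical"
echo "original path: $physical"
```

Both paths are needed in the Nextflow config because the container must be able to see the directory under either name.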

Use the two paths to complete the Nextflow config file:

$ mv Braker.config nextflow.config
$ nano -w nextflow.config    # Ctrl-X to quit and save
The file should look like this:
process {
    withName:augustusCongig {
        container = '/scratch/ulg/GENERA/braker-2.sif'
    }
    withName:getprot {
        container = '/scratch/ulg/GENERA/braker-2.sif'
    }
    withName:abbr {
        container = '/scratch/ulg/GENERA/Genome-downloader.sif'
    }
    withName:hisat2 {
        container = '/scratch/ulg/GENERA/braker-2.sif'
    }
    withName:braker {
        container = '/scratch/ulg/GENERA/braker-2.sif'
    }
    withName:results {
        container = '/scratch/ulg/GENERA/braker-2.sif'
    }
}
singularity.enabled = true
singularity.cacheDir = "$PWD"
singularity.autoMounts = false
singularity.runOptions = '-B /scratch/ulg/bioec/lcornet/annot -B /scratch/users/l/c/lcornet/annot -B /scratch/ulg/GENERA/Databases/BRAKER/'

Edit the job file (Braker.job) according to your needs:

#!/bin/bash
# Submission script for Nic5
#SBATCH --time=20-01:00:00 # days-hh:mm:ss
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=7500 # megabytes
#SBATCH --partition=bio

export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20

module --ignore-cache load Nextflow/21.08.0
nextflow run Braker.nf --genome=genome.fna --prot=fungi --brakermode=prot --SRA=none --cpu=20 \
--currentpath=/scratch/ulg/bioec/lcornet/annot/

Submit your job:

$ sbatch Braker.job

4. Output

Annotation produces an output directory called GENERA-annotation, whose content differs between prokaryotes and eukaryotes.

4.1 Prokaryote

GENERA-annotation:
GENERA.out                 #Log of prodigal
file.faa                   #protein sequences
file.genes.fna             #CDS sequences
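
A short check can confirm that the three expected files are present and non-empty after the job. This is a sketch only: the function name missing_outputs is hypothetical, and the demonstration runs on a stand-in directory in which one file is deliberately absent.

```shell
# Hypothetical check: list expected Prodigal outputs that are missing or empty.
missing_outputs() {
    local dir="$1" name="$2" f
    for f in GENERA.out "$name.faa" "$name.genes.fna"; do
        # -s is true only if the file exists and has a size greater than zero.
        [ -s "$dir/$f" ] || echo "$f"
    done
}

# Stand-in output directory with one file deliberately absent.
outdir=$(mktemp -d)
echo 'log' > "$outdir/GENERA.out"
echo '>gene_1' > "$outdir/file.faa"

missing_outputs "$outdir" file
```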

4.2 Eukaryote

4.2.1 AMAW

GENERA-annotation:
efetch_result               #fetch results for RNAseq
fastq-dump.sh               #command for RNAseq download
maker_bopts.ctl             #maker run files
maker_exe.ctl               #maker run files
maker_opts.ctl              #maker run files
maker_run.sh                #maker run files
RSEM.genes.results          #RSEM run files
RSEM.isoforms.results       #RSEM run files
rsem.sh                     #RSEM run files
SRR7662950_1.fastq          #Fastq downloaded
SRR7662950_2.fastq          #Fastq downloaded
transcript_filter.sh        #Trinity run files
Trinity_filtered.fasta      #Trinity results
trinity_out_dir/            #Trinity run files
trinity.sh                  #Trinity run files
final_output/               #AMAW results
       run.all.maker.augustus.proteins.fasta                       #Final file with proteins                    
       run.all.maker.non_overlapping_ab_initio.transcripts.fasta
       run.all.maker.augustus.transcripts.fasta                   
       run_fasta.gff
       run.all.maker.non_overlapping_ab_initio.proteins.fasta     
       run.gff

4.2.2 Braker

A directory GENERA-braker is created by the workflow.

 GENERA-braker/
 GENERA-braker.log           #Log of the workflow
 BRAKER/                     #Results of Braker2
       augustus.hints.aa     #Final file with proteins     

5. Cleaning

It is important to clean your directory after running the script.
No cleaning is necessary for Prodigal, but the AMAW directory should be cleaned.

$ cd GENERA-annotation; rm -rf efetch_result fastq-dump.sh maker_bopts.ctl maker_exe.ctl maker_opts.ctl \
  maker_run.sh rsem.sh *.fastq transcript_filter.sh trinity_out_dir/ trinity.sh
$ rm -rf work
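
Since rm -rf cannot be undone, the cleanup can first be rehearsed on a throwaway directory. The sketch below recreates a few of the intermediate files listed in section 4.2.1 with dummy content, removes them with the same file list as above, and shows that final_output/ is left in place. The directory and its files are stand-ins created only for this demonstration.

```shell
# Throwaway stand-in for GENERA-annotation with a few intermediate files.
outdir=$(mktemp -d)
mkdir "$outdir/final_output" "$outdir/trinity_out_dir"
touch "$outdir/maker_run.sh" "$outdir/SRR7662950_1.fastq" "$outdir/final_output/run.gff"

# Remove only the intermediates; anything else, including final_output/,
# is left untouched. The subshell keeps the caller's working directory.
(
    cd "$outdir" || exit 1
    rm -rf efetch_result fastq-dump.sh maker_bopts.ctl maker_exe.ctl maker_opts.ctl \
        maker_run.sh rsem.sh ./*.fastq transcript_filter.sh trinity_out_dir/ trinity.sh
)

ls "$outdir"
```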

7. Programs used

Bio-MUST-Core: https://metacpan.org/pod/Bio::MUST::Core
Prodigal: https://github.com/hyattpd/Prodigal
Trinity: https://github.com/trinityrnaseq/trinityrnaseq/wiki
Bowtie 2: https://github.com/BenLangmead/bowtie2
Jellyfish: https://github.com/gmarcais/Jellyfish
Salmon: https://github.com/COMBINE-lab/salmon
Blast: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Tandem Repeats Finder: https://tandem.bu.edu/trf/trf409.linux64.download.html
RMBlast: http://www.repeatmasker.org/RMBlast.html
Repeatmasker: https://www.repeatmasker.org/RepeatMasker/
Exonerate: https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
MAKER: https://www.yandell-lab.org/software/maker.html
Augustus: https://github.com/Gaius-Augustus/Augustus
Braker2: https://github.com/Gaius-Augustus/BRAKER

8. How to cite

If you use this tool for prokaryotes, please cite the GENERA GitHub repository (https://github.com/Lcornet/GENERA) and the Prodigal paper (https://github.com/hyattpd/Prodigal). If you use this tool for eukaryotes, please cite the AMAW paper (https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1) and the MAKER2 paper (https://www.yandell-lab.org/software/maker.html).

9. FAQ

Nothing reported.