
20. CRACOT

Lcornet edited this page Nov 14, 2022 · 4 revisions

CRACOT


1. Goal

This tool simulates genomic contamination.
It produces chimeric genomes with three types of contamination: redundant, replaced, and single.

2. HELP

The typical command for running the pipeline is as follows:

nextflow run CRACOT.nf --genomes=genomes --lineage=Genomes.taxomonomy --list=positive-list.txt 
--cpu=60 --num=100 --taxorank=phylum --redundant=2 --replacement=2 --single=4 
--hgtrate=none --hgtrandom=no --redundanthgt=0 --replacementhgt=0 --singlehgt=0 

Mandatory arguments:
--genomes                Specify directory with genomes
--lineage                Specify lineage file path
--list                   Specify positive list file path

Optional arguments:
--taxorank               Specify taxonomic rank, phylum or order or family or genus or species, default=phylum
--num                    Specify number of chimeric genomes, default=100
--maskslave              Used only slave genome one time, default = no
--redundant              Specify number of redundant events, default=5
--replacement            Specify number of replacement events, default=5
--single                 Specify number of single events, default=5
--hgtrate                Specify the mutation rate (1 to 99%) of HGT events, default = none
--redundanthgt           Specify number of hgt redundant events, default=0
--replacementhgt         Specify number of hgt replacement events, default=0
--singlehgt              Specify number of hgt single events, default=0   
--merge                  Merge redundant and single to the last contig of the chimeric genome, yes or no, default = yes
--hgtrandom              Activate the random insertion of HGT events, yes or no, default = no, Recommended with SINGLE hgt events only.
--cpu                    number of cpus to use, default = 1

3. Usage

The user should specify three mandatory fields: a directory with genomes, a lineage file for these genomes (as produced by Genome-downloader.nf), and a positive list of genomes to use.
Optionally, the user can specify the taxonomic rank (from phylum to species), the number of chimeric genomes to produce, the percentage of each type of event, and the mutation rate (from 0 to 25 %) for HGT.

$ cp /scratch/ulg/GENERA/Nextflow-scripts/CRACOT.nf .
$ module --ignore-cache load Nextflow/21.08.0
$ nextflow run CRACOT.nf --genomes=genomes --lineage=Genomes.taxomonomy --list=positive-list.txt

3.1 Input files

3.1.1 genomes

The genomes directory should contain genomes in fasta with the extension .fna.

3.1.2 lineage

The lineage file should contain the taxonomy of all the genomes present in the positive list.

GCF_000003215.1 Clostridioides difficile_GCF_000003215.1        GCF_000003215.1 Firmicutes; Clostridia; Eubacteriales; Peptostreptococcaceae; Clostridioides; Clostridioides difficile
GCF_000007085.1 Caldanaerobacter subterraneus_GCF_000007085.1   GCF_000007085.1 Firmicutes; Clostridia; Thermoanaerobacterales; Thermoanaerobacteraceae; Caldanaerobacter; Caldanaerobacter subterraneus
GCF_000007625.1 Clostridium tetani_GCF_000007625.1      GCF_000007625.1 Firmicutes; Clostridia; Eubacteriales; Clostridiaceae; Clostridium; Clostridium tetani

This file is automatically produced by Genome-downloader.nf.

3.1.3 list

This file is a positive list of genomes to use.
The genomes folder can thus have more genomes than the positive list.
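
Since the genomes directory, the lineage file, and the positive list must agree, it can help to cross-check them before launching the pipeline. The following is a minimal sketch and not part of CRACOT. It assumes that the positive list contains one genome ID per line, that genome files are named <ID>.fna, and that the first column of the lineage file is the genome ID; none of these is confirmed by this page. The function name check_cracot_inputs is hypothetical.

```shell
#!/bin/bash
# Sketch: check that each ID in the positive list has a genome file and a lineage entry.
# ASSUMPTIONS (not confirmed by the CRACOT documentation):
#   - the positive list holds one genome ID per line (e.g. GCF_000003215.1)
#   - genome files are named <ID>.fna
#   - the first whitespace-separated column of the lineage file is the genome ID
check_cracot_inputs() {
    local genomes_dir=$1 lineage=$2 list=$3
    local id missing=0
    while IFS= read -r id || [ -n "$id" ]; do
        [ -z "$id" ] && continue                  # skip empty lines
        if [ ! -f "$genomes_dir/$id.fna" ]; then
            echo "missing genome file: $genomes_dir/$id.fna"
            missing=1
        fi
        # awk exits with 0 only if some line has the ID in its first column
        if ! awk -v id="$id" '$1 == id { found = 1 } END { exit !found }' "$lineage"; then
            echo "missing lineage entry: $id"
            missing=1
        fi
    done < "$list"
    return "$missing"
}
```

For example, check_cracot_inputs genomes Genomes.taxomonomy positive-list.txt prints nothing and returns 0 when every listed genome is found in both places.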

3.1.4 taxorank

Specify the taxonomic rank to use.
For instance, the phylum rank (e.g., Firmicutes) means that the two genomes belong to the same phylum but not to the same class.
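
To judge which rank leaves enough genome pairs, one can count how many genomes fall under each taxon at a given rank. The sketch below is not part of CRACOT. It assumes that the lineage file is tab-separated, that the lineage is the last column, and that ranks are separated by "; " in the order phylum, class, order, family, genus, species, as in the example of section 3.1.2. The function name count_by_rank is hypothetical.

```shell
#!/bin/bash
# Sketch: count genomes per taxon at one rank of the lineage file.
# ASSUMPTIONS: tab-separated file, lineage in the last column,
# ranks separated by "; " (1 = phylum, 2 = class, ... 6 = species).
count_by_rank() {
    local lineage=$1 rank=$2
    awk -F'\t' -v rank="$rank" '
        {
            split($NF, taxa, "; ")      # taxa[1] = phylum ... taxa[6] = species
            count[taxa[rank]]++
        }
        END { for (t in count) print t "\t" count[t] }
    ' "$lineage" | sort
}
```

For example, count_by_rank Genomes.taxomonomy 1 lists each phylum with its number of genomes, and count_by_rank Genomes.taxomonomy 2 does the same for classes.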

3.1.5 num

Specify the number of chimeric genomes to create.
The creation of a chimeric genome is based on the orthology and on the number of common and single genes.
If this number is not sufficient to create the genome, CRACOT will not output the genome but a message like this:

GCF_013296445.1  Clostridia- Eubacteriales      GCF_900166995.1  Clostridia- Thermoanaerobacterales     The number of available proteins in Sub genome are not sufficient: 215 proteins while 216 are needed for redundancy and/or replacement - 618 proteins while 0 are needed for single.

3.1.6 maskslave

By default, a slave genome (the contaminant) is used only once.
The user can set this option to yes to reuse a slave genome multiple times, which reduces the computation time.

3.1.7 redundant

Percentage of redundant events.

3.1.8 replacement

Percentage of replaced events.

3.1.9 single

Percentage of single events.

3.1.10 hgtrate

Percentage of mutation for HGT simulation, from 0 to 25.

3.1.11 redundanthgt

Percentage of HGT redundant events.

3.1.12 replacementhgt

Percentage of HGT replaced events.

3.1.13 singlehgt

Percentage of HGT single events.

3.1.14 merge

When this option is set to yes, the redundant and single events are merged into the last contig.
If set to no, a new contig is created for these two events.

3.1.15 hgtrandom

If activated, the HGT events are inserted at random positions in the main genome.

3.1.16 cpu

Specify the number of CPUs to use.

3.2 HPC usage

Create a new directory:

$ mkdir CRACOT
$ cd CRACOT

Copy the CRACOT files from the shared directory:

$ cp /scratch/ulg/GENERA/Nextflow-scripts/CRACOT.* .

Get both paths of your current directory, the symlinked path and the resolved original path:

$ pwd            # this command will produce a path like: /scratch/ulg/bioec/lcornet/CRACOT
$ readlink -f .  # this command will produce a path like: /scratch/users/l/c/lcornet/CRACOT

Use the two paths to complete the Nextflow config file:

$ mv CRACOT.config nextflow.config
$ nano -w nextflow.config (ctrl -X to quit and save)
The file should look like this:
process {
    withName:lineage {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:makecorr {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:plasmiddel {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:prodigal {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:orthofinder {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:makechim {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:hgtsim {
        container = '/scratch/ulg/GENERA/HGTsim.sif'
    }
    withName:hgtintroduce {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:publicationResults {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
}
singularity.enabled = true
singularity.cacheDir = "$PWD"
singularity.autoMounts = false
singularity.runOptions = '-B /scratch/ulg/bioec/lcornet/CRACOT -B /scratch/users/l/c/lcornet/CRACOT'
process.scratch = '/data/GENERA/'
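
The singularity.runOptions line in the config above binds both the symlinked and the resolved path of the run directory. As a convenience, the line can be generated from the current directory rather than typed by hand. This is a minimal sketch; make_run_options is a hypothetical helper name, and the output still has to be pasted into nextflow.config manually.

```shell
#!/bin/bash
# Sketch: print the singularity.runOptions line for the current directory.
# It binds the symlinked path (pwd) and the resolved path (readlink -f),
# mirroring the example nextflow.config. make_run_options is a hypothetical name.
make_run_options() {
    local link_path real_path
    link_path=$(pwd)
    real_path=$(readlink -f .)
    printf "singularity.runOptions = '-B %s -B %s'\n" "$link_path" "$real_path"
}
```

Run make_run_options from inside the CRACOT directory and replace the corresponding line of nextflow.config with its output.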

Edit the job file according to your needs:

$ nano -w CRACOT.job
The file should look like this:
    #!/bin/bash
    # Submission script for Nic5
    #SBATCH --time=5-01:00:00 # days-hh:mm:ss
    #
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=20
    #SBATCH --mem-per-cpu=12000 # megabytes
    #SBATCH --partition=bio

    export OMP_NUM_THREADS=20
    export MKL_NUM_THREADS=20

    module --ignore-cache load Nextflow/21.08.0
    nextflow run CRACOT.nf --genomes=genomes --lineage=Genomes.taxomonomy --list=positive-list.txt --cpu=20 --num=100 --taxorank=<FIELD> --redundant=4 --replacement=0 --single=0 --hgtrate=none --hgtrandom=no --redundanthgt=0 --replacementhgt=0 --singlehgt=0
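
In the job file above, the value 20 appears in --cpus-per-task, OMP_NUM_THREADS, MKL_NUM_THREADS, and --cpu, so the four must be edited together. One possible variation, shown here as a sketch rather than the recommended setup, is to derive the thread count from SLURM_CPUS_PER_TASK, the variable that SLURM sets when --cpus-per-task is given. The variable name cpus is an illustrative choice.

```shell
#!/bin/bash
# Sketch: derive the thread count once so that --cpu, OMP_NUM_THREADS and
# MKL_NUM_THREADS follow #SBATCH --cpus-per-task.
# Outside SLURM this falls back to 1, the documented CRACOT default for --cpu.
cpus=${SLURM_CPUS_PER_TASK:-1}
export OMP_NUM_THREADS=$cpus
export MKL_NUM_THREADS=$cpus

# Print the option to use on the nextflow command line of the job file.
echo "--cpu=$cpus"
```

Inside the job file, the fixed --cpu=20 would then be written as --cpu=$cpus.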

Submit your job:

$ sbatch CRACOT.job

4. Nextflow trace

CRACOT will produce a trace of the Nextflow processes, updated as the analysis runs.

executor >  local (9)
[64/e14e3b] process > lineage (1)            [100%] 1 of 1 ✔
[56/a5fbf1] process > makecorr (1)           [100%] 1 of 1 ✔
[a5/8fdd00] process > plasmiddel (1)         [100%] 1 of 1 ✔
[0a/816a1e] process > prodigal (1)           [100%] 1 of 1 ✔
[b5/4e74f8] process > orthofinder (1)        [100%] 1 of 1 ✔
[cf/7de2e2] process > makechim (1)           [100%] 1 of 1 ✔
[df/91a287] process > hgtsim (1)             [100%] 1 of 1 ✔
[98/5b0934] process > hgtintroduce (1)       [100%] 1 of 1 ✔
[d9/3293b3] process > publicationResults (1) [100%] 1 of 1 ✔
Completed at: 26-août-2022 15:42:17
Duration    : 4h 10m 34s
CPU hours   : 4.2
Succeeded   : 9

5. Output

CRACOT produces an output directory called GENERA-chimeric.

5.1. GENERA-chimeric

A directory GENERA-chimeric will be created in the current directory.

    CHIMERIC-genomes/           #Folder with chimeric genomes
    chimeric-genomes.list       #Log of CRACOT
    chimeric.idl                #idl file with genomes used
    CHIM-hgt-info/              #Log of HGTsim
    CHIM-sequences/             #Contaminating sequences
    GENOMES-USED_out-plasmid/   #Genomes used after plasmid del
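
Because CRACOT skips a genome pair when too few proteins are available (see 3.1.5), the number of chimeric genomes produced can be lower than the value given to --num. The sketch below counts the files in CHIMERIC-genomes/ for comparison. It assumes one file per chimeric genome, which this page does not confirm; count_chimeras is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: count the files in CHIMERIC-genomes/ to compare with --num.
# ASSUMPTION: each chimeric genome is written as exactly one file.
count_chimeras() {
    local outdir=${1:-GENERA-chimeric}
    # tr strips the padding that some wc implementations add
    find "$outdir/CHIMERIC-genomes" -type f | wc -l | tr -d ' '
}
```

Calling count_chimeras without arguments from the run directory inspects GENERA-chimeric in place.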

6. Cleaning

It is important to clean your directory after the run of the script.
Nextflow keeps a directory (work) that will not be used later but stores a large amount of data.

$ rm -rf work
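
A slightly more cautious variant removes work only once the output directory exists, so that an interrupted run is not cleaned by mistake. This is a sketch under the assumption that the presence of GENERA-chimeric indicates a finished run; clean_work is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: delete the Nextflow work directory only if the results are present.
# ASSUMPTION: an existing GENERA-chimeric directory means the run has finished.
clean_work() {
    if [ ! -d GENERA-chimeric ]; then
        echo "GENERA-chimeric not found; keeping work/" >&2
        return 1
    fi
    if [ -d work ]; then
        du -sh work        # report how much space is about to be freed
        rm -rf work
    fi
}
```

Run clean_work from the directory in which the pipeline was launched.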

7. Programs used

Bio-MUST-Core Version 0.212670: https://metacpan.org/pod/Bio::MUST::Core
PlasmidPicker: https://github.com/haradama/PlasmidPicker
OrthoFinder Version 2.5.4: https://github.com/davidemms/OrthoFinder
Prodigal Version 2.6.3: https://github.com/hyattpd/Prodigal
FortyTwo Version 0.212670: https://metacpan.org/dist/Bio-MUST-Apps-FortyTwo
HgtSIM: https://github.com/songweizhi/HgtSIM