20. CRACOT
This tool simulates genomic contamination.
It produces chimeric genomes with three types of contamination events: redundant, replaced, and single.
The typical command for running the pipeline is as follows:
nextflow run CRACOT.nf --genomes=genomes --lineage=Genomes.taxomonomy --list=positive-list.txt
--cpu=60 --num=100 --taxorank=phylum --redundant=2 --replacement=2 --single=4
--hgtrate=none --hgtrandom=no --redundanthgt=0 --replacementhgt=0 --singlehgt=0
Mandatory arguments:
--genomes Specify directory with genomes
--lineage Specify lineage file path
--list Specify positive list file path
Optional arguments:
--taxorank Specify taxonomic rank, phylum or order or family or genus or species, default=phylum
--num Specify number of chimeric genomes, default=100
--maskslave Used only slave genome one time, default = no
--redundant Specify number of redundant events, default=5
--replacement Specify number of replacement events, default=5
--single Specify number of single events, default=5
--hgtrate Specify the mutation rate (1 to 99%) of HGT events, default = none
--redundanthgt Specify number of hgt redundant events, default=0
--replacementhgt Specify number of hgt replacement events, default=0
--singlehgt Specify number of hgt single events, default=0
--merge Merge redundant and single to the last contig of the chimeric genome, yes or no, default = yes
--hgtrandom Activate the random insertion of HGT events, yes or no, default = no, Recommended with SINGLE hgt events only.
--cpu number of cpus to use, default = 1
The user should specify three mandatory fields: a directory with genomes, a lineage file for these genomes (as produced by Genome-downloader.nf), and a positive list of genomes to use.
Optionally, the user can specify the taxonomic rank (from phylum to species), the number of chimeric genomes to produce, the percentage of events of each type, and the mutation rate (from 0 to 25 %) for HGT events.
$ cp /scratch/ulg/GENERA/Nextflow-scripts/CRACOT.nf .
$ module --ignore-cache load Nextflow/21.08.0
$ nextflow run CRACOT.nf --genomes=genomes --lineage=Genomes.taxomonomy --list=positive-list.txt
The genomes directory should contain genomes in fasta with the extension .fna.
The lineage file should contain the taxonomy of all the genomes present in the positive list.
GCF_000003215.1 Clostridioides difficile_GCF_000003215.1 GCF_000003215.1 Firmicutes; Clostridia; Eubacteriales; Peptostreptococcaceae; Clostridioides; Clostridioides difficile
GCF_000007085.1 Caldanaerobacter subterraneus_GCF_000007085.1 GCF_000007085.1 Firmicutes; Clostridia; Thermoanaerobacterales; Thermoanaerobacteraceae; Caldanaerobacter; Caldanaerobacter subterraneus
GCF_000007625.1 Clostridium tetani_GCF_000007625.1 GCF_000007625.1 Firmicutes; Clostridia; Eubacteriales; Clostridiaceae; Clostridium; Clostridium tetani
This file is automatically produced by Genome-downloader.nf.
This file is a positive list of genomes to use.
The genomes folder can thus have more genomes than the positive list.
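Because the three inputs must agree with each other, it can help to check them before launching the pipeline. The function below is a minimal sketch, not part of CRACOT. It assumes that the positive list holds one accession per line, that genome files are named `<accession>.fna`, and that the accession is the first column of the lineage file; `check_inputs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Report positive-list entries that lack a genome file or a lineage record.
# Assumptions (not confirmed by CRACOT): one accession per line in the list,
# genome files named <accession>.fna, accession in column 1 of the lineage file.
check_inputs() {
  local genomes_dir="$1" lineage="$2" list="$3" id status=0
  while IFS= read -r id || [ -n "$id" ]; do
    [ -z "$id" ] && continue                     # skip blank lines
    if [ ! -f "$genomes_dir/$id.fna" ]; then
      echo "missing genome: $id"
      status=1
    fi
    # awk exits with 0 only if some line has this accession in column 1
    if ! awk -v id="$id" '$1 == id { found = 1 } END { exit !found }' "$lineage"; then
      echo "missing lineage: $id"
      status=1
    fi
  done < "$list"
  return "$status"
}

# Example call with the file names used in this README:
# check_inputs genomes Genomes.taxomonomy positive-list.txt
```

The function returns a non-zero status when at least one entry is incomplete, so it can be used as a guard in a job script.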
Specify the taxonomic rank to use.
For instance, the phylum rank (e.g., Firmicutes) means that the two genomes belong to the same phylum but not to the same class.
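To judge whether a given rank offers enough genomes to pair, one can tally the lineage file per taxon. This is a rough sketch under stated assumptions: the lineage file is tab-separated, the lineage string is the fourth column, and its entries follow the order phylum, class, order, family, genus, species separated by "; ", as in the sample lines above. `count_by_rank` is a hypothetical helper, not a CRACOT command.

```shell
#!/usr/bin/env bash
# Tally genomes per taxon at a given position of the lineage string.
# $1 = lineage file (assumed tab-separated, lineage in column 4)
# $2 = rank index: 1=phylum 2=class 3=order 4=family 5=genus 6=species
count_by_rank() {
  awk -F '\t' -v r="$2" '
    { split($4, taxa, /; */); n[taxa[r]]++ }          # split lineage on "; "
    END { for (t in n) printf "%s\t%d\n", t, n[t] }   # taxon <TAB> count
  ' "$1" | sort
}

# Example: number of genomes per order
# count_by_rank Genomes.taxomonomy 3
```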
Specify the number of chimeric genomes to create.
The creation of a chimeric genome is based on the orthology and on the number of common and single genes.
If this number is not sufficient to create the genome, CRACOT will not output the genome but a message like this:
GCF_013296445.1 Clostridia- Eubacteriales GCF_900166995.1 Clostridia- Thermoanaerobacterales The number of available proteins in Sub genome are not sufficient: 215 proteins while 216 are needed for redundancy and/or replacement - 618 proteins while 0 are needed for single.
By default, a slave genome (the contaminant) is used one time.
The user can set this option to yes to reuse the slave genome multiple times, and reduce the computation time.
Percentage of redundant events.
Percentage of replaced events.
Percentage of single events.
Percentage of mutation for HGT simulation, from 0 to 25.
Percentage of HGT redundant events.
Percentage of HGT replaced events.
Percentage of HGT single events.
When this option is set to yes, the redundant and single events are merged into the last contig.
If set to no, a new contig is created for these two events.
If activated, the HGT are inserted randomly in the main genome.
Specify the number of CPU to use.
Create a new directory:
$ mkdir CRACOT
$ cd CRACOT
Copy the CRACOT files from the shared directory:
$ cp /scratch/ulg/GENERA/Nextflow-scripts/CRACOT.* .
Get the path of your current directory, both as the symlink and as the resolved original path:
$ pwd # this command will produce a path like: /scratch/ulg/bioec/lcornet/ortho
$ readlink -f . # this command will produce a path like: /scratch/users/l/c/lcornet/ortho
Use the two paths to complete the Nextflow config file:
$ mv CRACOT.config nextflow.config
$ nano -w nextflow.config (ctrl -X to quit and save)
The file should look like this:
process {
    withName:lineage {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:makecorr {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:plasmiddel {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:prodigal {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:orthofinder {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:makechim {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:hgtsim {
        container = '/scratch/ulg/GENERA/HGTsim.sif'
    }
    withName:hgtintroduce {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
    withName:publicationResults {
        container = '/scratch/ulg/GENERA/contams.sif'
    }
}
singularity.enabled = true
singularity.cacheDir = "$PWD"
singularity.autoMounts = false
singularity.runOptions = '-B /scratch/ulg/bioec/lcornet/CRACOT -B /scratch/users/l/c/lcornet/CRACOT'
process.scratch = '/data/GENERA/'
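The `singularity.runOptions` line binds both the symlinked and the resolved path of the working directory. As a convenience, the line can be generated rather than typed by hand. This is only a sketch: `make_run_options` is a hypothetical helper, and it assumes the command is run from inside the CRACOT working directory.

```shell
#!/usr/bin/env bash
# Print a singularity.runOptions line for the current directory.
make_run_options() {
  link_path="$(pwd)"             # path as seen through the symlink
  real_path="$(readlink -f .)"   # fully resolved path
  printf "singularity.runOptions = '-B %s -B %s'\n" "$link_path" "$real_path"
}

make_run_options
```

The printed line can then be pasted into nextflow.config in place of the example paths.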
Edit the job file according to your needs:
$ nano -w CRACOT.job
The file should look like this:
#!/bin/bash
# Submission script for Nic5
#SBATCH --time=5-01:00:00 # days-hh:mm:ss
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=12000 # megabytes
#SBATCH --partition=bio
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
module --ignore-cache load Nextflow/21.08.0
nextflow run CRACOT.nf --genomes=genomes --lineage=Genomes.taxomonomy --list=positive-list.txt --cpu=20 --num=100 --taxorank=<FIELD> --redundant=4 --replacement=0 --single=0 --hgtrate=none --hgtrandom=no --redundanthgt=0 --replacementhgt=0 --singlehgt=0
Submit your job:
$ sbatch CRACOT.job
CRACOT will produce a trace of the Nextflow processes, updated while the analysis runs.
executor > local (9)
[64/e14e3b] process > lineage (1) [100%] 1 of 1 ✔
[56/a5fbf1] process > makecorr (1) [100%] 1 of 1 ✔
[a5/8fdd00] process > plasmiddel (1) [100%] 1 of 1 ✔
[0a/816a1e] process > prodigal (1) [100%] 1 of 1 ✔
[b5/4e74f8] process > orthofinder (1) [100%] 1 of 1 ✔
[cf/7de2e2] process > makechim (1) [100%] 1 of 1 ✔
[df/91a287] process > hgtsim (1) [100%] 1 of 1 ✔
[98/5b0934] process > hgtintroduce (1) [100%] 1 of 1 ✔
[d9/3293b3] process > publicationResults (1) [100%] 1 of 1 ✔
Completed at: 26-août-2022 15:42:17
Duration : 4h 10m 34s
CPU hours : 4.2
Succeeded : 9
CRACOT produces an output directory called GENERA-chimeric.
A directory GENERA-chimeric will be created in the current directory.
CHIMERIC-genomes/ #Folder with chimeric genomes
chimeric-genomes.list #Log of CRACOT
chimeric.idl #idl file with genomes used
CHIM-hgt-info/ #Log of HGTsim
CHIM-sequences/ #Contaminating sequences
GENOMES-USED_out-plasmid/ #Genomes used after plasmid del
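Since CRACOT skips a chimeric genome when too few proteins are available, the number of genomes produced can be lower than the value given to --num. A quick count is sketched below. It assumes that CHIMERIC-genomes/ sits inside GENERA-chimeric and holds exactly one file per chimeric genome, which this README does not confirm; `count_chimeras` and `report_shortfall` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Count regular files directly inside a directory.
count_chimeras() {
  find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Compare the produced count with the requested one.
# $1 = value passed to --num, $2 = directory holding the chimeric genomes
report_shortfall() {
  requested="$1"
  produced="$(count_chimeras "$2")"
  echo "$produced of $requested chimeric genomes produced"
  [ "$produced" -ge "$requested" ]   # non-zero status if some are missing
}

# Example:
# report_shortfall 100 GENERA-chimeric/CHIMERIC-genomes
```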
It is important to clean your directory after running the script.
Nextflow keeps a directory (work) that will not be used later but stores a large amount of data.
$ rm -rf work
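A slightly more cautious variant removes work only once the results directory exists, and prints the size being freed. This is a sketch rather than part of the pipeline: `clean_work` is a hypothetical helper, and the default directory name GENERA-chimeric is taken from the output description above.

```shell
#!/usr/bin/env bash
# Remove the Nextflow work directory only if the results are in place.
# $1 = results directory (defaults to GENERA-chimeric)
clean_work() {
  results_dir="${1:-GENERA-chimeric}"
  if [ ! -d "$results_dir" ]; then
    echo "results directory '$results_dir' not found; keeping work/" >&2
    return 1
  fi
  [ -d work ] || return 0   # nothing to clean
  du -sh work               # show how much space is about to be freed
  rm -rf work
}

# Example, run from the directory where nextflow was launched:
# clean_work
```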
Bio-MUST-Core Version 0.212670: https://metacpan.org/pod/Bio::MUST::Core
PlasmidPicker: https://github.com/haradama/PlasmidPicker
OrthoFinder Version 2.5.4: https://github.com/davidemms/OrthoFinder
Prodigal Version 2.6.3: https://github.com/hyattpd/Prodigal
FortyTwo Version 0.212670: https://metacpan.org/dist/Bio-MUST-Apps-FortyTwo
HgtSIM: https://github.com/songweizhi/HgtSIM