Creating bins from reads
Data: the CAMI challenge datasets (https://edwards.sdsu.edu/research/cami-challenge-datasets/), specifically the "High Complexity" fastq files for all 5 samples.
Reads that come back from sequencing runs are not perfect, so they typically require some cleanup so that errors do not propagate into downstream data products.
JGI provides BBTools (https://jgi.doe.gov/data-and-tools/bbtools/), a suite of scripts that handle many of the tasks you would want to perform on read files. Read QC is typically performed with one of two tools, bbduk or Trimmomatic. Both work well and process reads fairly quickly, but bbduk tends to be faster, so that is what we will use today.
Installation through conda (other options available through the above links). Version 38.51 is used here, but newer versions are now available as well.
conda create --name=bbmap-38.51
conda activate bbmap-38.51
conda install -c bioconda bbmap=38.51
Run bbduk on the read sets for each sample separately. For many samples, you would likely want to write a for loop in bash (or your favorite language) to execute these commands; a sketch of such a loop is shown after the per-sample commands below.
bbduk.sh in=RH_S001__insert_270.fq out1=RH_S001__insert_270.clean.1.fq.gz out2=RH_S001__insert_270.clean.2.fq.gz minlength=51 qtrim=rl maq=10 maxns=0 trimq=10 ref=/bbmap_38.51/resources/adapters.fa ow=t
bbduk.sh in=RH_S002__insert_270.fq out1=RH_S002__insert_270.clean.1.fq.gz out2=RH_S002__insert_270.clean.2.fq.gz minlength=51 qtrim=rl maq=10 maxns=0 trimq=10 ref=/bbmap_38.51/resources/adapters.fa ow=t
bbduk.sh in=RH_S003__insert_270.fq out1=RH_S003__insert_270.clean.1.fq.gz out2=RH_S003__insert_270.clean.2.fq.gz minlength=51 qtrim=rl maq=10 maxns=0 trimq=10 ref=/bbmap_38.51/resources/adapters.fa ow=t
bbduk.sh in=RH_S004__insert_270.fq out1=RH_S004__insert_270.clean.1.fq.gz out2=RH_S004__insert_270.clean.2.fq.gz minlength=51 qtrim=rl maq=10 maxns=0 trimq=10 ref=/bbmap_38.51/resources/adapters.fa ow=t
bbduk.sh in=RH_S005__insert_270.fq out1=RH_S005__insert_270.clean.1.fq.gz out2=RH_S005__insert_270.clean.2.fq.gz minlength=51 qtrim=rl maq=10 maxns=0 trimq=10 ref=/bbmap_38.51/resources/adapters.fa ow=t
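As a rough sketch of the loop mentioned above -- assuming the interleaved read files follow the RH_S00N__insert_270.fq naming used here and the same bbmap adapters path -- something like this would run bbduk over all five samples:
# hypothetical loop; adjust the glob and paths to your own files
for fq in RH_S00*__insert_270.fq ; do
  base=${fq%.fq} ;
  bbduk.sh in=$fq out1=${base}.clean.1.fq.gz out2=${base}.clean.2.fq.gz \
    minlength=51 qtrim=rl maq=10 maxns=0 trimq=10 \
    ref=/bbmap_38.51/resources/adapters.fa ow=t ;
done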
Notice that this command can take interleaved reads (forward and reverse contained in one file) and output in paired format (forward and reverse in separate files). Reads will often already be in paired format, and the same command structure still applies (an example of that form follows the stats output below). This one command packages both read quality control and trimming, and the associated stats on read removal are shown below.
Forcing interleaved input because paired output was specified for a single input file.
0.036 seconds.
Initial:
Memory: max=51693m, total=51693m, free=50614m, used=1079m
Added 2970 kmers; time: 0.053 seconds.
Memory: max=51693m, total=51693m, free=48996m, used=2697m
Input is being processed as paired
Started output streams: 0.358 seconds.
Processing time: 206.542 seconds.
Input: 99811870 reads 14971780500 bases.
Contaminants: 0 reads (0.00%) 0 bases (0.00%)
QTrimmed: 45800763 reads (45.89%) 306744010 bases (2.05%)
Low quality discards: 85516 reads (0.09%) 12412925 bases (0.08%)
Total Removed: 86038 reads (0.09%) 319156935 bases (2.13%)
Result: 99725832 reads (99.91%) 14652623565 bases (97.87%)
Time: 206.955 seconds.
Reads Processed: 99811k 482.29k reads/sec
Bases Processed: 14971m 72.34m bases/sec
As you can see, most reads are retained, with bases removed mainly by quality trimming. Removing obvious errors and very low-quality sequences is good practice, but overly aggressive QC can change your resulting assemblies and MAGs.
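As noted above, the same options apply when the forward and reverse reads are already split into separate files; a sketch with hypothetical file names:
# hypothetical split-file input; parameters match the interleaved commands above
bbduk.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz \
  out1=sample.clean.1.fq.gz out2=sample.clean.2.fq.gz \
  minlength=51 qtrim=rl maq=10 maxns=0 trimq=10 \
  ref=/bbmap_38.51/resources/adapters.fa ow=t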
FastQC https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Installation
conda install -c bioconda fastqc
Running fastqc on just one sample's forward cleaned reads.
fastqc RH_S001__insert_270.clean.1.fq.gz
Command line output:
Started analysis of RH_S001__insert_270.clean.1.fq.gz
Approx 5% complete for RH_S001__insert_270.clean.1.fq.gz
...
Approx 95% complete for RH_S001__insert_270.clean.1.fq.gz
Analysis complete for RH_S001__insert_270.clean.1.fq.gz
Output: an HTML file with quality information. Generally good as a sanity check on your data in case there are any obvious issues.
Note: Also check out MultiQC for generating summary reports of many samples at once.
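For example, assuming the FastQC reports are in the current directory, MultiQC aggregates them into a single report (multiqc_report.html by default):
# run from the directory containing the FastQC outputs
multiqc .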
As of 2021, two of the more popular assemblers are metaSPAdes and MEGAHIT. Either tool should suit your needs. Generally speaking, metaSPAdes has shown great assembly performance, but it is also VERY compute intensive, meaning you will likely need a large amount of memory, cores/threads, and CPU hours. MEGAHIT, on the other hand, performs well while being less compute intensive.
For this workflow, I have chosen MEGAHIT due to the compute requirements of metaSPAdes.
Installation
conda create --name=megahit-1.2.9
conda activate megahit-1.2.9
conda install -c bioconda megahit
Assembly (one example sample)
megahit -1 RH_S001__insert_270.clean.1.fq.gz -2 RH_S001__insert_270.clean.2.fq.gz -o RH_S001__insert_270_megahit_out -t 20
Output
checkpoints.txt
done
final.contigs.fa
intermediate_contigs
log
options.json
Where final.contigs.fa is ultimately what we care most about: the final assembly fasta file output by MEGAHIT.
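A quick sanity check on the assembly -- for example, counting the number of contigs -- might look like this (any fasta-aware tool works just as well):
# count fasta records (contigs) in the final assembly
grep -c ">" RH_S001__insert_270_megahit_out/final.contigs.fa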
Binning generally relies on two things:
- Abundances of contigs across samples
- Tetranucleotide frequencies/genomic features
Requirements for binning based on the above: Assembly (Contigs/Scaffolds) and BAM file(s)
For each assembly you have, you want to map as many related samples as possible against that assembly, to maximize the amount of information available for resolving the clustering. For this, we do read mapping to determine the coverage of each contig across samples. Read mapping generates BAM files (compressed SAM files) that record how reads map to the various contigs. CoverM is a wrapper that does the read mapping and some filtering for us; it leverages minimap2-sr for the actual read mapping.
Installation
conda create --name=coverm-0.6.1
conda activate coverm-0.6.1
conda install -c bioconda coverm
Read map every sample against the 1st sample assembly
coverm contig -r RH_S001__insert_270_megahit_out/final.contigs.fa -t 20 \
--bam-file-cache-directory bam_files \
-c RH_S001__insert_270.clean.1.fq.gz RH_S001__insert_270.clean.2.fq.gz \
RH_S002__insert_270.clean.1.fq.gz RH_S002__insert_270.clean.2.fq.gz \
RH_S003__insert_270.clean.1.fq.gz RH_S003__insert_270.clean.2.fq.gz \
RH_S004__insert_270.clean.1.fq.gz RH_S004__insert_270.clean.2.fq.gz \
RH_S005__insert_270.clean.1.fq.gz RH_S005__insert_270.clean.2.fq.gz
This will generate five bam files, one for each set of paired reads that was mapped:
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S001__insert_270.clean.1.fq.gz.bam
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S002__insert_270.clean.1.fq.gz.bam
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S003__insert_270.clean.1.fq.gz.bam
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S004__insert_270.clean.1.fq.gz.bam
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S005__insert_270.clean.1.fq.gz.bam
After generating the bam files, create sorted and indexed bam files using samtools, as input for other programs that require this formatting.
Sort the bam files
cd bam_files
for bam in *.bam ; do
samtools sort -@ 20 -o ${bam::-4}.sorted.bam $bam ;
done
Index the bam files
for bam in *.sorted.bam ; do
samtools index $bam;
done
cd ..
Metabat2 https://bitbucket.org/berkeleylab/metabat/src/master/
Installation
conda create --name=metabat2
conda activate metabat2
conda install -c bioconda/label/cf201901 metabat2
Generate metabat coverage
coverm contig -b bam_files/*.sorted.bam -m metabat -t 20 > metabat_coverage.tsv
Run metabat2
metabat2 -i ../RH_S001__insert_270_megahit_out/final.contigs.fa -a metabat_coverage.tsv -o metabat2 -t 20
Maxbin2 https://sourceforge.net/projects/maxbin2/
Installation
conda create --name=maxbin-2.2.7
conda activate maxbin-2.2.7
conda install -c bioconda maxbin2
Leverage coverm to calculate trimmed mean coverage estimates (coverage metrics other than trimmed_mean can also be used)
cd bam_files
for bam in *.sorted.bam ; do
coverm contig -t 20 -m trimmed_mean -b $bam > ../${bam::-4}.trimmed_mean.tsv ;
done
Create a file called abund_list.txt listing the path to each tsv generated above; if the tsv files are in the current directory, it would look like this (a one-liner to generate it is shown after the example contents):
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S001__insert_270.clean.1.fq.gz.trimmed_mean.tsv
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S004__insert_270.clean.1.fq.gz.trimmed_mean.tsv
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S002__insert_270.clean.1.fq.gz.trimmed_mean.tsv
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S005__insert_270.clean.1.fq.gz.trimmed_mean.tsv
RH_S001__insert_270_megahit_out.final.contigs.fna.RH_S003__insert_270.clean.1.fq.gz.trimmed_mean.tsv
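A simple way to build this file, assuming the trimmed_mean tsv files sit in the current working directory:
# write one tsv path per line into abund_list.txt
ls *.trimmed_mean.tsv > abund_list.txt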
Run the Maxbin2 algorithm with a minimum contig length of 2500 bp
mkdir maxbin2
run_MaxBin.pl -contig ../RH_S001__insert_270_megahit_out/final.contigs.fa -abund_list abund_list.txt -out maxbin2/maxbin2 -min_contig_length 2500 -thread 20
CONCOCT https://concoct.readthedocs.io/en/latest/
CONCOCT requires that you first cut up your contigs into chunks before coverage estimation
cut_up_fasta.py ../RH_S001__insert_270_megahit_out/final.contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa
Generate CONCOCT coverage table based on the sorted bam files generated earlier
concoct_coverage_table.py contigs_10K.bed bam_files/*.sorted.bam > concoct_coverage_table.tsv
Run the CONCOCT algorithm, then merge the clustering of the cut-up contigs back onto the original contigs from step 1
concoct -l 2500 --threads 20 --composition_file contigs_10K.fa --coverage_file concoct_coverage_table.tsv -b concoct/
merge_cutup_clustering.py concoct/clustering_gt2500.csv > concoct/clustering_merged.csv
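The MetaWRAP step later in this workflow expects per-bin fasta files (concoct/fasta_bins/). In the standard CONCOCT workflow these are written with extract_fasta_bins.py; assuming the directory layout used here, that would look roughly like:
# write one fasta file per CONCOCT bin into concoct/fasta_bins/
mkdir concoct/fasta_bins
extract_fasta_bins.py ../RH_S001__insert_270_megahit_out/final.contigs.fa \
  concoct/clustering_merged.csv --output_path concoct/fasta_bins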
Ensemble binning takes the output bins from the standalone "initial binners" and uses the information across bin sets to build a higher-quality consolidated set of bins. DAS Tool and MetaWRAP are the two most common options at the moment. For this tutorial we will be using the MetaWRAP bin_refinement module.
MetaWRAP https://github.com/bxlab/metaWRAP
For the ensemble binning, we are using the metawrap bin_refinement module. This module accepts input from up to three different binners, so we are using the bins generated from the "Initial Binners" above -- metabat2, maxbin2, concoct -- as input to the ensemble binning module from metawrap.
Installation
conda create --name=metawrap-1.3.2
conda activate metawrap-1.3.2
conda install -y -c ursky metawrap-mg=1.3.2
Run metawrap bin_refinement using the output from the three "Initial binners"
metawrap bin_refinement -A metabat2/ -B maxbin2/fasta_bins/ \
-C concoct/fasta_bins/ -t 20 -o metawrap_bin_refinement/
Result: By default, metawrap keeps bins that are >70% complete and <10% contaminated based on CheckM. CheckM uses lineage-specific single-copy marker genes to estimate the completeness and contamination of a genome. Roughly speaking, if a genome bin contains more than one copy of a gene that is expected to be single-copy, that adds to the contamination, because it was only expected once. Completeness is then determined by how many of the expected single-copy genes the genome bin actually contains versus how many there "should" be for that lineage.
Automated Bin QC -- RefineM
RefineM removes potential contaminants and outliers based on tetranucleotide frequencies and abundances. It also has several other functions, such as removing contigs that do not match taxonomically, but those will not be used here.
Install RefineM
conda create --name=refinem-0.1.2
conda activate refinem-0.1.2
conda install -c bioconda refinem
Generate the scaffold/contig stats for the contigs in the bins generated (i.e. coverage profiles and tetranucleotide frequencies)
refinem scaffold_stats -x fa -c 20 \
../RH_S001__insert_270_megahit_out/final.contigs.fa \
metawrap_bin_refinement/metawrap_70_10_bins/ \
scaffold_stats bam_files/*.sorted.bam
Determine contigs that are outliers in the dataset
refinem outliers scaffold_stats/scaffold_stats.tsv scaffold_stats_outliers
Filter out outliers from bins
refinem filter_bins -x fa \
metawrap_bin_refinement/metawrap_70_10_bins/ \
scaffold_stats_outliers/outliers.tsv \
metawrap_bin_refinement/metawrap_70_10_bins_refinem_filtered/