How to generate reference genomes in JSON format #43

yazhinia · 2024-08-30T16:01:51Z

Hello developers,
Thank you for developing a nice benchmarking tool. How do I generate reference genomes in JSON format for BinBencher assessment? I couldn't work it out from the documentation.
I want to use it to assess bins generated by multi-split binning. Contigs were generated by assembling each sample separately, and binning was performed on the concatenated set, as you suggested in the VAMB paper. At the moment, I am considering sample-wise assessment, i.e., using only the genomes present in a sample and the bins obtained from that sample (split by sample after binning).
Is it applicable only to datasets where contigs were obtained through the gold standard approach, or also to contigs obtained from any (meta-)genome assembler? If the latter, how do you get mapped positions for the contigs in the bins and get genome-bin pairs? Would an aligner like bowtie2, bwa mem, or minimap2 be a recommended approach (though time-consuming)?

Thank you for your inputs.

jakobnissen · 2024-09-02T07:12:35Z

Dear @yazhinia

Multi- vs single-sample benchmarking

As you might have read in the BinBencher.jl paper, BinBencher (BB) allows you to benchmark with multiple samples, without correctness issues. However, if you still want to benchmark once per sample, that'll work well, too.

Only for gold standard approach?

BB works only when you have the actual ground truth. For assembled data, you will need to somehow learn the ground truth for the contigs. Of course, since any such learned truth will be incomplete, the benchmarking will be slightly inaccurate.
If you have the collection of genomes from which the reads have been sampled - say you simulated some reads from a known set of genomes, then assembled the reads - then I'd map the contigs to a collection of all source genomes that have been simulated with nonzero abundance using minimap2 (remember the -c flag for base-level alignment!) or similar, in order to get the contig/genome mapping positions. That's what we do. It's not even that slow.
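As a rough sketch of the mapping step described above: assuming the contigs were mapped with something like `minimap2 -c genomes.fna contigs.fna > hits.paf` (the file names are hypothetical), the resulting PAF lines can be reduced to one contig/genome position per contig. Keeping the hit with the most matching bases is an assumption made for illustration, not necessarily what the BB authors do.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    contig: str      # PAF column 1: query (contig) name
    genome_seq: str  # PAF column 6: target (genome sequence) name
    start: int       # PAF column 8: target start, 0-based inclusive
    end: int         # PAF column 9: target end, 0-based exclusive
    matches: int     # PAF column 10: number of matching bases


def best_hits(paf_lines):
    """Return, for each contig, the hit with the most matching bases.

    Picking the best hit by match count is an illustrative assumption.
    """
    best = {}
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        f = line.split("\t")
        hit = Hit(f[0], f[5], int(f[7]), int(f[8]), int(f[9]))
        if hit.contig not in best or hit.matches > best[hit.contig].matches:
            best[hit.contig] = hit
    return best


# Made-up PAF records; in practice, iterate over open("hits.paf").
example = [
    "c1\t1000\t0\t1000\t+\tgenomeA\t50000\t200\t1200\t990\t1000\t60",
    "c1\t1000\t0\t400\t-\tgenomeB\t40000\t10\t410\t300\t400\t5",
    "c2\t500\t0\t500\t+\tgenomeA\t50000\t900\t1400\t500\t500\t60",
]
hits = best_hits(example)
for hit in hits.values():
    print(hit.contig, hit.genome_seq, hit.start, hit.end)
```

Contigs with no hit at all are simply absent from the result, so they would need to be handled separately.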

If you have actual non-simulated contigs, then I'm skeptical BB will be any good. You could try phylogenetically placing your contigs with something like GTDB-tk, but I doubt the results would be particularly good, and certainly not good enough to function as the ground truth for benchmarking.

How to actually create the `Reference`?

This is unfortunately pretty tricky to do, and something I've tried to make easier. You have two options:

1. You can create the JSON manually with all the information needed for the `Reference`. Read these comments in the source code to build a `ReferenceJSON` object, then construct a `Reference` object using `Reference(my_json)`, then save it with `open(io -> BinBencherBackend.save(io, ref), "reference.json", "w")`. However, this is pretty labor-intensive and requires a lot of data wrangling in Julia.
2. I'm currently developing a command-line interface to BB, at this repo: https://github.com/jakobnissen/BinBencher.jl . I currently use commit [] for testing. This tool includes a subcommand to create a reference. I'm currently in the process of writing some documentation for it. It will be up in a few days, and then I'll ping you again. This might be your best bet.

yazhinia · 2024-09-02T08:19:35Z

Dear author,
Thank you for the information. I am working with the CAMI2 datasets, and I want to assess multi-split binning of contigs generated by a (meta-)genome assembler. Since the abundance of some genomes is zero in some samples, will BB handle this correctly?

Gold standard genomes are available, but the binned contigs were produced by an assembler
For mapping, I currently use Strobealign. Do we really need the exact range of mapping positions for each contig on its genome in order to assign the contig to a genome and subsequently get bin/genome pairs, or is merely observing an alignment between contig and genome sufficient? I think this step should be done independently of BB. How do I integrate these manually obtained mapping pairs into the BB assessment? Or is it currently not advisable to use BB for this scenario?

Creating Reference
Thank you for working on this.

jakobnissen · 2024-09-02T09:12:32Z

Dear @yazhinia,

BB handles zero-abundance genomes just fine. These will appear in the reference as normal genomes, but without any contigs assigned to them (hopefully!).
Unfortunately, you do need the exact range of the mapping positions of contigs to their genome. Or rather, you need them in order to get reasonably accurate measurements. This is because BB treats redundant- vs nonredundant contigs differently, i.e. if contig A maps to a genome position already covered by contigs B and C, then any bin containing B and C but not A will still have 100% recall.
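To illustrate why the exact positions matter, here is a minimal sketch of position-based recall. It is not BB's actual implementation: it assumes recall is the fraction of the positions covered by any contig of the genome that are also covered by the bin's contigs, and all names and coordinates are made up.

```python
def covered_length(intervals):
    """Number of positions covered by the union of half-open (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def positional_recall(bin_contigs, genome_contigs):
    """Covered positions of the bin divided by covered positions of all contigs.

    genome_contigs maps contig name -> (start, end) on one genome.
    This definition is an assumption made for illustration only.
    """
    in_bin = [pos for name, pos in genome_contigs.items() if name in bin_contigs]
    return covered_length(in_bin) / covered_length(genome_contigs.values())


# Contig A lies entirely within the region already covered by B and C.
genome_contigs = {"A": (100, 300), "B": (0, 200), "C": (200, 400)}
print(positional_recall({"B", "C"}, genome_contigs))  # A is redundant here
print(positional_recall({"A"}, genome_contigs))
```

Counting contigs instead of positions would penalize the bin containing only B and C for missing A, even though A adds no new genome positions.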
You can still use strobealign, just not with the -x flag (which disables the base-level alignment). I should also note that disabling base-level alignment with -x in strobealign not only means you don't get exact mapping positions, but also makes strobealign more prone to false positives.
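If the aligner writes SAM records with base-level alignment, the range a contig covers on the genome can be recovered from the `POS` and `CIGAR` fields. The sketch below assumes plain-text SAM input; the function name and the example record are made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume reference positions, per the SAM specification.
REF_CONSUMING = set("MDN=X")


def reference_span(sam_line):
    """Return (contig, genome_seq, start, end) with 1-based inclusive coordinates.

    Returns None for unmapped records (flag bit 0x4) or records without a CIGAR.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    if flag & 4 or cigar == "*":
        return None
    start = int(fields[3])  # POS: 1-based leftmost mapped position
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return fields[0], fields[2], start, start + ref_len - 1


# Made-up record: 50M + 2D + 40M + 10M consume 102 reference bases.
record = "c1\t0\tgenomeA\t101\t60\t5S50M2D40M3I10M\t*\t0\t0\t*\t*"
print(reference_span(record))
```

Soft clips and insertions are skipped because they consume bases of the contig but not of the genome.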

yazhinia · 2024-09-02T09:29:37Z

Dear @jakobnissen ,
Thank you for the detailed reply. Good to know about the -x flag; fortunately, I have been using Strobealign without -x so far. I still don't know how to use the mapping I obtain manually from the aligner (for example, contig id, genome id/taxid, aligned region) with BB for assessment. Maybe I will wait for the BB release version.

Thank you again.

jakobnissen · 2024-09-03T12:51:20Z

Dear @yazhinia

I've now written up a first draft of the documentation for BinBencher: https://viralinstruction.com/BinBencherBackend.jl/dev/
I'll polish it over the next few days (and add more tests!), but this should give you the information necessary to create a reference JSON file.

yazhinia · 2024-09-03T14:35:07Z

Dear @jakobnissen
Thank you for the link. If I understand correctly, to create a reference for a CAMI2 dataset I need seq_mapping.tsv (which I can get from the gsa_pooled_mapping.tsv file), contigs.fna, tax.tsv and ncbi_out.tsv.

For tax.tsv, I can use the taxonomic_profiles.tsv files provided with the CAMI2 datasets, but I am unsure how to indicate plasmids and viruses. Is it correct that the child name in this file should be the same as the name of the fna file in genome_directories? Does the order of the children in this file matter?

If these files have already been generated for the CAMI2 datasets and are accessible to others, I would benefit from using them directly for now.

jakobnissen · 2024-09-24T11:00:25Z

The child name in tax.tsv is the name of the plasmid/virus, which may be arbitrary. The order of the rows in the file does not matter.
