The abundance of organisms is not corrected for length #127

cuencam · 2019-11-06T11:52:00Z

Hi Hadrien,
I just noticed a small bug in your code. The number of reads drawn from each contig is directly proportional to the given abundance; however, this is not quite what should be simulated.

Example:

Suppose there are 10 bacteria with a genome of 5 Mb and 10 phages with a genome of 0.5 Mb. This is an organism abundance of 1:1 (or 50% for both). Even though biologically they are equally abundant, the number of reads coming from each would be 10 * 5 Mb / read_length for the bacteria and 10 * 0.5 Mb / read_length for the phages.

This means that if you mapped the reads back to the organisms, you would get an average vertical coverage (also referred to as depth of coverage) of 10 for both.

In your current implementation, you produce 50% of the reads for the bacteria and 50% of the reads for the phage. If I map the reads back to the organisms, the coverage of the bacteria will be (#reads * 0.5 * read_length) / 5 Mb and the coverage of the phage will be (#reads * 0.5 * read_length) / 0.5 Mb. This results in a 10x higher coverage for the phage than for the bacteria (the ratio of the two genome lengths), instead of equal coverage.
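To make the arithmetic concrete, here is a minimal Python sketch of the two interpretations. It is not InSilicoSeq code: the function names, the read length of 125 bp, and the total of 440,000 reads are assumptions chosen only so that the numbers come out round.

```python
# Illustrative only: genome lengths follow the example above,
# n_reads and read_length are assumed values.
GENOMES = {"bacterium": 5_000_000, "phage": 500_000}   # genome length in bp
ORGANISM_ABUNDANCE = {"bacterium": 0.5, "phage": 0.5}  # 1:1 organisms


def coverage_read_proportion(n_reads, read_length, abundance, genomes):
    """Treat abundance as the fraction of reads assigned to each genome."""
    return {
        name: n_reads * abundance[name] * read_length / genomes[name]
        for name in genomes
    }


def coverage_length_corrected(n_reads, read_length, abundance, genomes):
    """Treat abundance as the fraction of organisms.

    The share of reads per genome is then weighted by genome length.
    """
    weights = {name: abundance[name] * genomes[name] for name in genomes}
    total = sum(weights.values())
    return {
        name: n_reads * (weights[name] / total) * read_length / genomes[name]
        for name in genomes
    }


if __name__ == "__main__":
    args = (440_000, 125, ORGANISM_ABUNDANCE, GENOMES)
    print("read proportion: ", coverage_read_proportion(*args))
    print("length corrected:", coverage_length_corrected(*args))
```

With these assumed numbers, the read-proportion model gives the phage ten times the coverage of the bacterium, while the length-corrected model gives both the same coverage.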

I think that implementing this is a minor effort that would significantly improve the usability of the simulations: in the current state, if I provide an abundance table and then use a mapper to calculate abundances, the two results will not match.

Cheers

Miguel

HadrienG · 2019-11-08T07:01:49Z

Hi!

I guess you could argue that in the case of InSilicoSeq, abundance is a bit of a misnomer if you think of abundance in terms of number of organisms, but when I developed InSilicoSeq I thought in terms of abundance of read fragments. Maybe I'm in the minority here, and if other users chime in and confirm that I'm using the term wrongly, I'll consider the change. You could think of my use of abundance as "proportions of reads in a sample".

In the meantime, you can use the --coverage option instead of --abundance and --n_reads, which should be closer to what you want, i.e. a coverage.txt file formatted like the following:

NC_011750.1 10
J02459.1    10

This will give you 10x coverage of E. coli and 10x coverage of phage lambda.

Thanks for starting this discussion,
/Hadrien

(Paging @Ackia for his input)

cuencam · 2019-11-21T09:26:17Z

The option would be good if in the "coverage" I could directly say "log-norm" and not have to pre-define it myself
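For reference, pre-defining the values by hand could look roughly like the sketch below. It is only an illustration, not how InSilicoSeq draws coverages: the function names are hypothetical, the `mu` and `sigma` defaults are arbitrary, and the tab separator between genome ID and coverage is an assumption about the file format.

```python
import random


def lognormal_coverages(genome_ids, mu=1.0, sigma=1.0, seed=None):
    """Draw one log-normally distributed coverage value per genome ID."""
    rng = random.Random(seed)
    return {gid: rng.lognormvariate(mu, sigma) for gid in genome_ids}


def format_coverage_table(coverages):
    """Render one 'ID<TAB>coverage' line per genome (separator assumed)."""
    return "".join(f"{gid}\t{cov:.2f}\n" for gid, cov in coverages.items())


if __name__ == "__main__":
    # Genome IDs taken from the coverage.txt example earlier in the thread.
    ids = ["NC_011750.1", "J02459.1"]
    print(format_coverage_table(lognormal_coverages(ids, seed=42)), end="")
```

The printed table can be redirected to a file and passed to --coverage; fixing the seed keeps the drawn coverages reproducible between runs.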

jfrank87 · 2020-05-01T13:35:25Z

I agree with @cuencam. When I simulate a dataset, I would like to know how many times a certain genome is present in the sample (the coverage). If ISS could generate a dataset automatically while reporting the coverage for each genome, that would be great!

HadrienG · 2020-06-07T08:00:23Z

Hi folks! Just letting you know that I'm planning to add coverage distributions in 1.5.0, which should be out sometime in June.

HadrienG · 2020-08-13T07:07:50Z

> The option would be good if in the "coverage" I could directly say "log-norm" and not have to pre-define it myself

You should now be able to do that with 1.5.0.

Best,
Hadrien

HadrienG added the question label Nov 8, 2019

dpellow mentioned this issue May 18, 2020

edited #167

Closed

HadrienG added the enhancement label Jun 7, 2020

HadrienG self-assigned this Jun 7, 2020

HadrienG closed this as completed Aug 13, 2020

HadrienG mentioned this issue Aug 8, 2023

Experimental groups #227

Closed
