The abundance of organisms is not corrected for length #127
Comments
Hi! I guess you could argue that in the case of InSilicoSeq, In the meanwhile if you want you can use the
Will give you 10x coverage of E. coli and 10x coverage of phage lambda. Thanks for starting this discussion. (Paging @Ackia for his input)
It would be good if the "coverage" option accepted a distribution such as "log-norm" directly, so that I would not have to pre-define the values myself.
I agree with @cuencam. When I simulate a dataset, I would like to know how many times a certain genome is present in the sample (the coverage). If ISS could generate a dataset automatically while reporting the coverage for each genome, that would be great!
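To make the request above concrete, here is a minimal sketch of drawing a per-genome coverage from a log-normal distribution, reporting it, and converting it into read counts. This is not InSilicoSeq's API: the function names `draw_lognormal_coverage` and `reads_for_coverage`, the `mu`/`sigma` defaults, the genome lengths, and the read length are all assumptions chosen for illustration.

```python
import random


def draw_lognormal_coverage(genomes, mu=1.0, sigma=1.0, seed=None):
    """Draw one coverage value per genome from a log-normal distribution.

    Hypothetical helper; mu and sigma are parameters of the underlying
    normal distribution, as in random.lognormvariate.
    """
    rng = random.Random(seed)
    return {g: rng.lognormvariate(mu, sigma) for g in genomes}


def reads_for_coverage(coverage, genome_length, read_length):
    """Number of reads needed to reach the requested depth of coverage.

    reads = coverage * genome_length / read_length
    """
    return {
        g: round(coverage[g] * genome_length[g] / read_length)
        for g in coverage
    }


# Illustrative genome sizes (bp) and read length, not taken from the thread.
genome_length = {"bacterium": 5_000_000, "phage": 500_000}
read_length = 150

coverage = draw_lognormal_coverage(genome_length, seed=42)
reads = reads_for_coverage(coverage, genome_length, read_length)

# Report the coverage alongside the read count for each genome.
for g in genome_length:
    print(f"{g}\tcoverage={coverage[g]:.2f}x\treads={reads[g]}")
```

Reporting the drawn coverage next to the read count is what makes it possible to compare the simulation against the depth a mapper reports afterwards.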
Hi folks! Just letting you know that I'm planning to add coverage distributions in
I should now be able to do that with Best,
Hi Hadrien,
I just noticed a small bug in your code. The number of reads that come from each contig is directly proportional to its abundance; however, this is not exactly what should be simulated.
Example:
There are 10 bacteria with a genome of 5Mb and 10 phages with a genome of 0.5Mb. This is an organism abundance of 1:1 (or 50% for both). Even though they are biologically equally abundant, the number of reads that would come out of this would be 10 * 5Mb / read_length for the bacteria and 10 * 0.5Mb / read_length for the phages.
This means that if you mapped the reads back to the organisms, you would get an average vertical coverage (also referred to as depth of coverage) of 10 for both.
In your current implementation, you are producing 50% of the reads for the bacteria and 50% of the reads for the phage. This means that if I map the reads back to the organisms, I will get a coverage of (#reads * 0.5 * read_length) / 5Mb for the bacteria and (#reads * 0.5 * read_length) / 0.5Mb for the phage. Because the phage genome is 10 times shorter, this results in a 10x higher coverage for the phage than for the bacteria, instead of equal coverage.
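The arithmetic in the example can be checked with a small sketch. It compares splitting reads by abundance alone with splitting them by abundance times genome length. The function names, the read length of 150, and the total of 1,100,000 reads are assumptions for illustration and are not taken from the InSilicoSeq code.

```python
def split_reads(abundance, genome_length, n_reads, length_corrected=True):
    """Split n_reads across genomes.

    abundance: relative organism (copy number) abundance per genome.
    genome_length: genome size in bp per genome.
    """
    if length_corrected:
        # Weight each organism by abundance * genome length, so that equal
        # abundance yields equal depth of coverage.
        weights = {g: abundance[g] * genome_length[g] for g in abundance}
    else:
        # Behaviour described in this issue: reads follow abundance only.
        weights = dict(abundance)
    total = sum(weights.values())
    return {g: n_reads * w / total for g, w in weights.items()}


def depth_of_coverage(reads, genome_length, read_length):
    """Average depth = reads * read_length / genome_length."""
    return {g: reads[g] * read_length / genome_length[g] for g in reads}


abundance = {"bacterium": 0.5, "phage": 0.5}
genome_length = {"bacterium": 5_000_000, "phage": 500_000}
read_length = 150      # assumed for the example
n_reads = 1_100_000    # assumed for the example

naive = depth_of_coverage(
    split_reads(abundance, genome_length, n_reads, length_corrected=False),
    genome_length, read_length)
corrected = depth_of_coverage(
    split_reads(abundance, genome_length, n_reads),
    genome_length, read_length)

print(naive)      # phage depth is 10x the bacterial depth
print(corrected)  # both depths are equal
```

With the length correction, the bacterium receives 10/11 of the reads and the phage 1/11, which is what gives both the same depth.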
I think that implementing this is a minor effort that would significantly improve the usability of the simulations: in the current state, if I provide an abundance table and then use a mapper to calculate abundances, the two results will not match.
Cheers
Miguel