Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The abundance of organisms is not corrected for length #127

Closed
cuencam opened this issue Nov 6, 2019 · 5 comments
Closed

The abundance of organisms is not corrected for length #127

cuencam opened this issue Nov 6, 2019 · 5 comments

Comments

@cuencam
Copy link

cuencam commented Nov 6, 2019

Hi Hadrien,
I just noticed a small bug in your code. The number of reads that come from each contig is a direct proportion of the estimated abundance, however, this is not what you should exactly simulate.

Example:

There are 10 bacteria with a genome of 5Mb and 10 phages with a genome of 0.5Mb. This is an organism abundance of 1:1 (or 50% for both). Even when biologically they are in the same abundance, the number of reads that would come out of this would be 10 * 5Mb / read_length and 10*0.5 / read length.

This means that if you would map the reads back to the organisms you would get an average vertical coverage (Also referred to as depth of coverage) of 10 for both.

In your current implementation, you are producing 50% of the reads for the bacteria and 50% of the reads for the phage. this would make that if I map back the reads to the organisms I will get coverage for the bacteria as (#reads * 0.5read_length)/ 5Mb and the phage will be (#reads * 0.5read_length)/ 0.5Mb. This will result in a 100X higher coverage for the phage and reduced coverage for the bacteria.

I think that the implementation of this is a minor effort that will significantly improve the usability of the simulations, since in the current state if I provide an abundance table and then use a mapper to calculate abundances these two results will not match

Cheers

Miguel

@HadrienG
Copy link
Owner

HadrienG commented Nov 8, 2019

Hi!

I guess you could argue that in the case of InSilicoSeq, abundance is a bit of a misnomer if you think of abundance in terms of number of organisms, but when I developed InSilicoSeq I though in terms of abundance of read fragments. Maybe I'm in the minority here and if other users chime in and confirm that I'm using the term very wrongly I'll consider the change. I guess you could think of my use of abundance as "proportions of reads in a sample".

In the meanwhile if you want you can use the --coverage option instead of --abundance and --n_reads, which should be closer to what you want, i.e a coverage.txt file formatted like the following:

NC_011750.1 10
J02459.1    10

Will give you 10x coverage of E.coli and 10x coverage of phage lambda.

Thanks for starting this discussion,
/Hadrien

(Paging @Ackia for his input)

@cuencam
Copy link
Author

cuencam commented Nov 21, 2019

The option would be good if in the "coverage" I could directly say "log-norm" and not have to pre-define it myself

@jfrank87
Copy link

jfrank87 commented May 1, 2020

I agree with @cuencam. When I simulate a dataset, I would like to know, "how many times" a certain genome is present in the sample (the coverage). If ISS could generate a dataset automatically while reporting the coverage for each genome, that would be great!

@dpellow dpellow mentioned this issue May 18, 2020
@HadrienG
Copy link
Owner

HadrienG commented Jun 7, 2020

Hi folks! Just letting you know that I'm planning to add coverage distributions in 1.5.0, which should be out sometimes in June.

@HadrienG HadrienG self-assigned this Jun 7, 2020
@HadrienG
Copy link
Owner

The option would be good if in the "coverage" I could directly say "log-norm" and not have to pre-define it myself

I should now able to do that with 1.5.0

Best,
Hadrien

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants