First thoughts #1

Open
TomSmithCGAT opened this issue Feb 18, 2015 · 0 comments
@TomSmithCGAT (Collaborator)

Possible improvements:

  1. Add support for read pairs (from your outline - I completely agree)
    • How quick is it to retrieve a single read from the BAM file? How does this compare to reading in pairs of reads and adding another key, "read2", alongside "read" and "counts" in the reads_dict? This would require a name-sorted BAM, though (see the pysam sketch after this list).
  2. Optimise and port to Cython (from your outline - I'm not yet clear how much scope there is for improvement here)
    • How much compute does it currently take per 1 GB of BAM?
    • Should we profile the script to find out where it spends the most time and identify potential bottlenecks? (see the cProfile sketch after this list)
    • Could we parallelise at the level of bundles? (I've never done parallel programming, so I'm not sure about the feasibility/sensibility of this, but as far as I can see each bundle could be dealt with entirely separately. There would be an issue with writing the output in the correct order.)
  3. Include the ability to take stats (from your outline - I agree it would be nice to have some good summary stats/plots)
    • Distribution of reads per UMI per sample pre- and post-dedup (see S4C, Jaitin et al)
    • Distribution of pairwise edit distances between all UMIs in a sample pre- and post-dedup (see the sketch after this list)
    • Distribution of cluster size before and after dedup (presumably we need to re-cluster after dedup to identify those UMIs which still cluster but are inferred to represent truly unique molecules)
    • Add in a null expectation for the cluster size distribution based on the number of UMIs post-dedup
  4. Implement a sequencing-error-based approach to collapsing UMI clusters (I've not given this much thought, so it may not be a sensible approach)
    • Retain sequence qualities for all UMI bases
    • Take the counts for the second most abundant UMI (c1), the total counts across the UMI cluster (cT) and the sequencing error rates across the UMI (e1, e2, ... eN), and compute the probability of obtaining c1 or more reads of a UMI at this edit distance given the error rates
      • Would a binomial probability suffice? (see the sketch after this list)
      • If so, the probability of a specific UMI arising through errors would simply be the product of the per-base probabilities required to generate the edit distance from the most abundant UMI. When the binomial probability fell below a threshold, we could then reject the null hypothesis that the two UMIs are the same except for sequencing errors.
      • If the null hypothesis is not rejected, continue on to the third most abundant UMI, and so on
      • When the null hypothesis is rejected, the link between these two UMIs in the adjacency dictionary would need to be removed and the process continued down both branches. For UMIs which are adjacent to two parents which are split apart, the parent with the largest binomial probability wins out.
    • Multiple testing correction is an issue. How many independent tests are there? We would be performing multiple tests per cluster, so I guess maybe correct for the number of clusters?
  5. Add further options for cluster collapsing methods (this could be useful for comparing methods in the publication):
    • Remove all UMIs with fewer than 1/100 of the average number of reads across non-zero UMIs (as per S. Islam et al (2014))
    • Select the most abundant UMI in the cluster regardless of cluster topology
    • Select the most abundant UMI as parent and collapse all UMIs within edit distance into it as a single UMI. Then select the most abundant remaining UMI as parent and continue the process with the remaining UMIs. Repeat until all UMIs are accounted for (see the greedy-collapse sketch after this list)
  6. What about deletions?
    • How high is the rate of deletions in the UMI according to your data?
      • Define deletions as consecutive genomic positions where the 3' UMI is one edit distance away from the 5' UMI and has the same 3' base as the 5' base of the genomic sequence following the 5' UMI (see the sketch after this list).
    • If the rate is low, ignore it. Otherwise, can we add this in somehow?
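
To make point 1 concrete, here's a rough pysam sketch of pulling reads back out in pairs, assuming a name-sorted BAM. The function name and file path are placeholders, not anything in the current script:

```python
import pysam

def iter_read_pairs(bam_path):
    """Yield (read1, read2) tuples from a name-sorted BAM."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        prev = None
        for read in bam:
            # skip secondary/supplementary alignments so names pair cleanly
            if read.is_secondary or read.is_supplementary:
                continue
            if prev is not None and read.query_name == prev.query_name:
                yield (prev, read) if prev.is_read1 else (read, prev)
                prev = None
            else:
                prev = read

# e.g. populate reads_dict with "read", "read2" and "counts" per pair
for read1, read2 in iter_read_pairs("example.bam"):
    pass
```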
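For the profiling question in point 2, the standard-library cProfile would probably be enough to find the hotspots; main() below is just a stand-in for whatever entry point the script actually has:

```python
# equivalent from the shell: python -m cProfile -o dedup.prof <script>.py ...
import cProfile
import pstats

def main():
    pass  # stand-in for the script's real entry point

cProfile.run("main()", "dedup.prof")
pstats.Stats("dedup.prof").sort_stats("cumulative").print_stats(20)  # top 20
```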
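The pairwise edit-distance distribution in point 3 is cheap to compute; since all UMIs at a position have the same length, Hamming distance stands in for edit distance in this sketch:

```python
from collections import Counter
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def edit_distance_distribution(umis):
    """Counts of pairwise edit distances across all UMI pairs."""
    return Counter(hamming(a, b) for a, b in combinations(umis, 2))

print(edit_distance_distribution(["ATGC", "ATGG", "TTGG", "CCAA"]))
# Counter({4: 3, 1: 2, 2: 1})
```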
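And a hedged sketch of the binomial test in point 4. It assumes a substitution error is equally likely to produce any of the three alternative bases (hence e/3 per mismatched position); the function name and inputs are illustrative, not a settled design:

```python
from scipy.stats import binom

def error_umi_pvalue(c1, cT, error_rates, mismatch_positions):
    """P(>= c1 reads of a given UMI among cT, given per-base error rates).

    c1: counts of the second most abundant UMI
    cT: total counts across the UMI cluster
    error_rates: per-base error probabilities (e1 ... eN) of the parent UMI
    mismatch_positions: positions where the two UMIs differ
    """
    # probability that errors generate this exact UMI from the parent:
    # product over mismatched bases of (error rate / 3 possible substitutions)
    p = 1.0
    for i in mismatch_positions:
        p *= error_rates[i] / 3.0
    # one-sided binomial tail: chance of c1 or more such reads out of cT
    return binom.sf(c1 - 1, cT, p)

# reject H0 ("same molecule, differing only by sequencing error") if small
print(error_umi_pvalue(c1=5, cT=100, error_rates=[0.01] * 4,
                       mismatch_positions=[2]))
```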
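The greedy collapse in the last bullet of point 5 might look like the following; the edit-distance threshold of 1 is an assumption here and would presumably be a parameter:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_collapse(counts, max_dist=1):
    """counts: {UMI: read count}. Returns {parent UMI: [collapsed children]}."""
    remaining = dict(counts)
    groups = {}
    while remaining:
        # most abundant remaining UMI becomes the next parent
        parent = max(remaining, key=remaining.get)
        children = [u for u in remaining
                    if u != parent and hamming(u, parent) <= max_dist]
        groups[parent] = children
        for u in [parent] + children:
            del remaining[u]
    return groups

print(greedy_collapse({"ATGC": 50, "ATGG": 5, "TTGG": 40, "TTGA": 2}))
# {'ATGC': ['ATGG'], 'TTGG': ['TTGA']}
```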
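Finally, a sketch of the deletion definition in point 6: umi_b is treated as umi_a with one base deleted, which shifts the first genomic base after the UMI into the 3' end of the read. next_genomic_base is an assumed input; the real script would take it from the alignment:

```python
def is_single_deletion(umi_a, umi_b, next_genomic_base):
    """True if umi_b can be explained as umi_a with one base deleted."""
    if len(umi_a) != len(umi_b) or umi_b[-1] != next_genomic_base:
        return False
    for i in range(len(umi_a)):
        # delete base i from umi_a and shift the genomic base in at the 3' end
        if umi_a[:i] + umi_a[i + 1:] + next_genomic_base == umi_b:
            return True
    return False

print(is_single_deletion("ATGC", "AGCT", "T"))  # True: 'T' at pos 1 deleted
```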