Lower than expected GQ values, with bimodal distribution #586
Comments
Hi @JakeHagen , My guess is that our model isn't as confident because 100bp reads are not the main type of data our model is trained on. Glad to hear that the number of calls is as expected, though. It's certainly interesting to see the VCF report here. (Side note: Maybe we should consider attaching these reports as part of our documentation, like metrics.md. I'll take a note to consider this for future releases!) By the way, in the past (starting with v1.2), we did try augmenting the training data by creating 100bp and 125bp reads, but we did so by trimming. See this document: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-details-training-data.md#vfootnote12 I'll also ask around on my team to see if anyone else has other thoughts. Thanks for reporting. |
Hi @JakeHagen Thank you for the report, and for including the quality readout from the HTML file. One thing I want to mention is that this distribution is something that we have seen in some samples - see Figure 1 of Accurate, scalable cohort variant calls using DeepVariant and GLnexus. In this figure, some of the analyzed cohorts do have bimodal GQ distributions for DeepVariant calls, while others (e.g. GIAB) do not. Supplementary Figure 3 of that paper indicates that a reasonable component of the bimodal distribution relates to sequence depth: at lower sample sequence depths, GIAB becomes more bimodal. I believe that we internally stratified calls and (though my memory is hazy) found that another factor in the bimodal distribution is whether a site is HET or HOM. Specifically, HET sites with lower depth have lower GQs, and I believe the explanation for this is that as coverage drops, it can become difficult to tell a HET site from either a REF or a HOM site, while HOM sites have more effective signal identifying them as non-REF. I don't think that the model is likely to be less confident in 100bp reads because they make up less of the training data, but I expect that the fact that 100bp reads are harder to map uniquely, which will result in more variability in the coverage of high-MAPQ reads, would contribute indirectly. |
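(Editor's sketch) The HET versus HOM stratification described above could be checked on any single-sample VCF roughly as follows. This is only a minimal illustration, not the analysis the DeepVariant team ran: the function names `classify_genotype` and `gq_by_genotype` are hypothetical, and the code assumes an uncompressed, tab-separated VCF whose FORMAT column contains `GT` and an integer `GQ`.

```python
from collections import defaultdict
from statistics import median


def classify_genotype(gt):
    """Return 'HET', 'HOM_REF', 'HOM_ALT', or None for a missing call."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    if len(set(alleles)) > 1:
        return "HET"
    return "HOM_REF" if alleles[0] == "0" else "HOM_ALT"


def gq_by_genotype(vcf_lines, sample_index=0):
    """Group GQ values by genotype class for one sample column.

    Assumption: plain-text VCF lines with FORMAT in column 9 and
    samples from column 10 onward, as in the VCF specification.
    """
    groups = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        record = dict(zip(keys, values))
        kind = classify_genotype(record.get("GT", "."))
        gq = record.get("GQ", ".")
        if kind is None or gq == ".":
            continue
        groups[kind].append(int(gq))
    return dict(groups)


def median_gq_by_genotype(vcf_lines, sample_index=0):
    """Median GQ per genotype class, a quick check for a HET/HOM split."""
    groups = gq_by_genotype(vcf_lines, sample_index)
    return {kind: median(values) for kind, values in groups.items()}
```

If the lower GQ peak is driven by heterozygous calls, the `HET` median would be expected to sit well below the `HOM_ALT` median; binning the same values by `DP` would be a natural next step.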
I have one other question: do you know what the median insert size is (e.g. from the logging information of BWA)? One other possibility is that the insert sizes for this sample are different and this is interacting with the input channel for insert length. If this is the case, then you would expect that DeepVariant 1.3 (which does not include this channel) would show less of that bimodal distribution. If you do check this and see a difference in GQ distribution, it would be good for us to know. |
Thank you very much for your responses @pichuan @AndrewCarroll . I attached the relevant plots from DeepVariant 1.3's visual output. It has a very similar distribution compared to 1.4. (I also did not know these plots were zoomable.) I also looked into insert size differences using the TLEN field from the BAM, which I believe is equivalent to insert size. The median is 267 and the distribution is below. A sample with a normal GQ distribution had a median of 150 (but also a 75bp read length). Because of the similar plots between 1.3 and 1.4, I don't think insert size is the difference maker, but I can do a deep dive into it if nothing else comes up. I will also investigate the relationship between depth, GT, and GQ. I think you are right that the lower GQ peak is from heterozygous sites, but I will need to look back into some previous plots. Thanks again for your insight. |
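(Editor's sketch) One way a median insert size can be estimated from the TLEN field, assuming the alignments are available as SAM text lines. The function name `median_insert_size` and the `max_tlen` cutoff are hypothetical choices for illustration and are not taken from the thread; TLEN is column 9 of a SAM record, and only the positive-TLEN mate of each properly paired, primary alignment is counted so that each pair contributes once.

```python
from statistics import median


def median_insert_size(sam_lines, max_tlen=10000):
    """Median of positive TLEN values over properly paired primary reads.

    Assumptions: tab-separated SAM text; max_tlen is an arbitrary
    cutoff to drop extreme values, not a recommended setting.
    """
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        if not (flag & 0x2):
            continue  # not flagged as a proper pair
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800)
        if 0 < tlen <= max_tlen:
            sizes.append(tlen)
    return median(sizes) if sizes else None
```

In practice the lines could come from a BAM converted to SAM text (for example, piping the output of `samtools view` into a script that calls this function on standard input).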
Hi @JakeHagen That is a surprising observation. I am going to look at this myself. I'll start with trying to reproduce the observation. I assume this sample is not shareable, but if it is, it would be great for me to start from that. Otherwise, I'll try truncating to 100bp and then to 75bp on a standard sample to see if I can reproduce this finding. |
Thank you @AndrewCarroll , I appreciate you looking into this. I wish I could share this sample, but I cannot (unless you have dbGaP approval?). |
Hi @JakeHagen , |
@pichuan the sample above is in this cohort https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001194.v3.p2 |
Hi @JakeHagen Although we do have access to some dbGaP datasets, I don't believe that this is one of them. Let me conduct some experiments from our benchmark data and see if I can replicate the effect. |
Hi @JakeHagen I took one of our 50x NovaSeq samples with 150bp PE reads and trimmed the reads to create 4 WGS samples, one each for:
I wasn't able to replicate the effect that you see in any of the output reports. In your approach, did you trim the reads from the end (simply truncating to the first 75bp)? There might be something more complicated about your sample. One possible explanation is that you have a run with lower sequencing quality, and trimming to the first 75bp removes some lower-quality parts which look suspicious to DeepVariant. If so, I wonder if your results would differ if you retained only the last 75bp of each read. But I am not quite sure how to diagnose this further. Here are my plots: |
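(Editor's sketch) The two truncation variants discussed here, keeping only the first N bases or only the last N bases of each read, could be produced along these lines. The thread does not say which tool was actually used for trimming, so this is a stand-in: `truncate_fastq` is a hypothetical helper that assumes well-formed, uncompressed FASTQ with exactly four lines per record.

```python
def truncate_fastq(lines, length, keep="start"):
    """Yield FASTQ lines with each read cut to `length` bases.

    keep="start" retains the first `length` bases (and qualities);
    keep="end" retains the last `length` bases.
    Assumption: four-line FASTQ records with no line wrapping.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if keep not in ("start", "end"):
        raise ValueError("keep must be 'start' or 'end'")
    it = iter(lines)
    # zip over the same iterator four times to read one record at a time
    for header, seq, plus, qual in zip(it, it, it, it):
        seq = seq.rstrip("\n")
        qual = qual.rstrip("\n")
        if keep == "start":
            seq, qual = seq[:length], qual[:length]
        else:
            seq, qual = seq[-length:], qual[-length:]
        yield header.rstrip("\n")
        yield seq
        yield plus.rstrip("\n")
        yield qual
```

Both mates of a pair would need the same treatment before realignment, and reads shorter than `length` pass through unchanged.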
I was hoping you would be able to reproduce this, but I'm not surprised you couldn't. Thanks again for all your help |
Hi @JakeHagen I will take a look at running a similar analysis on our exome samples. I suppose one remaining possibility is that the truncation of the reads reduces how far beyond the capture region the sequencing reaches. The edges of the capture region tend to have less coverage, and it is harder to sample both alleles there. That's just a guess; I don't have a clear answer and will still try to collect more data. When you run DeepVariant for the exome, do you restrict to the capture regions only, and do you add any padding to those? |
This was called with a padded BED file. I can rerun with an unpadded file, but I don't think it will make a difference, considering I usually run with a padded file. |
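(Editor's sketch) For readers unfamiliar with padding, this is roughly what adding flanking bases to capture intervals means. The thread does not state how the padded file was made or how much padding was used, so `pad_bed` and its `padding` argument are hypothetical. The sketch assumes a tab-separated BED file with 0-based starts; it does not clip to chromosome lengths or merge intervals that overlap after padding.

```python
def pad_bed(lines, padding):
    """Yield BED lines with each interval widened by `padding` on both sides.

    Assumptions: tab-separated BED; start is clamped at 0; end is NOT
    clipped to the chromosome length; overlapping intervals are kept as-is.
    """
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and header lines
        fields = line.rstrip("\n").split("\t")
        start = max(0, int(fields[1]) - padding)
        end = int(fields[2]) + padding
        yield "\t".join([fields[0], str(start), str(end)] + fields[3:])
```

Comparing GQ for calls inside the unpadded intervals against calls that fall only in the padded flanks would be one way to test the capture-edge hypothesis raised above.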
I believe I have recreated this issue with HG002 from GIAB.
This is how the distributions look:
75bp
This is what my command looked like:
Thanks again for looking into this |
Hi @JakeHagen Thank you for this analysis. This is an interesting observation. I have made some progress on doing the same truncation for the broader exome data we have. It will be interesting to see if that replicates as well. Either way, the fact that you have generated this effect on public data will be very useful. It will be informative to see which factors we can isolate and what we can do to mitigate the effect. We're going to do some experiments here. Thanks again, |
Hi @JakeHagen We may have identified an issue which could specifically have affected exome runs with 100bp read length (but not WGS). We have been able to both replicate your findings and train a model which seems to eliminate the effect in our replication. Would you be interested in running with this custom model that we generated to confirm that it fixes your issue? If so, can you email [email protected] and I can send you both the model and instructions to run it. If this does seem to correct the issue and we can validate the fix, we will plan to push this out as part of the next release. |
That's great. I will try the model for sure. I will reach out shortly by email. Thanks! |
Hi @JakeHagen , |
Describe the issue:
On a specific batch of samples, GQs and QUALs seem to be abnormal.
The GQ and QUAL distributions are bimodal, and for variants they are much lower than I would expect. It doesn't seem like there is anything wrong with the calls themselves; I get an expected number of variants. I also cannot find anything wrong with the input data. It has high base quality throughout the reads, which are 100bp paired-end reads from a NovaSeq with the four-value binned base quality scores. This is the visual report for one sample.
Here is an example. I would expect this variant to have a much higher GQ and QUAL. I also have attached deepvariant's channels png for this variant.
Is this expected, or is something strange happening here? Any insight you can provide would be very much appreciated.
Thank you
Setup
NovaSeq, 100bp paired-end, HG38
Steps to reproduce: