
Lower than expected GQ values, with bimodal distribution #586

Closed

JakeHagen opened this issue Nov 17, 2022 · 20 comments
@JakeHagen

Describe the issue:
On a specific batch of samples, GQs and QUALs seem to be abnormal.
The GQ and QUAL distributions are bimodal, and for variants they are much lower than I would expect. It doesn't seem like there is anything wrong with the calls themselves; I get the expected number of variants. I also cannot find anything wrong with the input data: base quality is high throughout the reads, which are 100bp paired-end reads from a NovaSeq with the four-value binned base quality scores. This is the visual report for one sample.

[image: VCF stats visual report]

Here is an example. I would expect this variant to have a much higher GQ and QUAL. I also have attached deepvariant's channels png for this variant.

chr1    169421916       .       A       G       18.4    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:17:58:29,29:0.5:18,0,22

[image: DeepVariant channels for chr1_169421916_A-G]
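As a side note on how the fields in that record relate: PL values are Phred-scaled, normalized genotype likelihoods, and GQ is conventionally close to the gap between the smallest and second-smallest PL, capped at 99. A minimal sketch of that relationship (not DeepVariant's actual code, which derives GQ from the CNN posterior):

```python
def gq_from_pl(pl):
    """Approximate GQ as the gap between the best and second-best
    Phred-scaled genotype likelihoods (PL), capped at 99."""
    s = sorted(pl)
    return min(s[1] - s[0], 99)

# For the record above, PL=18,0,22:
print(gq_from_pl([18, 0, 22]))  # -> 18
```

The VCF line reports GQ 17, slightly below the PL gap of 18; that kind of small difference is expected because DeepVariant's GQ comes from the model's posterior rather than from the PL field itself.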

Is this expected, or is something strange happening here? Any insight you can provide would be very appreciated.
Thank you

Setup

  • Operating system: Ubuntu 20.04
  • DeepVariant version: 1.4 (but also 1.2)
  • Installation method (Docker, built from source, etc.): Singularity
  • Type of data: NovaSeq, 100bp paired-end, GRCh38

Steps to reproduce:

singularity run -B /usr/lib/locale/:/usr/lib/locale/ -c --pwd $(pwd) -W $(pwd) -B $(pwd) docker://google/deepvariant:1.4.0 /opt/deepvariant/bin/run_deepvariant   --model_type WES   --ref $REF --reads $CRAM --output_vcf $VCF --output_gvcf $GVCF --intermediate_results_dir ./int_results  --regions $BED
@pichuan (Collaborator) commented Nov 18, 2022

Hi @JakeHagen ,

My guess is that our model isn't as confident because 100bp reads are not the main type of data our model is trained on. Glad to hear that the number of calls is as expected, though. Certainly interesting to see the VCF report here. (Side note: maybe we should consider attaching these reports as part of our documentation, like metrics.md. I'll take a note to consider this for future releases!)

By the way, in the past (starting v1.2), we did try augmenting the training data by creating 100bp and 125bp reads, but we did so by trimming. See this document: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-details-training-data.md#vfootnote12
But it's possible that our model still isn't confident enough with your data type.

I'll also ask around on my team to see if anyone else has other thoughts. Thanks for reporting.

@AndrewCarroll (Collaborator)

Hi @JakeHagen

Thank you for the report, and for including the quality readout from the HTML file. One thing I want to mention is that this distribution is something that we have seen in some samples - see Figure 1 of Accurate, scalable cohort variant calls using DeepVariant and GLnexus. In this figure, some of the analyzed cohorts do have bimodal GQ distributions for DeepVariant calls, while others (e.g. GIAB) do not.

Supplementary Figure 3 of that paper indicates that a reasonable component of the bimodal distribution relates to sequencing depth: at lower sample sequencing depths, GIAB becomes more bimodal.

I believe that we internally stratified calls and (though my memory is hazy) found that another factor in the bimodal distribution is whether a site is HET or HOM. Specifically, HET sites with lower depth have lower GQs. I believe the explanation is that as coverage drops, it becomes difficult to tell a HET site from either REF or HOM, while HOM sites carry a stronger signal for being non-REF.
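To check this kind of stratification on your own VCF, one option is to bin calls by genotype class and depth and compare mean GQ across the bins. A hypothetical sketch, assuming you have already parsed (GT, DP, GQ) tuples out of the VCF (the depth cutoff of 20 is arbitrary):

```python
from collections import defaultdict
from statistics import mean

def stratify_gq(records, depth_cutoff=20):
    """records: iterable of (gt, dp, gq) tuples parsed from a VCF.
    Returns mean GQ keyed by (genotype class, depth bin)."""
    buckets = defaultdict(list)
    for gt, dp, gq in records:
        klass = "HET" if gt in ("0/1", "0|1", "1|0") else "HOM"
        depth_bin = "low" if dp < depth_cutoff else "high"
        buckets[(klass, depth_bin)].append(gq)
    return {k: mean(v) for k, v in buckets.items()}
```

If the hypothesis above holds, the ("HET", "low") bucket should sit well below the others.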

I don't think the model is likely to be less confident in 100bp reads simply because they are a smaller share of the training data, but I expect that 100bp reads are harder to uniquely map, and the resulting variability in the coverage of high-MAPQ reads would contribute indirectly.

@AndrewCarroll (Collaborator)

@JakeHagen

I have one other question, do you know what the median insert size is (e.g. from the logging information of BWA)? One other possibility is that the insert sizes for this sample are different and this is interacting with the input channel for insert length.

If this is the case, then you would expect that DeepVariant 1.3 (which does not include this channel) would have less of that bimodal distribution. If you do check this and see a difference in GQ distribution, it would be good for us to know.

@JakeHagen (Author)

Thank you very much for your responses @pichuan @AndrewCarroll. I attached the relevant plots from DeepVariant 1.3's visual output. It has a very similar distribution to 1.4's. (I also did not know these plots were zoomable.)

[image: DeepVariant 1.3 GQ and QUAL distributions]

I also looked into insert size differences using the TLEN field from the BAM, which I believe is equivalent to insert size. The median is 267, and the distribution is below. A sample with a normal GQ distribution had a median of 150 (but also a 75bp read length). Because of the similar plots between 1.3 and 1.4, I don't think insert size is the difference maker, but I can do a deep dive into it if nothing else comes up.
[image: insert size distributions]
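For reference, the TLEN-based median described above can be sketched like this. TLEN in a BAM is signed (mirrored for the two mates) and zero for unpaired reads or mates on another chromosome, so it needs filtering before taking the median; this is just a sketch of that arithmetic, not a BAM parser:

```python
from statistics import median

def median_insert_size(tlens):
    """Median absolute template length, ignoring zero/unset TLEN
    values (unpaired reads or mates placed on another chromosome)."""
    sizes = [abs(t) for t in tlens if t != 0]
    return median(sizes)

# TLEN arrives in mirrored +/- pairs for the two mates of a pair:
median_insert_size([267, -267, 250, -250, 0, 300, -300])  # returns 267.0
```

In practice you would feed this from `samtools view` output or a library like pysam rather than a hand-written list.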

I will also investigate the relationship between depth, GT, and GQ. I think you are right that the lower GQ peak is from heterozygous samples, but I will need to look back into some previous plots.

Thanks again for your insight

@JakeHagen (Author)

Hi again, it looks like this might be a read-length issue. I used only the first 75bp of the same sample, and the GQ and QUAL distributions are what I would expect.

[image: GQ and QUAL distributions for the 75bp-truncated sample]

@AndrewCarroll (Collaborator)

Hi @JakeHagen

That is a surprising observation. I am going to look at this myself. I'll start with trying to reproduce the observation. I assume this sample is not shareable, but if it is, it would be great for me to start from that.

Otherwise, I'll try truncating to 100bp and then to 75bp on a standard sample to see if I can reproduce this finding.

@JakeHagen (Author)

Thank you @AndrewCarroll, I appreciate you looking into this. I wish I could share this sample, but I cannot (unless you have dbGaP approval?).

@pichuan (Collaborator) commented Nov 29, 2022

Hi @JakeHagen ,
which dbGaP dataset do you use?

@JakeHagen (Author)

@AndrewCarroll (Collaborator)

Hi @JakeHagen

Although we do have access to some dbGaP datasets, I don't believe that this is one of them. Let me conduct some experiments from our benchmark data and see if I can replicate the effect.

@AndrewCarroll (Collaborator)

Hi @JakeHagen

I took one of our 50x NovaSeq samples with 150bp PE reads and trimmed the reads to create four WGS samples, one each for:

  1. first 100bp of the reads
  2. last 100bp of the reads
  3. first 75bp of the reads
  4. last 75bp of the reads.

I wasn't able to replicate the effect you see in any of the output reports. In your approach, did you trim the reads from the end (simply truncating to the first 75bp)? There might be something more complicated about your sample. One possible explanation is that your run has lower sequencing quality and trimming to the first 75bp removes some lower-quality parts that look suspicious to DeepVariant. If so, I wonder if your results would differ if you retained only the last 75bp of the reads. But I am not quite sure how to diagnose this further.

Here are my plots:

[image: HG003 NovaSeq 50x, only first 75bp]

[image: HG003 NovaSeq 50x, only last 75bp]

[image: HG003 NovaSeq 50x, only first 100bp]

[image: HG003 NovaSeq 50x, only last 100bp]

@JakeHagen (Author)

Hi @AndrewCarroll

I was hoping you would be able to reproduce this, but I'm not surprised you couldn't.
I clipped the last 26bp from the FASTQ and then realigned, etc.
The base quality scores are very good through the length of the read, but I will try clipping the first 26bp to see if that makes a difference. Also, it probably wouldn't have an effect, but my data is WES, not WGS.

Thanks again for all your help

@JakeHagen (Author)

This is clipping the first 26bp. I am really stumped by this.
[image: GQ distribution after clipping the first 26bp]

@AndrewCarroll (Collaborator)

Hi @JakeHagen

I will take a look at running a similar analysis on our exome samples. I suppose one remaining possibility is that the truncation of the reads reduces how far beyond the capture region the sequencing reaches. The edges of the capture region tend to have less coverage, and it's harder to sample both alleles there. That's just a guess; I don't have a clear answer and will still try to collect more data.

When you run DeepVariant for the exome, do you restrict to the capture regions only and do you add any padding to those?

@JakeHagen (Author)

Hi @AndrewCarroll

This was called with a padded BED file. I can rerun with an unpadded file, but I don't think it will make a difference, considering I usually run with a padded file.
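For anyone following along, the padding discussed here is usually done with `bedtools slop`; the core of it is simple enough to sketch in stdlib Python (a hypothetical helper that, unlike bedtools, does not clamp the end coordinate to the chromosome length, since that needs a genome file):

```python
def pad_bed_line(line, pad=100):
    """Expand a BED interval by `pad` bases on each side,
    clamping the start at 0. Assumes >= 3 tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    start, end = int(fields[1]), int(fields[2])
    fields[1] = str(max(0, start - pad))
    fields[2] = str(end + pad)
    return "\t".join(fields)

pad_bed_line("chr1\t50\t200\texon1")  # -> "chr1\t0\t300\texon1"
```

The equivalent bedtools invocation would be `bedtools slop -i regions.bed -g genome.txt -b 100`.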

@JakeHagen (Author)

Hi @AndrewCarroll

I believe I have recreated this issue with HG002 from GIAB.
This is what I did:

over10.cds.bed.txt

This is how the distributions look
Original (127bp)
[image: original 127bp]

100bp
[image: 100bp]

75bp
[image: 75bp]

This is what my command looked like

 singularity run -B /usr/lib/locale/:/usr/lib/locale/ -B /share/terra docker://google/deepvariant:1.4.0 \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /share/terra/rsrc/hg38/ref/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz \
--reads ./HG002.proc.bam \
--output_vcf HG002.over10dp.vcf.gz \
--output_gvcf HG002.over10dp.gvcf.gz \
--num_shards 8 \
--intermediate_results_dir ./dv_int_results \
--regions ../100bp/over10.cds.bed

Thanks again for looking into this

@AndrewCarroll (Collaborator)

Hi @JakeHagen

Thank you for this analysis. This is an interesting observation. I have been making some progress on doing the same truncation for the broader exome data we have. It will be interesting to see if that replicates as well.

Either way, the fact that you have reproduced this effect on public data will be very useful. It will be informative to see what we can do to isolate or mitigate the effect. We're going to run some experiments here.

Thanks again,
Andrew

@AndrewCarroll (Collaborator)

Hi @JakeHagen

We may have identified an issue that could specifically have affected exome runs with 100bp read lengths (but not WGS). We have been able both to replicate your findings and to train a model that seems to eliminate the effect in our replication.

Would you be interested in running with this custom model that we generated, to confirm that it fixes your issue? If so, can you email [email protected] and I can send you both the model and instructions to run it.

If this does seem to correct the issue and we can validate the fix, we will plan to push this out as a part of next release.

@JakeHagen (Author)

That's great. I will try the model for sure. I will reach out shortly by email.

Thanks!

@pichuan (Collaborator) commented Mar 7, 2023

Hi @JakeHagen ,
this problem should be fixed in https://github.com/google/deepvariant/releases/tag/v1.5.0. And, starting with this release, we added the VCF stats plots to https://github.com/google/deepvariant/blob/r1.5/docs/metrics.md and https://github.com/google/deepvariant/blob/r1.5/docs/metrics-deeptrio.md. We're glad to see that @MariaNattestad's VCF stats tool was useful for you!

@pichuan pichuan closed this as completed Mar 7, 2023