
Lower than expected GQ values, with bimodal distribution #586

Closed

JakeHagen opened this issue Nov 17, 2022 · 20 comments
@JakeHagen

Describe the issue:
On a specific batch of samples, GQs and QUALs seem to be abnormal.
The GQ and QUAL distributions are bimodal, and for variants they are much lower than I would expect. It doesn't seem like there is anything wrong with the calls themselves; I get the expected number of variants. I also cannot find anything wrong with the input data: base quality is high throughout the reads, which are 100bp paired-end reads from a NovaSeq with the four-value binned base quality scores. This is the visual report for one sample.

[image: VCF stats visual report]

Here is an example. I would expect this variant to have a much higher GQ and QUAL. I also have attached deepvariant's channels png for this variant.

chr1    169421916       .       A       G       18.4    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:17:58:29,29:0.5:18,0,22

[image: DeepVariant channels for chr1_169421916_A-G]
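As a side note on how the fields in that record relate: PL values are Phred-scaled, normalized genotype likelihoods, and GQ is conventionally close to the gap between the smallest and second-smallest PL, capped at 99. A minimal sketch of that relationship (not DeepVariant's actual code, which derives GQ from the CNN posterior):

```python
def gq_from_pl(pl):
    """Approximate GQ as the gap between the best and second-best
    Phred-scaled genotype likelihoods (PL), capped at 99."""
    s = sorted(pl)
    return min(s[1] - s[0], 99)

# For the record above, PL=18,0,22:
print(gq_from_pl([18, 0, 22]))  # -> 18
```

The VCF line reports GQ 17, slightly below the PL gap of 18; that kind of small difference is expected because DeepVariant's GQ comes from the model's posterior rather than from the PL field itself.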

Is this expected, or is something strange happening here? Any insight you can provide would be very appreciated.
Thank you

Setup

  • Operating system: Ubuntu 20.04
  • DeepVariant version: 1.4 (but also 1.2)
  • Installation method (Docker, built from source, etc.): Singularity
  • Type of data: NovaSeq, 100bp paired-end, GRCh38

Steps to reproduce:

singularity run -B /usr/lib/locale/:/usr/lib/locale/ -c --pwd $(pwd) -W $(pwd) -B $(pwd) docker://google/deepvariant:1.4.0 /opt/deepvariant/bin/run_deepvariant   --model_type WES   --ref $REF --reads $CRAM --output_vcf $VCF --output_gvcf $GVCF --intermediate_results_dir ./int_results  --regions $BED
@pichuan (Collaborator) commented Nov 18, 2022

Hi @JakeHagen ,

My guess is that our model isn't as confident because 100bp reads are not the main type of data our model is trained on. Glad to hear that the number of calls is as expected, though. Certainly interesting to see the VCF report here. (Side note: maybe we should consider attaching these reports as part of our documentation, like metrics.md. I'll take a note to consider this for future releases!)

By the way, in the past (starting v1.2), we did try augmenting the training data by creating 100bp and 125bp reads, but we did so by trimming. See this document: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-details-training-data.md#vfootnote12
But it's possible that our model still isn't confident enough with your data type.

I'll also ask around on my team to see if anyone else has other thoughts. Thanks for reporting.

@AndrewCarroll (Collaborator)

Hi @JakeHagen

Thank you for the report, and for including the quality readout from the HTML file. One thing I want to mention is that this distribution is something that we have seen in some samples - see Figure 1 of Accurate, scalable cohort variant calls using DeepVariant and GLnexus. In this figure, some of the analyzed cohorts do have bimodal GQ distributions for DeepVariant calls, while others (e.g. GIAB) do not.

Supplementary Figure 3 of that paper indicates that a reasonable component of the bimodal distribution relates to sequencing depth: at lower sample sequencing depths, GIAB becomes more bimodal.

I believe that we internally stratified calls and (though my memory is hazy) found that another factor in the bimodal distribution is whether a site is HET or HOM. Specifically, HET sites with lower depth have lower GQs. I believe the explanation is that as coverage drops, it becomes difficult to tell a HET site from either REF or HOM, while HOM sites carry a stronger signal for being non-REF.
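To check this kind of stratification on your own VCF, one option is to bin calls by genotype class and depth and compare mean GQ across the bins. A hypothetical sketch, assuming you have already parsed (GT, DP, GQ) tuples out of the VCF (the depth cutoff of 20 is arbitrary):

```python
from collections import defaultdict
from statistics import mean

def stratify_gq(records, depth_cutoff=20):
    """records: iterable of (gt, dp, gq) tuples parsed from a VCF.
    Returns mean GQ keyed by (genotype class, depth bin)."""
    buckets = defaultdict(list)
    for gt, dp, gq in records:
        klass = "HET" if gt in ("0/1", "0|1", "1|0") else "HOM"
        depth_bin = "low" if dp < depth_cutoff else "high"
        buckets[(klass, depth_bin)].append(gq)
    return {k: mean(v) for k, v in buckets.items()}
```

If the hypothesis above holds, the ("HET", "low") bucket should sit well below the others.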

I don't think the model is likely to be less confident in 100bp reads simply because they are a smaller share of the training data, but I expect that 100bp reads are harder to uniquely map, and the resulting variability in the coverage of high-MAPQ reads would contribute indirectly.

@AndrewCarroll (Collaborator)

@JakeHagen

I have one other question, do you know what the median insert size is (e.g. from the logging information of BWA)? One other possibility is that the insert sizes for this sample are different and this is interacting with the input channel for insert length.

If this is the case, then you would expect that DeepVariant 1.3 (which does not include this channel) would have less of that bimodal distribution. If you do check this and see a difference in GQ distribution, it would be good for us to know.

@JakeHagen (Author)

Thank you very much for your responses @pichuan @AndrewCarroll. I attached the relevant plots from DeepVariant 1.3's visual output. It has a very similar distribution to 1.4's. (I also did not know these plots were zoomable.)

[image: DeepVariant 1.3 GQ and QUAL distributions]

I also looked into insert size differences using the TLEN field from the BAM, which I believe is equivalent to insert size. The median is 267, and the distribution is below. A sample with a normal GQ distribution had a median of 150 (but also a 75bp read length). Because of the similar plots between 1.3 and 1.4, I don't think insert size is the difference maker, but I can do a deep dive into it if nothing else comes up.
[image: insert size distributions]
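For reference, the TLEN-based median described above can be sketched like this. TLEN in a BAM is signed (mirrored for the two mates) and zero for unpaired reads or mates on another chromosome, so it needs filtering before taking the median; this is just a sketch of that arithmetic, not a BAM parser:

```python
from statistics import median

def median_insert_size(tlens):
    """Median absolute template length, ignoring zero/unset TLEN
    values (unpaired reads or mates placed on another chromosome)."""
    sizes = [abs(t) for t in tlens if t != 0]
    return median(sizes)

# TLEN arrives in mirrored +/- pairs for the two mates of a pair:
median_insert_size([267, -267, 250, -250, 0, 300, -300])  # returns 267.0
```

In practice you would feed this from `samtools view` output or a library like pysam rather than a hand-written list.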

I will also investigate the relationship between depth, GT, and GQ. I think you are right that the lower GQ peak is from heterozygous samples, but I will need to look back into some previous plots.

Thanks again for your insight

@JakeHagen (Author)

Hi again, it looks like this might be a read-length issue. I used only the first 75bp of the same sample, and the GQ and QUAL distributions are what I would expect.

[image: GQ and QUAL distributions for the 75bp-truncated sample]

@AndrewCarroll (Collaborator)

Hi @JakeHagen

That is a surprising observation. I am going to look at this myself. I'll start with trying to reproduce the observation. I assume this sample is not shareable, but if it is, it would be great for me to start from that.

Otherwise, I'll try truncating to 100bp and then to 75bp on a standard sample to see if I can reproduce this finding.

@JakeHagen (Author)

Thank you @AndrewCarroll, I appreciate you looking into this. I wish I could share this sample, but I cannot (unless you have dbGaP approval?).

@pichuan (Collaborator) commented Nov 29, 2022

Hi @JakeHagen ,
which dbGaP dataset do you use?

@JakeHagen (Author)

@AndrewCarroll (Collaborator)

Hi @JakeHagen

Although we do have access to some dbGaP datasets, I don't believe that this is one of them. Let me conduct some experiments from our benchmark data and see if I can replicate the effect.

@AndrewCarroll (Collaborator)

Hi @JakeHagen

I took one of our 50x NovaSeq samples with 150bp PE reads and trimmed the reads to create four WGS samples, one each for:

  1. first 100bp of the reads
  2. last 100bp of the reads
  3. first 75bp of the reads
  4. last 75bp of the reads.

I wasn't able to replicate the effect you see in any of the output reports. In your approach, did you trim the reads from the end (simply truncating to the first 75bp)? There might be something more complicated about your sample. One possible explanation is that your run has lower sequencing quality and trimming to the first 75bp removes some lower-quality parts that look suspicious to DeepVariant. If so, I wonder if your results would differ if you retained only the last 75bp of the reads. But I am not quite sure how to diagnose this further.

Here are my plots:

[image: HG003 NovaSeq 50x, only first 75bp]

[image: HG003 NovaSeq 50x, only last 75bp]

[image: HG003 NovaSeq 50x, only first 100bp]

[image: HG003 NovaSeq 50x, only last 100bp]

@JakeHagen (Author)

Hi @AndrewCarroll

I was hoping you would be able to reproduce this, but I'm not surprised you couldn't.
I clipped the last 26bp from the FASTQ and then realigned, etc.
The base quality scores are very good through the length of the read, but I will try clipping the first 26bp to see if that makes a difference. Also, it probably wouldn't have an effect, but my data is WES, not WGS.

Thanks again for all your help

@JakeHagen (Author)

This is clipping the first 26bp. I am really stumped by this.
[image: GQ distribution after clipping the first 26bp]

@AndrewCarroll (Collaborator)

Hi @JakeHagen

I will take a look at running a similar analysis on our exome samples. I suppose one remaining possibility is that the truncation of the reads reduces how far beyond the capture region the sequencing reaches. The edges of the capture region tend to have less coverage, and it's harder to sample both alleles there. That's just a guess; I don't have a clear answer and will still try to collect more data.

When you run DeepVariant for the exome, do you restrict to the capture regions only and do you add any padding to those?

@JakeHagen (Author)

Hi @AndrewCarroll

This was called with a padded BED file. I can rerun with an unpadded file, but I don't think it will make a difference, considering I usually run with a padded file.
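For anyone following along, the padding discussed here is usually done with `bedtools slop`; the core of it is simple enough to sketch in stdlib Python (a hypothetical helper that, unlike bedtools, does not clamp the end coordinate to the chromosome length, since that needs a genome file):

```python
def pad_bed_line(line, pad=100):
    """Expand a BED interval by `pad` bases on each side,
    clamping the start at 0. Assumes >= 3 tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    start, end = int(fields[1]), int(fields[2])
    fields[1] = str(max(0, start - pad))
    fields[2] = str(end + pad)
    return "\t".join(fields)

pad_bed_line("chr1\t50\t200\texon1")  # -> "chr1\t0\t300\texon1"
```

The equivalent bedtools invocation would be `bedtools slop -i regions.bed -g genome.txt -b 100`.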

@JakeHagen (Author)

Hi @AndrewCarroll

I believe I have recreated this issue with HG002 from GIAB.
This is what I did:

over10.cds.bed.txt

This is how the distributions look
Original (127bp)
[image: original 127bp]

100bp
[image: 100bp]

75bp
[image: 75bp]

This is what my command looked like

 singularity run -B /usr/lib/locale/:/usr/lib/locale/ -B /share/terra docker://google/deepvariant:1.4.0 \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /share/terra/rsrc/hg38/ref/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz \
--reads ./HG002.proc.bam \
--output_vcf HG002.over10dp.vcf.gz \
--output_gvcf HG002.over10dp.gvcf.gz \
--num_shards 8 \
--intermediate_results_dir ./dv_int_results \
--regions ../100bp/over10.cds.bed

Thanks again for looking into this

@AndrewCarroll (Collaborator)

Hi @JakeHagen

Thank you for this analysis. This is an interesting observation. I have been making some progress on doing the same truncation for the broader exome data we have. It will be interesting to see if that replicates as well.

Either way, the fact that you have reproduced this effect on public data will be very useful. It will be informative to see what we can do to isolate or mitigate the effect. We're going to run some experiments here.

Thanks again,
Andrew

@AndrewCarroll (Collaborator)

Hi @JakeHagen

We may have identified an issue that could specifically have affected exome runs with 100bp read lengths (but not WGS). We have been able both to replicate your findings and to train a model that seems to eliminate the effect in our replication.

Would you be interested in running with this custom model that we generated, to confirm that it fixes your issue? If so, can you email [email protected] and I can send you both the model and instructions to run it.

If this does seem to correct the issue and we can validate the fix, we will plan to push this out as a part of next release.

@JakeHagen (Author)

That's great. I will try the model for sure. I will reach out shortly by email.

Thanks!

@pichuan (Collaborator) commented Mar 7, 2023

Hi @JakeHagen ,
this problem should be fixed in https://github.com/google/deepvariant/releases/tag/v1.5.0. And, starting with this release, we added the VCF stats plots to https://github.com/google/deepvariant/blob/r1.5/docs/metrics.md and https://github.com/google/deepvariant/blob/r1.5/docs/metrics-deeptrio.md. We're glad to see that @MariaNattestad's VCF stats tool was useful for you!

@pichuan pichuan closed this as completed Mar 7, 2023