Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs for paired reads in Mutect2 somatic genotyping #6264

Merged
merged 2 commits into from
Nov 18, 2019
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file modified docs/mutect/mutect.pdf
Binary file not shown.
18 changes: 13 additions & 5 deletions docs/mutect/mutect.tex
Original file line number Diff line number Diff line change
Expand Up @@ -149,17 +149,25 @@ \subsection{Finding Active Regions}
\subsection{Local Assembly, Pair-HMM, and Realignment}
These topics, which are common to \code{Mutect2} and \code{HaplotypeCaller}, are discussed in docs/local{\_}assembly.pdf, docs/pair{\_}hmm.pdf, and docs/variants{\_}from{\_}haplotypes.pdf in the gatk git repository. As a black box, whenever the evidence in the previous section suffices to trigger local assembly and realignment, we end up at each candidate variant site with one read-vs-allele log likelihood matrix $\ell$ for each sample, where $ \ell_{ra}$ is the log probability of sequencing read $r$ given its base qualities and given that read $r$ is derived from a molecule exhibiting allele $a$. \footnote{Technically, pair-HMM produces a read-vs-\textit{haplotype} likelihood matrix, which is then ``marginalized" to produce a set of read-vs-allele likelihood matrices. In the future, \code{Mutect2} may operate directly on this read-vs-haplotype matrix in order to exploit the biological fact that there are only a few haplotypes in any region.}

%%%%%%%%%%%%%%%%%%
%PAIRED READS
%%%%%%%%%%%%%%%%%%
\subsection{Handling Paired Reads}
The fundamental unit of evidence in paired-end sequencing is not the read but the fragment of DNA from which two reads were sequenced. If we apply the somatic likelihoods model below directly to the read-vs-haplotype likelihoods matrix from Pair-HMM our model would in effect allow a read and its mate to come from two different biological haplotypes. To prevent this absurdity, we transform the read-vs-haplotype likelihoods matrix into a fragment-vs-haplotype likelihood matrix. The simplest approach, which \code{Mutect2} uses, is to multiply the likelihoods of all paired reads in a fragment (in log space, add) to obtain the fragment's likelihood. That is, $P({\rm fragment} | {\rm haplotype}) = P({\rm read~1} | {\rm haplotype}) P({\rm read~2} | {\rm haplotype})$. Multiplying likelihoods is justified as far as sequencing error is concerned, since sequencing errors on paired reads are statistically independent. There are, of course shared covariates such as sequence context that influence errors on both reads. These, however, are the domain of \code{FilterMutectCalls}'s downstream filtering. The somatic likelihoods model of \code{Mutect2} is \textit{only} concerned with distinguishing sequencing errors (in the narrow sense of an error that occurred on the sequencer itself) from possible somatic variants. \code{FilterMutectCalls} is responsible for distinguishing somatic variants from all other errors, such as those that occur during sample preparation and alignment.

Simply multiplying likelihoods in this way is not justified when paired reads overlap because while sequencing errors are independent, PCR errors are not. That is, a substitution occurring on both reads may be due to independent sequencing errors or to the amplification of a single PCR error. In order to force the possibility of PCR error into a model of independent reads, we enforce that the effective, multiplied, likelihood must not exceed a bound given by the probability of PCR error. To achieve this, at bases where read pairs overlap, \code{Mutect2} caps the total (that is, the sum, because qualities are measured in a logarithmic phred scale) base and indel qualities to user-defined PCR SNV and indel qualities. This can be adjusted for PCR-free protocols. For example, if the PCR quality is 40 and two reads overlap with base qualities of 25 and 30, each base quality is replaced with $40/2 = 20$. If the base qualities were both, say, 15, no adjustment would be needed because the probability of sequencing error dominates the probability of PCR error. This caping of overlapping base qualities occurs before Pair-HMM, while merging reads into fragments occurs after Pair-HMM.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This caping -> This capping

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging comma in "There are, of course..."


%%%%%%%%%%%%%%%%%%
%SOMATIC LIKELIHOODS MODEL
%%%%%%%%%%%%%%%%%%
\subsection{Somatic Likelihoods Model}\label{introduction}
We have a set of potential somatic alleles and read-allele likelihoods $\ell_{ra} \equiv P({\rm read~}r|{\rm allele~}a)$. We don't know which alleles are real somatic alleles and so we must compute, for each subset $\mathbb{A}$ of alleles, the likelihood that the reads come from $\mathbb{A}$. A simple model for this likelihood is as follows: each read $r$ is associated with a latent indicator vector $\vz_r$ with one-hot encoding $z_{ra} = 1$ iff read $r$ came from allele $a \in \mathbb{A}$. The conditional probabilities of reads given alleles is $\ell_{ra}$.
There is a latent vector $\vf$ of allele fractions such that $f_a$ is the allele fraction of allele $a$, that is, the prior probability that any given read comes from allele $a$. Giving $\vf$ a Dirichlet prior, we have a full-model likelihood
We have a set of potential somatic alleles and fragment-allele likelihoods $\ell_{ra} \equiv P({\rm fragment~}r|{\rm allele~}a)$. We don't know which alleles are real somatic alleles and so we must compute, for each subset $\mathbb{A}$ of alleles, the likelihood that the fragments come from $\mathbb{A}$. A simple model for this likelihood is as follows: each fragment $r$ is associated with a latent indicator vector $\vz_r$ with one-hot encoding $z_{ra} = 1$ iff fragment $r$ came from allele $a \in \mathbb{A}$. The conditional probabilities of fragments given alleles is $\ell_{ra}$.
There is a latent vector $\vf$ of allele fractions such that $f_a$ is the allele fraction of allele $a$, that is, the prior probability that any given fragment comes from allele $a$. Giving $\vf$ a Dirichlet prior, we have a full-model likelihood
\begin{equation}
P(\mathbb{R}, \vz, \vf | \mathbb{A}) = P(\vf) P(\vz | \vf) P( \mathbb{R} | \vz, \mathbb{A}) = {\rm Dir}(\vf | \valpha) \prod_a \prod_r \left( f_a \ell_{ra}\right)^{z_{ra}}.
\label{full_likelihood}
\end{equation}
We want to marginalize the latent variables to obtain the evidence $P(\mathbb{R} | \mathbb{A})$, which we make tractable via a mean-field approximation $P(\mathbb{R}, \vz, \vf | \mathbb{A}) \approx q(\vz) q(\vf)$, which is exact in two limits. First, if there are many reads, each allele is associated with many reads and therefore the Law of Large Numbers causes $\vf$ and $\vz$ to become uncorrelated. Second, if the allele assignments of reads are obvious $\vz_r$ is effectively determinate, hence uncorrelated with $\vf$. In the variational Bayesian mean-field formalism we have
We want to marginalize the latent variables to obtain the evidence $P(\mathbb{R} | \mathbb{A})$, which we make tractable via a mean-field approximation $P(\mathbb{R}, \vz, \vf | \mathbb{A}) \approx q(\vz) q(\vf)$, which is exact in two limits. First, if there are many fragments, each allele is associated with many fragments and therefore the Law of Large Numbers causes $\vf$ and $\vz$ to become uncorrelated. Second, if the allele assignments of fragments are obvious $\vz_r$ is effectively determinate, hence uncorrelated with $\vf$. In the variational Bayesian mean-field formalism we have
\begin{align}
q(\vf) \propto& \exp E_{q(\vz)} \left[ \ln P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] \propto {\rm Dir}(\vf | \valpha + \sum_r \bar{\vz}_r) \equiv {\rm Dir}(\vf | \vbeta), \quad \vbeta = \valpha + \sum_r \bar{\vz}_r \label{z_mean_field} \\
q(\vz_r) \propto& \exp E_{q(\vf)} \left[ \ln P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] \propto \prod_a \left( \tilde{f}_a \ell_{ra}\right)^{z_{ra}}, \label{f_mean_field}
Expand All @@ -170,7 +178,7 @@ \subsection{Somatic Likelihoods Model}\label{introduction}
\ln \tilde{f}_a =& E_{q(\vf)}[\ln f_a] = \psi(\beta_a) - \psi(\sum_{a^\prime} \beta_{a^\prime})
\label{f-tilde}
\end{align}
are easily obtained from the categorical distribution $q(\vz)$ and the Dirichlet distribution $q(\vf)$\footnote{Note that we didn't \textit{impose} this in any way. It simply falls out of the mean field equations.}. We initialize $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise and iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. Having obtained the mean fields of $q(\vz)$ and $q(\vf)$, we use the variational approximation (Bishop's Eq 10.3) to the model evidence:
are easily obtained from the categorical distribution $q(\vz)$ and the Dirichlet distribution $q(\vf)$\footnote{Note that we didn't \textit{impose} this in any way. It simply falls out of the mean field equations.}. We initialize $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for fragment $r$, 0 otherwise and iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. Having obtained the mean fields of $q(\vz)$ and $q(\vf)$, we use the variational approximation (Bishop's Eq 10.3) to the model evidence:
\begin{align}
\ln P(\mathbb{R} | \mathbb{A}) \approx E_q \left[ \ln P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] - E_q \left[ \ln q(\vz) \right] - E_q \left[ \ln q(\vf) \right]. \label{lagrangian}
\end{align}
Expand All @@ -184,7 +192,7 @@ \subsection{Somatic Likelihoods Model}\label{introduction}
\ln \Gamma(\sum_a \omega_a) - \sum_a \ln \Gamma(\omega_a).
\end{equation}

We now have the model evidence for allele subset $\mathbb{A}$. The \code{TLOD} emitted by \code{Mutect2} for an alt allele is the log evidence ratio of an allele set containing all alleles versus an allele set excluding that allele. That is, it is the log odds that an allele exists. When multiple tumor samples are given, \code{Mutect2} computes a single \code{TLOD} by combining all tumor reads.
We now have the model evidence for allele subset $\mathbb{A}$. The \code{TLOD} emitted by \code{Mutect2} for an alt allele is the log evidence ratio of an allele set containing all alleles versus an allele set excluding that allele. That is, it is the log odds that an allele exists. When multiple tumor samples are given, \code{Mutect2} computes a single \code{TLOD} by combining all tumor fragments.

%%%%%%%
%%%%%%%
Expand Down