broadinstitute · davidbenjamin · Nov 18, 2019 · Nov 15, 2019 · Nov 18, 2019 · takutosato
diff --git a/docs/mutect/mutect.pdf b/docs/mutect/mutect.pdf
diff --git a/docs/mutect/mutect.tex b/docs/mutect/mutect.tex
@@ -149,17 +149,25 @@ \subsection{Finding Active Regions}
 \subsection{Local Assembly, Pair-HMM, and Realignment}
 These topics, which are common to \code{Mutect2} and \code{HaplotypeCaller}, are discussed in docs/local{\_}assembly.pdf, docs/pair{\_}hmm.pdf, and docs/variants{\_}from{\_}haplotypes.pdf in the gatk git repository.  As a black box, whenever the evidence in the previous section suffices to trigger local assembly and realignment, we end up at each candidate variant site with one read-vs-allele log likelihood matrix $\ell$ for each sample, where $ \ell_{ra}$ is the log probability of sequencing read $r$ given its base qualities and given that read $r$ is derived from a molecule exhibiting allele $a$.  \footnote{Technically, pair-HMM produces a read-vs-\textit{haplotype} likelihood matrix, which is then ``marginalized" to produce a set of read-vs-allele likelihood matrices.  In the future, \code{Mutect2} may operate directly on this read-vs-haplotype matrix in order to exploit the biological fact that there are only a few haplotypes in any region.}
 
+%%%%%%%%%%%%%%%%%%
+%PAIRED READS
+%%%%%%%%%%%%%%%%%%
+\subsection{Handling Paired Reads}
+The fundamental unit of evidence in paired-end sequencing is not the read but the fragment of DNA from which two reads were sequenced.  If we apply the somatic likelihoods model below directly to the read-vs-haplotype likelihoods matrix from Pair-HMM our model would in effect allow a read and its mate to come from two different biological haplotypes.  To prevent this absurdity, we transform the read-vs-haplotype likelihoods matrix into a fragment-vs-haplotype likelihood matrix.  The simplest approach, which \code{Mutect2} uses, is to multiply the likelihoods of all paired reads in a fragment (in log space, add) to obtain the fragment's likelihood.  That is, $P({\rm fragment} | {\rm haplotype}) = P({\rm read~1} | {\rm haplotype}) P({\rm read~2} | {\rm haplotype})$.  Multiplying likelihoods is justified as far as sequencing error is concerned, since sequencing errors on paired reads are statistically independent.  There are, of course shared covariates such as sequence context that influence errors on both reads.  These, however, are the domain of \code{FilterMutectCalls}'s downstream filtering.  The somatic likelihoods model of \code{Mutect2} is \textit{only} concerned with distinguishing sequencing errors (in the narrow sense of an error that occurred on the sequencer itself) from possible somatic variants.  \code{FilterMutectCalls} is responsible for distinguishing somatic variants from all other errors, such as those that occur during sample preparation and alignment.
+
+Simply multiplying likelihoods in this way is not justified when paired reads overlap because while sequencing errors are independent, PCR errors are not.  That is, a substitution occurring on both reads may be due to independent sequencing errors or to the amplification of a single PCR error.  In order to force the possibility of PCR error into a model of independent reads, we enforce that the effective, multiplied, likelihood must not exceed a bound given by the probability of PCR error.  To achieve this, at bases where read pairs overlap, \code{Mutect2} caps the total (that is, the sum, because qualities are measured in a logarithmic phred scale) base and indel qualities to user-defined PCR SNV and indel qualities.  This can be adjusted for PCR-free protocols.  For example, if the PCR quality is 40 and two reads overlap with base qualities of 25 and 30, each base quality is replaced with $40/2 = 20$.  If the base qualities were both, say, 15, no adjustment would be needed because the probability of sequencing error dominates the probability of PCR error.  This caping of overlapping base qualities occurs before Pair-HMM, while merging reads into fragments occurs after Pair-HMM.
+
 %%%%%%%%%%%%%%%%%%
 %SOMATIC LIKELIHOODS MODEL
 %%%%%%%%%%%%%%%%%%
 \subsection{Somatic Likelihoods Model}\label{introduction}
-We have a set of potential somatic alleles and read-allele likelihoods $\ell_{ra} \equiv P({\rm read~}r|{\rm allele~}a)$.  We don't know which alleles are real somatic alleles and so we must compute, for each subset $\mathbb{A}$ of alleles, the likelihood that the reads come from $\mathbb{A}$.  A simple model for this likelihood is as follows: each read $r$ is associated with a latent indicator vector $\vz_r$ with one-hot encoding $z_{ra} = 1$ iff read $r$ came from allele $a \in \mathbb{A}$.  The conditional probabilities of reads given alleles is $\ell_{ra}$.
-There is a latent vector $\vf$ of allele fractions such that $f_a$ is the allele fraction of allele $a$, that is, the prior probability that any given read comes from allele $a$.  Giving $\vf$ a Dirichlet prior, we have a full-model likelihood
+We have a set of potential somatic alleles and fragment-allele likelihoods $\ell_{ra} \equiv P({\rm fragment~}r|{\rm allele~}a)$.  We don't know which alleles are real somatic alleles and so we must compute, for each subset $\mathbb{A}$ of alleles, the likelihood that the fragments come from $\mathbb{A}$.  A simple model for this likelihood is as follows: each fragment $r$ is associated with a latent indicator vector $\vz_r$ with one-hot encoding $z_{ra} = 1$ iff fragment $r$ came from allele $a \in \mathbb{A}$.  The conditional probabilities of fragments given alleles is $\ell_{ra}$.
+There is a latent vector $\vf$ of allele fractions such that $f_a$ is the allele fraction of allele $a$, that is, the prior probability that any given fragment comes from allele $a$.  Giving $\vf$ a Dirichlet prior, we have a full-model likelihood
 \begin{equation}
 P(\mathbb{R}, \vz, \vf | \mathbb{A}) = P(\vf) P(\vz | \vf)  P( \mathbb{R} | \vz, \mathbb{A}) = {\rm Dir}(\vf | \valpha) \prod_a  \prod_r \left( f_a \ell_{ra}\right)^{z_{ra}}.
 \label{full_likelihood}
 \end{equation}
-We want to marginalize the latent variables to obtain the evidence $P(\mathbb{R} | \mathbb{A})$, which we make tractable via a mean-field approximation $P(\mathbb{R}, \vz, \vf | \mathbb{A}) \approx q(\vz) q(\vf)$, which is exact in two limits.  First, if there are many reads, each allele is associated with many reads and therefore the Law of Large Numbers causes $\vf$ and $\vz$ to become uncorrelated.  Second, if the allele assignments of reads are obvious $\vz_r$ is effectively determinate, hence uncorrelated with $\vf$.  In the variational Bayesian mean-field formalism we have
+We want to marginalize the latent variables to obtain the evidence $P(\mathbb{R} | \mathbb{A})$, which we make tractable via a mean-field approximation $P(\mathbb{R}, \vz, \vf | \mathbb{A}) \approx q(\vz) q(\vf)$, which is exact in two limits.  First, if there are many fragments, each allele is associated with many fragments and therefore the Law of Large Numbers causes $\vf$ and $\vz$ to become uncorrelated.  Second, if the allele assignments of fragments are obvious $\vz_r$ is effectively determinate, hence uncorrelated with $\vf$.  In the variational Bayesian mean-field formalism we have
 \begin{align}
 q(\vf) \propto& \exp E_{q(\vz)} \left[ \ln P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] \propto {\rm Dir}(\vf | \valpha + \sum_r \bar{\vz}_r) \equiv {\rm Dir}(\vf | \vbeta), \quad \vbeta = \valpha + \sum_r \bar{\vz}_r \label{z_mean_field}  \\
 q(\vz_r) \propto& \exp E_{q(\vf)} \left[ \ln P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right]  \propto \prod_a \left( \tilde{f}_a \ell_{ra}\right)^{z_{ra}}, \label{f_mean_field}
@@ -170,7 +178,7 @@ \subsection{Somatic Likelihoods Model}\label{introduction}
 \ln \tilde{f}_a =& E_{q(\vf)}[\ln f_a] = \psi(\beta_a) - \psi(\sum_{a^\prime} \beta_{a^\prime})
 \label{f-tilde}
 \end{align}
-are easily obtained from the categorical distribution $q(\vz)$ and the Dirichlet distribution $q(\vf)$\footnote{Note that we didn't \textit{impose} this in any way.  It simply falls out of the mean field equations.}.  We initialize $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise and iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. Having obtained the mean fields of $q(\vz)$ and $q(\vf)$, we use the variational approximation (Bishop's Eq 10.3) to the model evidence:
+are easily obtained from the categorical distribution $q(\vz)$ and the Dirichlet distribution $q(\vf)$\footnote{Note that we didn't \textit{impose} this in any way.  It simply falls out of the mean field equations.}.  We initialize $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for fragment $r$, 0 otherwise and iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. Having obtained the mean fields of $q(\vz)$ and $q(\vf)$, we use the variational approximation (Bishop's Eq 10.3) to the model evidence:
 \begin{align}
 \ln P(\mathbb{R} | \mathbb{A}) \approx  E_q \left[ \ln P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] - E_q \left[ \ln q(\vz) \right] - E_q \left[ \ln q(\vf) \right]. \label{lagrangian}
 \end{align}
@@ -184,7 +192,7 @@ \subsection{Somatic Likelihoods Model}\label{introduction}
 \ln \Gamma(\sum_a \omega_a) - \sum_a \ln \Gamma(\omega_a).
 \end{equation}
 
-We now have the model evidence for allele subset $\mathbb{A}$. The \code{TLOD} emitted by \code{Mutect2} for an alt allele is the log evidence ratio of an allele set containing all alleles versus an allele set excluding that allele.  That is, it is the log odds that an allele exists.  When multiple tumor samples are given, \code{Mutect2} computes a single \code{TLOD} by combining all tumor reads.
+We now have the model evidence for allele subset $\mathbb{A}$. The \code{TLOD} emitted by \code{Mutect2} for an alt allele is the log evidence ratio of an allele set containing all alleles versus an allele set excluding that allele.  That is, it is the log odds that an allele exists.  When multiple tumor samples are given, \code{Mutect2} computes a single \code{TLOD} by combining all tumor fragments.
 
 %%%%%%%
 %%%%%%%