greenelab · cgreene · Jul 23, 2019 · May 9, 2019 · Jul 22, 2019 · Jul 22, 2019
diff --git a/content/04.study.md b/content/04.study.md
@@ -47,6 +47,29 @@ Deep learning applied to gene expression data is still in its infancy, but the f
 Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies.
 For example, the effects of cellular heterogeneity on basic biology and disease etiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.
 
+### DNA Methylation (to remove after merge)
+
+#### Latent Space Construction
+
+Unsupervised discovery of biologically-significant features is another major area of interest for researchers using DNA methylation data.
+A consistent theme of these methods is that they construct a low-dimensional space that semantically encodes biologically important features from methylation profiles.
+As with other applications, these low-dimensional representations are thought to capture a set of important, unmeasured sources of biological variability in the data, so that projection into these spaces places biologically-similar examples close together.
+For this reason, they are often termed latent spaces.
+One method used several stacked binary restricted Boltzmann machines (forming a deep neural network) to learn a low-dimensional subspace representation of the methylation profiles of the 5000 CpG sites with the highest variance across 136 breast tissue samples from women (113 breast cancer samples and 23 non-cancerous samples); the samples were then clustered in the latent space (via self-organizing maps) to show that it could differentiate breast cancer samples from non-neoplastic samples.
+Furthermore, the latent space was visualized using t-SNE (t-distributed stochastic neighbor embedding) [@arxiv:1808.01359]. 
+Titus et al. [@doi:10.5220/0006636401400145] adapted a VAE strategy developed by Way et al. [@doi:10.1142/9789813235533_0008] to methylation data.
+The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples.
+The authors performed t-SNE visualization and clustering, and classified tumor subtypes in a TCGA breast cancer dataset.
+In a subsequent extension of this work [@doi:10.1101/433763v5], the authors constructed a 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected the latent space dimensions most highly associated with the distinction between estrogen receptor (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
+Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response. 
+Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders. After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
+These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology. These techniques allow for the discovery and analysis of latent biology that was difficult to capture with previously developed models. The power and robustness of unsupervised deep learning models come from their ability to learn high-dimensional non-linear relationships among data.
+
+Important future applications include predicting methylation and pathological states from methylation profiles derived from noisier datasets, such as those from solid tissue samples rather than blood samples.
+A more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes can be developed using unsupervised deep learning approaches such as variational autoencoders that leverage more of the measured data. 
+In addition, latent representations may assist with biological hypothesis generation and have the ability to stratify patients by predicted risk. 
+While neural-network embeddings can outperform traditional embeddings, many of these methods are highly sensitive to hyperparameter tuning, so an evaluation of its impact should be included [@doi:10.1101/385534].
+
 ### Splicing
 
 Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of exons or including parts of introns, creating enormous spatiotemporal flexibility to generate multiple distinct proteins from a single gene.