From 2cf71e4c645d708d7dd8a81c585522768d9cf279 Mon Sep 17 00:00:00 2001
From: Alexander Titus
Date: Thu, 9 May 2019 18:05:03 -0400
Subject: [PATCH 1/7] Update 04.study.md

---
 content/04.study.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/content/04.study.md b/content/04.study.md
index 05c5054f..d9511208 100644
--- a/content/04.study.md
+++ b/content/04.study.md
@@ -47,6 +47,26 @@ Deep learning applied to gene expression data is still in its infancy, but the f
 Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies.
 For example, the effects of cellular heterogeneity on basic biology and disease etiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.
 
+### DNA Methylation (to remove after merge)
+
+#### Latent Space Construction
+
+In addition to the prediction of methylation sites and levels, and classification and regression of these profiles, new work has extracted biologically significant features from methylation data using unsupervised deep learning.
+Particularly, unsupervised methods focus on construction of a latent space that semantically encodes biologically important features from methylation profiles.
+One method used several stacked binary restricted Boltzmann machines (forming a deep neural network) to learn a low-dimensional subspace representation of the methylation profiles of 5000 CpG sites with highest variance across 136 women breast tissue samples, 113 breast cancer samples and 23 non-cancerous samples, and samples in the latent space were clustered (via self-organizing maps) to show that the latent space could differentiate breast cancer samples from non-neoplastic samples.
+Furthermore, the latent space was visualized using t-SNE (t-distributed stochastic neighbor embedding) [@arxiv:1808.01359].
+The other method for latent space construction was inspired by Way and Greene’s work on Variational Auto-Encoders in the ovarian and pancancer setting for RNASeq data, which in addition to learning a latent semantic space, used vector subtraction to identify biologically relevant features that stratify patient sex, metastatic activation, cancer types and tumor subtypes, amongst others.
+
+Titus et al. adopted the VAE developed by Way et al. and modified it to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples, performing t-SNE visualizations, clustering, and classifying tumor subtypes from a Breast Cancer related TCGA dataset.
+Titus extended this work to find the 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to see if the latent space could serve as a useful tool for indicating possible responses to endocrine therapy.
+After backtracking these VAE dimensions to their highly contributing CpG loci, the pathways that were found at most of these loci were significantly associated with ER-status, the VAE dimensions were associated with CpGs of sparse non-coding regions, and a nonredundant set of CpGs were identified as being related to ER-positive and ER-negative status, suggesting VAE’s importance in epigenetic work for diagnosis, risk, prognosis, and treatment response ascertainment [@doi:10.5220/0006636401400145; @doi:10.1101/433763].
+Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders, and were able to classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
+
+Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples over blood samples.
+A more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes can be developed using unsupervised deep learning approaches such as variational autoencoders that leverage more of the measured data.
+In addition, latent representations may assist with biological hypothesis generation and have the ability to stratify patients by predicted risk.
+It should be noted here that while the performance of latent representations can outperform traditional embeddings, when selecting these models as benchmark methods, it is also important to be cognizant of the sensitivity of these generative models to hyperparameter tuning, which can oftentimes cause these models to underperform [@doi:10.1101/385534].
+
 ### Splicing
 
 Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of exons or including parts of introns, creating enormous spatiotemporal flexibility to generate multiple distinct proteins from a single gene.
From 14acad58f6a4f89c9ff2dd6aeb482bad9e885f5e Mon Sep 17 00:00:00 2001
From: Alexander Titus
Date: Mon, 22 Jul 2019 05:34:11 -0400
Subject: [PATCH 2/7] Apply suggestions from code review

Updates based on pull-request peer-review.
Co-Authored-By: Casey Greene
---
 content/04.study.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/content/04.study.md b/content/04.study.md
index d9511208..bf408ba5 100644
--- a/content/04.study.md
+++ b/content/04.study.md
@@ -51,21 +51,23 @@ For example, the effects of cellular heterogeneity on basic biology and disease
 
 #### Latent Space Construction
 
-In addition to the prediction of methylation sites and levels, and classification and regression of these profiles, new work has extracted biologically significant features from methylation data using unsupervised deep learning.
-Particularly, unsupervised methods focus on construction of a latent space that semantically encodes biologically important features from methylation profiles.
+Unsupervised discovery of biologically-significant features is another major area of interest for researchers using DNA methylation data.
+A consistent theme of these methods is that they construct a low-dimensional space that semantically encodes biologically important features from methylation profiles.
+As with other applications, these low-dimensional representations are thought to capture a set of important, unmeasured sources of biological variability in the data, and that projection into these spaces results in biologically-similar examples being close together.
+For this reason, they are often termed latent spaces.
 One method used several stacked binary restricted Boltzmann machines (forming a deep neural network) to learn a low-dimensional subspace representation of the methylation profiles of 5000 CpG sites with highest variance across 136 women breast tissue samples, 113 breast cancer samples and 23 non-cancerous samples, and samples in the latent space were clustered (via self-organizing maps) to show that the latent space could differentiate breast cancer samples from non-neoplastic samples.
 Furthermore, the latent space was visualized using t-SNE (t-distributed stochastic neighbor embedding) [@arxiv:1808.01359].
-The other method for latent space construction was inspired by Way and Greene’s work on Variational Auto-Encoders in the ovarian and pancancer setting for RNASeq data, which in addition to learning a latent semantic space, used vector subtraction to identify biologically relevant features that stratify patient sex, metastatic activation, cancer types and tumor subtypes, amongst others.
-
-Titus et al. adopted the VAE developed by Way et al. and modified it to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples, performing t-SNE visualizations, clustering, and classifying tumor subtypes from a Breast Cancer related TCGA dataset.
-Titus extended this work to find the 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to see if the latent space could serve as a useful tool for indicating possible responses to endocrine therapy.
-After backtracking these VAE dimensions to their highly contributing CpG loci, the pathways that were found at most of these loci were significantly associated with ER-status, the VAE dimensions were associated with CpGs of sparse non-coding regions, and a nonredundant set of CpGs were identified as being related to ER-positive and ER-negative status, suggesting VAE’s importance in epigenetic work for diagnosis, risk, prognosis, and treatment response ascertainment [@doi:10.5220/0006636401400145; @doi:10.1101/433763].
+Titus et al. [@doi:10.5220/0006636401400145] adapted a VAE strategy developed by Way et al. [@doi:10.1142/9789813235533_0008] to methylation data.
+The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples.
+The authors performed t-SNE visualization, clustering, and classified tumor subtypes from a Breast Cancer dataset from TCGA.
+In a subsequent extension of this work [@doi:10.1101/433763v5], the authors constructed a 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
+Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response.
 Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders, and were able to classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
 
 Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples over blood samples.
 A more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes can be developed using unsupervised deep learning approaches such as variational autoencoders that leverage more of the measured data.
 In addition, latent representations may assist with biological hypothesis generation and have the ability to stratify patients by predicted risk.
-It should be noted here that while the performance of latent representations can outperform traditional embeddings, when selecting these models as benchmark methods, it is also important to be cognizant of the sensitivity of these generative models to hyperparameter tuning, which can oftentimes cause these models to underperform [@doi:10.1101/385534].
+While neural-network embeddings can outperform traditional embeddings, it is important to be aware that many of these methods can be highly sensitive to hyperparameter tuning and an evaluation of the impact of hyperparameter tuning should be included [@doi:10.1101/385534].
 
 ### Splicing
 
From 8d90e52cffc190f39788d4a1443b556463af0e22 Mon Sep 17 00:00:00 2001
From: Alexander Titus
Date: Mon, 22 Jul 2019 05:42:17 -0400
Subject: [PATCH 3/7] Update 04.study.md

---
 content/04.study.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/04.study.md b/content/04.study.md
index bf408ba5..f10bc6fa 100644
--- a/content/04.study.md
+++ b/content/04.study.md
@@ -62,7 +62,8 @@ The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assign
 The authors performed t-SNE visualization, clustering, and classified tumor subtypes from a Breast Cancer dataset from TCGA.
 In a subsequent extension of this work [@doi:10.1101/433763v5], the authors constructed a 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
 Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response.
-Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders, and were able to classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
+Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders. After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
+These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology. These techniques are allowing for the discovery and analysis of latent biology that was previously under achieved using previously developed models. The power and robustness of unsupervised deep learning models comes from their ability to learn high-dimensional non-linear relationships among data.
 
 Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples over blood samples.
 A more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes can be developed using unsupervised deep learning approaches such as variational autoencoders that leverage more of the measured data.
From 1ecc7591671face216006b5396b153639ff3aac1 Mon Sep 17 00:00:00 2001
From: Alexander Titus
Date: Mon, 22 Jul 2019 05:49:51 -0400
Subject: [PATCH 4/7] Apply suggestions from code review

Co-Authored-By: Casey Greene
---
 content/04.study.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/content/04.study.md b/content/04.study.md
index f10bc6fa..d071fd57 100644
--- a/content/04.study.md
+++ b/content/04.study.md
@@ -62,8 +62,11 @@ The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assign
 The authors performed t-SNE visualization, clustering, and classified tumor subtypes from a Breast Cancer dataset from TCGA.
 In a subsequent extension of this work [@doi:10.1101/433763v5], the authors constructed a 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
 Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response.
-Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders. After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
-These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology. These techniques are allowing for the discovery and analysis of latent biology that was previously under achieved using previously developed models. The power and robustness of unsupervised deep learning models comes from their ability to learn high-dimensional non-linear relationships among data.
+Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders.
+After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
+These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology.
+These techniques are allowing for the discovery and analysis of latent biology that was previously under achieved using previously developed models.
+The power and robustness of unsupervised deep learning models comes from their ability to learn high-dimensional non-linear relationships among data.
 
 Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples over blood samples.
 A more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes can be developed using unsupervised deep learning approaches such as variational autoencoders that leverage more of the measured data.
From 708487bb282710c1f447c429486b4182678bde09 Mon Sep 17 00:00:00 2001
From: Alexander Titus
Date: Mon, 22 Jul 2019 05:56:42 -0400
Subject: [PATCH 5/7] Update content/04.study.md

Additional clarity

Co-Authored-By: Casey Greene
---
 content/04.study.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/04.study.md b/content/04.study.md
index d071fd57..515605ef 100644
--- a/content/04.study.md
+++ b/content/04.study.md
@@ -65,7 +65,7 @@ Certain latent space dimensions differentiated tumors based on their ER status a
 Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders.
 After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
 These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology.
-These techniques are allowing for the discovery and analysis of latent biology that was previously under achieved using previously developed models.
+Techniques that produce these representations provide the opportunity to discover important biological features that were previously missed.
 The power and robustness of unsupervised deep learning models comes from their ability to learn high-dimensional non-linear relationships among data.
 
 Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples over blood samples.
From 4ba168d48ce8cd52370c9fa5e31931c646885ee5 Mon Sep 17 00:00:00 2001
From: Alexander Titus
Date: Mon, 22 Jul 2019 05:57:51 -0400
Subject: [PATCH 6/7] Apply suggestions from code review

Updates based on peer-review

Co-Authored-By: Casey Greene
---
 content/04.study.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/04.study.md b/content/04.study.md
index 515605ef..5deca4c6 100644
--- a/content/04.study.md
+++ b/content/04.study.md
@@ -66,10 +66,10 @@ Another study explored the latent features of lung cancer methylation profiles t
 After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
 These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology.
 Techniques that produce these representations provide the opportunity to discover important biological features that were previously missed.
-The power and robustness of unsupervised deep learning models comes from their ability to learn high-dimensional non-linear relationships among data.
+The power of unsupervised deep learning models for this task comes from their ability to learn high-dimensional non-linear relationships among data.
 
 Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples over blood samples.
-A more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes can be developed using unsupervised deep learning approaches such as variational autoencoders that leverage more of the measured data.
+Unsupervised deep learning approaches such as variational autoencoders, which leverage measured points to produce a generative, low-dimensional representation, may provide a more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes.
 In addition, latent representations may assist with biological hypothesis generation and have the ability to stratify patients by predicted risk.
 While neural-network embeddings can outperform traditional embeddings, it is important to be aware that many of these methods can be highly sensitive to hyperparameter tuning and an evaluation of the impact of hyperparameter tuning should be included [@doi:10.1101/385534].
 
From a91eb5c9dff743abed1e68a2f0d0162df5627012 Mon Sep 17 00:00:00 2001
From: Alexander Titus
Date: Mon, 22 Jul 2019 19:18:57 -0400
Subject: [PATCH 7/7] Update content/04.study.md

resolve citation error

Co-Authored-By: Casey Greene
---
 content/04.study.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/04.study.md b/content/04.study.md
index 74e0ba2e..785b25e1 100644
--- a/content/04.study.md
+++ b/content/04.study.md
@@ -77,7 +77,7 @@ Furthermore, the latent space was visualized using t-SNE (t-distributed stochast
 Titus et al. [@doi:10.5220/0006636401400145] adapted a VAE strategy developed by Way et al. [@doi:10.1142/9789813235533_0008] to methylation data.
 The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples.
 The authors performed t-SNE visualization, clustering, and classified tumor subtypes from a Breast Cancer dataset from TCGA.
-In a subsequent extension of this work [@doi:10.1101/433763v5], the authors constructed a 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
+In a subsequent extension of this work [@doi:10.1101/433763], the authors constructed a 100-dimensional latent space of 100k CpG sites across around 1200 samples, and selected latent space dimensions that were the most highly associated with the differentiation between estrogen-response (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy.
 Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response.
 Another study explored the latent features of lung cancer methylation profiles that were extracted using variational autoencoders.
 After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].