Third section of methylation #955

Merged
merged 8 commits into from
Jul 23, 2019
Changes from 3 commits
23 changes: 23 additions & 0 deletions content/04.study.md
@@ -47,6 +47,29 @@ Deep learning applied to gene expression data is still in its infancy, but the f
Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies.
For example, the effects of cellular heterogeneity on basic biology and disease etiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.

### DNA Methylation (to remove after merge)

Pretty sure this will cause merge conflicts, so noting here that we should merge this last.


#### Latent Space Construction

Unsupervised discovery of biologically significant features is another major area of interest for researchers using DNA methylation data.
A consistent theme of these methods is that they construct a low-dimensional space that semantically encodes biologically important features from methylation profiles.
As with other applications, these low-dimensional representations are thought to capture important, unmeasured sources of biological variability in the data, so that projecting samples into these spaces places biologically similar examples close together.
For this reason, they are often termed latent spaces.
One method used several stacked binary restricted Boltzmann machines (forming a deep neural network) to learn a low-dimensional subspace representation of the methylation profiles of the 5000 CpG sites with the highest variance across 136 breast tissue samples from women (113 breast cancer samples and 23 non-cancerous samples).
Samples in the latent space were then clustered (via self-organizing maps) to show that the latent space could differentiate breast cancer samples from non-neoplastic samples.
Furthermore, the latent space was visualized using t-SNE (t-distributed stochastic neighbor embedding) [@arxiv:1808.01359].
Titus et al. [@doi:10.5220/0006636401400145] adapted a VAE strategy developed by Way et al. [@doi:10.1142/9789813235533_0008] to methylation data.
The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples.
The authors performed t-SNE visualization and clustering, and classified tumor subtypes in a breast cancer dataset from TCGA.
In a subsequent extension of this work [@doi:10.1101/433763v5], the authors constructed a 100-dimensional latent space from 100,000 CpG sites across approximately 1,200 samples and selected the latent dimensions most strongly associated with estrogen receptor (ER) status (positive versus negative) in breast tumor samples, in order to determine the extent to which the latent space could predict responses to endocrine therapy.
Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response.
Another study explored latent features of lung cancer methylation profiles extracted using variational autoencoders.
After constructing latent space representations of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365].
These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools for exploring unmeasured aspects of biology.
These techniques allow the discovery and analysis of latent biology that was not fully captured by previously developed models.
The power and robustness of unsupervised deep learning models come from their ability to learn high-dimensional non-linear relationships among data.
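
To make the VAE-based compression described above more concrete, the following is a minimal sketch of a single forward pass and loss evaluation, not the implementation used in any of the cited studies. It assumes methylation beta values scaled to [0, 1], a one-hidden-layer encoder, and a linear decoder with a sigmoid output; all function names, layer sizes, and the synthetic input are hypothetical, and no training loop is included.

```python
import numpy as np

def init_params(n_cpg, n_hidden, n_latent, rng):
    """Small random weights for a toy encoder/decoder (sizes are illustrative assumptions)."""
    s = 0.01
    return {
        "W_h": rng.normal(0.0, s, (n_cpg, n_hidden)), "b_h": np.zeros(n_hidden),
        "W_mu": rng.normal(0.0, s, (n_hidden, n_latent)), "b_mu": np.zeros(n_latent),
        "W_lv": rng.normal(0.0, s, (n_hidden, n_latent)), "b_lv": np.zeros(n_latent),
        "W_out": rng.normal(0.0, s, (n_latent, n_cpg)), "b_out": np.zeros(n_cpg),
    }

def encode(x, p):
    """Map methylation profiles to the mean and log-variance of a Gaussian over latent features."""
    h = np.maximum(0.0, x @ p["W_h"] + p["b_h"])  # ReLU hidden layer
    mu = h @ p["W_mu"] + p["b_mu"]
    log_var = h @ p["W_lv"] + p["b_lv"]
    return mu, log_var

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z, p):
    """Map latent features back to reconstructed beta values in (0, 1)."""
    logits = z @ p["W_out"] + p["b_out"]
    return 1.0 / (1.0 + np.exp(-logits))

def vae_loss(x, p, rng):
    """Mean negative ELBO: cross-entropy reconstruction term plus KL divergence to N(0, I)."""
    mu, log_var = encode(x, p)
    z = reparameterize(mu, log_var, rng)
    x_hat = np.clip(decode(z, p), 1e-7, 1.0 - 1e-7)
    recon = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=1)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)
    return float(np.mean(recon + kl)), mu

# Synthetic stand-in for a samples-by-CpG matrix of beta values (not real data).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(8, 500))
params = init_params(n_cpg=500, n_hidden=64, n_latent=10, rng=rng)
loss, latent = vae_loss(x, params, rng)
```

In this sketch, `latent` holds the per-sample latent means, which play the role of the composite features that downstream steps such as t-SNE, clustering, or a logistic regression classifier would consume.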

Important future applications include predicting methylation and pathological states from methylation profiles in noisier datasets, such as those derived from solid tissue samples rather than blood samples.
A more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes can be developed using unsupervised deep learning approaches such as variational autoencoders that leverage more of the measured data.
In addition, latent representations may assist with biological hypothesis generation and have the ability to stratify patients by predicted risk.
While neural-network embeddings can outperform traditional embeddings, many of these methods can be highly sensitive to hyperparameter tuning, so an evaluation of the impact of hyperparameter choices should be included [@doi:10.1101/385534].
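
One way to report hyperparameter sensitivity is to sweep a setting over several values and random data splits and to tabulate a held-out metric. The sketch below illustrates the idea for a single hyperparameter, the latent dimensionality. As a loudly labeled simplification, it swaps in a linear embedding computed by truncated SVD for the neural-network embedding, and it uses synthetic data; the function names and the choice of held-out reconstruction error as the metric are assumptions made for illustration.

```python
import numpy as np

def heldout_reconstruction_error(x_train, x_test, n_latent):
    """Fit a linear embedding (truncated SVD, a stand-in for a neural embedding) and score held-out data."""
    mean = x_train.mean(axis=0)
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    components = vt[:n_latent]                  # shape: (n_latent, n_features)
    z = (x_test - mean) @ components.T          # project held-out samples into the latent space
    x_hat = z @ components + mean               # reconstruct from the latent features
    return float(np.mean((x_test - x_hat) ** 2))

def sweep(x, latent_sizes, seeds, test_fraction=0.25):
    """Return {latent size: (mean error, std of error)} across random train/test splits."""
    results = {}
    n_test = int(round(test_fraction * x.shape[0]))
    for k in latent_sizes:
        errors = []
        for seed in seeds:
            order = np.random.default_rng(seed).permutation(x.shape[0])
            test_idx, train_idx = order[:n_test], order[n_test:]
            errors.append(heldout_reconstruction_error(x[train_idx], x[test_idx], k))
        results[k] = (float(np.mean(errors)), float(np.std(errors)))
    return results

# Synthetic data with five underlying factors plus noise (not real methylation data).
rng = np.random.default_rng(0)
factors = rng.normal(size=(80, 5))
loadings = rng.normal(size=(5, 200))
x = factors @ loadings + 0.1 * rng.normal(size=(80, 200))

results = sweep(x, latent_sizes=[2, 5, 10], seeds=[0, 1, 2])
for k, (mean_err, std_err) in results.items():
    print(f"latent size {k}: held-out MSE {mean_err:.4f} +/- {std_err:.4f}")
```

Reporting the spread across splits alongside the mean makes it easier to judge whether a difference between settings reflects the hyperparameter itself or variability in the data partition.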

### Splicing

Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of exons or including parts of introns, creating enormous spatiotemporal flexibility to generate multiple distinct proteins from a single gene.