A curated list of "awesome" machine learning resources, datasets, and papers in epigenetics research. This collection aims to bridge the gap between machine learning and epigenetics, providing valuable references for researchers and practitioners.
- Epigenomic language models powered by Cerebras (2021) - BERT model pretrained on human genome and across 127 cell types with DNA sequence and paired epigenetic state inputs.
- MethylNet: Deep learning for DNA methylation analysis (2020) - VAE for analyzing DNA methylation data.
- DeepCpG: Accurate prediction of single-cell DNA methylation states using deep learning (2017) - CNN model for predicting single-cell DNA methylation states.
- Detection of significantly differentially methylated regions in targeted bisulfite sequencing data (2013) - Statistical model for identifying differentially methylated regions (DMRs) in targeted bisulfite sequencing data (i.e., clustering/segmentation).
- A nonparametric Bayesian approach for clustering bisulfate-based DNA methylation profiles (2012) - Bayesian statistical model for clustering/segmenting microarray data.
- DeepHistone: A deep learning approach to predicting histone modifications (2019) - CNN-based model for histone modification prediction.
- DeepDiff: DEEP-learning for predicting DIFFerential gene expression from histone modifications (2018) - Hybrid (attention + LSTM) deep learning model for gene expression prediction from histone modification.
- Effective gene expression prediction from sequence by integrating long-range interactions (2021) - Transformer-based model (Enformer) for predicting gene expression and chromatin accessibility from DNA sequence.
- DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning (2019) - A bootstrapping deep learning model to predict chromatin contacts between regulatory elements.
- cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data (2019) - A probabilistic framework used to simultaneously discover coaccessible enhancers and stable cell states from sparse single-cell epigenomics data.
- SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks (2021) - Gradient-boosting regression model for single-cell multiomic inference of enhancers and gene regulatory networks.
- MOFA+: A probabilistic framework for comprehensive integration of structured single-cell data (2020) - Framework for integrating multiple omics data types.
- A deep multiple instance learning framework improves microsatellite instability detection from tumor next generation sequencing (2025) - MIL for MSI detection.
- Large language model produces high accurate diagnosis of cancer from end-motif profiles of cell-free DNA (2024) - LLM-based approach for cancer diagnosis using cfDNA end-motif profiles.
- MethylGPT: a foundation model for the DNA methylome (2024) - A transformer-decoder-based LM pretrained on methylation microarray data.
- Transformer-based representation learning and multiple-instance learning for cancer diagnosis exclusively from raw sequencing fragments of bisulfite-treated plasma cell-free DNA (2024) - Transformer encoder + attention-based MIL for CRC and HCC detection.
- Deep generative AI models analyzing circulating orphan non-coding RNAs enable detection of early-stage lung cancer (2024) - VAE-based model for early lung cancer detection using circulating RNAs.
- Transformer-based AI technology improves early ovarian cancer diagnosis using cfDNA methylation markers (2024) - BERT-like model on CpG sites.
- Development of a deep learning model for cancer diagnosis by inspecting cell-free DNA end-motifs (2024) - Transformer encoder that captures end-motif signatures for HCC.
- Deep learning model integrating cfDNA methylation and fragment size profiles for lung cancer diagnosis (2024) - CNN for lung cancer diagnosis.
- Early detection of hepatocellular carcinoma via no end-repair enzymatic methylation sequencing of cell-free DNA and pre-trained neural network (2023) - BERT-like model for early HCC detection.
- Comprehensive tissue deconvolution of cell-free DNA by deep learning for disease diagnosis and monitoring (2023) - MLE application in cfDNA tissue deconvolution.
- MethylBERT: A Transformer-based model for read-level DNA methylation pattern identification and tumour deconvolution (2023, now available in Nature Communications 2025) - BERT-like model pre-trained on human reference genome and adapted for methylation sequence profiles.
- Bridging biological cfDNA features and machine learning approaches (2023) - Biological background for ML practitioners.
- The cell-free DNA methylome captures distinctions between localized and metastatic prostate tumors (2022) - Methylome analysis for prostate cancer staging.
- Tumor fractions deciphered from circulating cell-free DNA methylation for cancer early diagnosis (2022) - Bayesian modeling for tumor fraction estimation.
- DISMIR: Deep learning-based noninvasive cancer detection by integrating DNA sequence and methylation information of individual cell-free DNA reads (2021) - Hybrid sequence model (ConvNet+LSTM) for HCC detection with maximization of tumor fraction posterior probability.
- CancerDetector: ultrasensitive and non-invasive cancer detection at the resolution of individual reads using cell-free DNA methylation sequencing data (2018) - Statistical model for read-level cancer detection from cfDNA.
- Multimodal cell-free DNA whole-genome TAPS is sensitive and reveals specific cancer signals (2025) - Deep whole-genome assay that is less destructive than bisulfite sequencing.
- Enzymatic methyl sequencing detects DNA methylation at single-base resolution from picograms of DNA (2021) - Enzymatic methyl sequencing (EM-seq), an enzymatic alternative to bisulfite conversion.
- scNMT-seq: Single-cell nucleosome, methylation and transcription sequencing (2018) - Joint profiling of nucleosome occupancy, DNA methylation, and transcription in single cells.
- DNA methylation detection: Bisulfite genomic sequencing analysis (2011) - Background on bisulfite sequencing. See also the Wikipedia articles on bisulfite sequencing and reduced representation bisulfite sequencing.
- ENCODE - Encyclopedia of DNA Elements.
- Roadmap Epigenomics - Comprehensive mapping of epigenomic states.
- GEO - Gene Expression Omnibus, contains various epigenetics datasets.
- EWAS Atlas - A comprehensive database for epigenome-wide association studies.
- ClockBase - A curated methylation database for biological ages.
Your contributions are always welcome! Please follow the guidelines below to contribute.
- Ensure your suggestion is not already included
- Make an individual pull request for each suggestion
- Use the following format:
[Resource Name](Link) (Year) - Description.
- Keep descriptions concise and clear
- Check your spelling and grammar
- Make sure your text editor is set to remove trailing whitespace
- Add your suggestion to the most relevant category
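
As a rough illustration of the entry format and the trailing-whitespace rule above, the following sketch checks a single list line before you open a pull request. It is not part of this repository: the `ENTRY_RE` pattern and the `check_entry` helper are hypothetical names, the URL is a placeholder, and the pattern assumes a strict reading of the format (a leading `- `, a four-digit year, and a description ending in a period), so entries without a year would be flagged.

```python
import re

# Hypothetical pattern for the format "[Resource Name](Link) (Year) - Description."
# written as a markdown list item. This is a strict reading of the guideline.
ENTRY_RE = re.compile(
    r"^- \[(?P<name>[^\]]+)\]\((?P<link>\S+)\) "
    r"\((?P<year>\d{4})\) - (?P<description>.+\.)$"
)


def check_entry(line: str) -> list[str]:
    """Return a list of problems found in one list entry (empty if none)."""
    problems = []
    # Guideline: no trailing whitespace.
    if line != line.rstrip():
        problems.append("trailing whitespace")
    # Guideline: entry follows the documented format.
    if ENTRY_RE.match(line.rstrip()) is None:
        problems.append("does not match '[Resource Name](Link) (Year) - Description.'")
    return problems


if __name__ == "__main__":
    # Placeholder entry; the link is illustrative only.
    entry = "- [DeepCpG](https://example.org/deepcpg) (2017) - CNN model for single-cell methylation."
    print(check_entry(entry) or "looks fine")
```

The check covers only format and whitespace; duplicates, category placement, spelling, and grammar still need a manual look.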