A C++ library that implements Stochastic Variational Inference for Motif Elicitation (SVIME). It discovers an unbounded number of motifs over DNA sequences from FASTA files and produces their logos.
Simply copy the source and header files to the src folder of your project.
C++
- Boost
- Eigen
- OpenMP
Python3
- Matplotlib
- Numpy
- Pandas
We'll find Oct4 binding motifs in DNA sequences overlapping Oct4 ChIP-seq peaks from [1]. Download the 'oct4_sorted.fa' file from https://github.com/tahmidmehdi/svime/tree/master/data. This file contains the binding sites.
- Create a project with a source file and include the following headers:
#include "svime.h"
#include "distribution.h"
#include "util.h"
#include "asa103.hpp"
#include "processFasta.h"
#include <omp.h>
#include <boost/foreach.hpp>
#include <boost/math/special_functions/digamma.hpp>
#include <Eigen/StdVector>
- In your main function, create a mapping of chromosomes to sizes (number of 15-mers in the chromosome based on the FASTA file) with the `faToMatrix` function. This also writes the sequences and their genomic coordinates to .txt files in the specified output directory.
std::map<std::string, int> chrSizes = faToMatrix("/path/to/oct4_sorted.fa", 15, "/path/to/output");
- Create an array of parameters for the step-size function of SVI, which is described in [2]. The first element of the array is the tau parameter and the second is the kappa parameter. Then create a pointer to the first element (tau).
float step[2] = {0, 0.5};
float* stepPtr = step;
- Create a `svime` object. The arguments are described in the next section.
svime model = svime(15, 1, 1, 20, stepPtr, 1000, 10, 4, 42);
- Fit the model and find motifs. The logos are written to /path/to/output/results.
svime::variationalDist q = model.fit_predict("/path/to/output", chrSizes, NULL);
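The step-size parameters used in the steps above follow the schedule from [2], where the step at iteration t is (t + tau)^(-kappa). The sketch below only illustrates that formula; `stepSize` is a hypothetical helper and not part of the svime API, and it assumes iterations are counted from 1, which may differ from how the library indexes them internally.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of svime): Robbins-Monro step size from [2],
// rho_t = (t + tau)^(-kappa), with step_pars = {tau, kappa}.
// Assumes t is a 1-based iteration index.
double stepSize(int t, const float* step_pars) {
    assert(t >= 1);
    const double tau = step_pars[0];    // delay: down-weights early iterations
    const double kappa = step_pars[1];  // forgetting rate: how fast steps shrink
    return std::pow(t + tau, -kappa);
}
```

With the array `{0, 0.5}` from the example, the step shrinks as 1 / sqrt(t), so later batches change the variational parameters less than earlier ones.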
The `svime` class implements SVI for a Dirichlet process mixture of product-multinomials [3]. Its constructor takes the following arguments:
Argument | Data type | Description |
---|---|---|
window | int | required. Length of motifs. |
alpha | float | required. The concentration parameter (alpha) of the Dirichlet process. Controls how readily the model creates new motifs: higher values produce more motifs. |
epochs | int | required. The maximum number of epochs. |
max_clusters | int | required. The maximum number of motifs the model can create. |
step_pars | float* | required. Array of parameters for step-size function. |
batch_size | int | optional (default: 1000). Number of window-mers in each batch. |
tol | float | optional (default: 0.001). The algorithm stops when the difference between the evidence lower bounds (ELBOs) of two consecutive iterations is less than tol. |
n_jobs | int | optional (default: 1). The number of threads to use. |
random_state | int | optional (default: 42). Determines the initial clusters and ensures reproducibility. |
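The stopping rule described for `tol` can be sketched as follows. `hasConverged` is a hypothetical name used for illustration only, and comparing the absolute difference is an assumption; the library may compare a signed difference instead.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of svime): true when the ELBO changed by less
// than tol between two consecutive iterations.
// Assumption: the absolute difference is compared against tol.
bool hasConverged(double previousElbo, double currentElbo, double tol) {
    assert(tol > 0.0);
    return std::fabs(currentElbo - previousElbo) < tol;
}
```

A larger `tol` therefore stops the fit earlier, while `epochs` still caps the total amount of work if the ELBO never settles.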
fit_predict(outDir, chrSizes, hyperparameters = NULL)
Argument | Data type | Description |
---|---|---|
outDir | string | required. Output directory. |
chrSizes | map<string, int> | required. A mapping of chromosomes to their sizes. |
hyperparameters | psm* | optional (default: all concentrations are set to 1). A position score matrix (psm struct) of prior concentration parameters for each base and position. |
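To give a sense of what the values in `chrSizes` represent, the sketch below counts window-mers per chromosome from sequence lengths. It is an illustration rather than the behaviour of `faToMatrix`: `countWindows` and `windowCounts` are hypothetical helpers, and the sketch assumes overlapping windows that advance one base at a time.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Number of overlapping windows of length `window` in a sequence of
// `seqLength` bases, assuming a stride of one base.
int countWindows(int seqLength, int window) {
    assert(window >= 1);
    return seqLength >= window ? seqLength - window + 1 : 0;
}

// Hypothetical helper (not part of svime): sums window counts per chromosome
// from (chromosome name, sequence length) records.
std::map<std::string, int> windowCounts(
        const std::vector<std::pair<std::string, int>>& records, int window) {
    std::map<std::string, int> sizes;
    for (const auto& record : records) {
        sizes[record.first] += countWindows(record.second, window);
    }
    return sizes;
}
```

Under these assumptions, a 20-base sequence contributes six 15-mers and a sequence shorter than the window contributes none.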
[1] Kopp, W. and Schulte-Sasse, R. (2017). Unsupervised learning of DNA sequence features using a convolutional restricted Boltzmann machine. bioRxiv.
[2] Hoffman, M. D. et al. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1), 1303-1347.
[3] Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104, 1042-1051.