Skip to content

Commit

Permalink
update matrix simulation + change lfc wording
Browse files Browse the repository at this point in the history
  • Loading branch information
bvieth committed Nov 30, 2020
1 parent 2a63261 commit efe3506
Show file tree
Hide file tree
Showing 45 changed files with 2,432 additions and 2,177 deletions.
43 changes: 22 additions & 21 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -2,29 +2,30 @@ Package: powsimR
Type: Package
Title: Power Simulations for RNA-sequencing
Description: Recent development of very sensitive RNA-seq protocols, such as Smart-seq2 and CEL-seq allows
transcriptional profiling at single-cell resolution and droplet devices make single cell transcriptomics
high-throughput, allowing to characterize thousands or even millions of single cells. In powsimR, we have
implemented a flexible tool to assess power and sample size requirements for differential expression (DE)
analysis of single cell and bulk RNA-seq experiments. For our read count simulations, we (1) reliably
model the mean, dispersion and dropout distributions as well as the relationship between those factors
from the data. (2) Simulate read counts from the empirical mean-variance and dropout relations, while
offering flexible choices of the number of differentially expressed genes, effect sizes and DE testing
method. (3) Finally, we evaluate the power over various sample sizes. The number of replicates required
to achieve the desired statistical power is mainly determined by technical noise and biological
variability and both are considerably larger if the biological replicates are single cells. powsimR can
not only estimate sample sizes necessary to achieve a certain power, but also informs about the power to
detect DE in a data set at hand. We believe that this type of posterior analysis will become more and
more important, if results from different studies are to be compared. Often enough researchers are left
to wonder why there is a lack of overlap in DE-genes across similar experiments. PowsimR will allow the
researcher to distinguish between actual discrepancies and incongruities due to lack of power.
Version: 1.2.3
transcriptional profiling at single-cell resolution and droplet devices make single cell
transcriptomics high-throughput, allowing to characterize thousands or even millions of single cells.
In powsimR, we have implemented a flexible tool to assess power and sample size requirements for
differential expression (DE) analysis of single cell and bulk RNA-seq experiments. For our read count
simulations, we (1) reliably model the mean, dispersion and dropout distributions as well as the
relationship between those factors from the data. (2) Simulate read counts from the empirical
mean-variance and dropout relations, while offering flexible choices of the number of differentially
expressed genes, effect sizes and DE testing method. (3) Finally, we evaluate the power over various
sample sizes. The number of replicates required to achieve the desired statistical power is mainly
determined by technical noise and biological variability and both are considerably larger if the
biological replicates are single cells. powsimR can not only estimate sample sizes necessary to achieve
a certain power, but also informs about the power to detect DE in a data set at hand. We believe that
this type of posterior analysis will become more and more important, if results from different studies
are to be compared. Often enough researchers are left to wonder why there is a lack of overlap in
DE-genes across similar experiments. PowsimR will allow the researcher to distinguish between actual
discrepancies and incongruities due to lack of power.
Version: 1.2.4
Imports: bayNorm, baySeq, BiocGenerics, BiocParallel, broom, BPSC, cobs, cowplot, data.table, DECENT, DESeq2,
doParallel, dplyr, DrImpute, EBSeq, edgeR, fastICA, fitdistrplus, foreach, future, ggplot2, ggpubr,
ggstance, grDevices, grid, Hmisc, iCOBRA, IHW, kernlab, limma, Linnorm, magrittr, MASS, MAST, Matrix,
matrixStats, methods, minpack.lm, moments, monocle, msir, NBPSeq, NOISeq, nonnest2, parallel, penalized,
plotrix, plyr, pscl, qvalue, reshape2, rlang, Rmagic, ROTS, rsvd, Rtsne, RUVSeq, S4Vectors, SAVER,
scales, scater, scDD, scde, SCnorm, scone, scran, sctransform, Seurat, SingleCellExperiment, stats,
SummarizedExperiment, tibble, tidyr, truncnorm, utils, VGAM, ZIM, zinbwave, zingeR, zoo
matrixStats, methods, minpack.lm, moments, monocle, msir, NBPSeq, NOISeq, nonnest2, parallel,
penalized, plotrix, plyr, pscl, qvalue, reshape2, rlang, Rmagic, ROTS, rsvd, Rtsne, RUVSeq, S4Vectors,
SAVER, scales, scater, scDD, scde, SCnorm, scone, scran, sctransform, Seurat, SingleCellExperiment,
stats, SummarizedExperiment, tibble, tidyr, truncnorm, utils, VGAM, ZIM, zinbwave, zingeR, zoo
Remotes: nghiavtr/BPSC, cz-ye/DECENT, mohuangx/SAVER, rhondabacher/SCnorm, statOmics/zingeR, bioc::bayNorm,
bioc::baySeq, bioc::BiocGenerics, bioc::BiocParallel, bioc::DESeq2, bioc::EBSeq, bioc::edgeR,
bioc::iCOBRA, bioc::IHW, bioc::limma, bioc::Linnorm, bioc::MAST, bioc::monocle, bioc::NOISeq,
Expand All @@ -38,7 +39,7 @@ VignetteBuilder: knitr
License: GPL
NeedsCompilation: no
Author: Beate Vieth
Date: 2020-06-30
Date: 2020-11-30
BugReports: https://github.com/bvieth/powsimR/issues
URL: https://github.com/bvieth/powsimR
Maintainer: Beate Vieth <[email protected]>
Expand Down
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
# Version 1.2.4 (2020-11-30)

* Change description of log fold change. Change settings in simulation to matrix multiplication.

# Version 1.2.3 (2020-6-30)

Expand Down
15 changes: 8 additions & 7 deletions R/Setup.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,10 @@
#' @title Setup options for RNA-seq count simulations
#' @description This function generates the settings needed for \code{\link{simulateDE}}.
#' Firstly a set of differential expressed gene IDs with
#' associated fold changes for a given number of genes, simulations and
#' associated log fold changes for a given number of genes, simulations and
#' fraction of DE genes is generated. There are also a number of options relating
#' to count simulations such as downsampling.
#' Secondly, the estimated parameters for count simulations of genes (and spikes) are added.
#' Secondly, the estimated parameters for count simulations of genes (and spikes) as well as samples are added.
#' @usage Setup(ngenes=NULL, nsims=25,
#' p.DE = 0.1, pLFC = 1, p.G = 1,
#' p.B=NULL, bLFC=NULL, bPattern="uncorrelated",
Expand All @@ -24,17 +24,18 @@
#' @param p.DE Numeric vector between 0 and 1 representing
#' the percentage of genes being differentially expressed due to phenotype,
#' i.e. biological signal. Default is \code{0.1}.
#' If no differential expression is desired, then set \code{p.DE=0}.
#' @param pLFC The log phenotypic fold change for DE genes. This can be:
#' (1) a constant, e.g. 2;
#' (2) a vector of values with length being number of DE genes.
#' If the input is a vector and the length is not the number of DE genes,
#' it will be sampled with replacement to generate log fold changes;
#' (3) a function that takes an integer n, and generates a vector of length n,
#' e.g. function(x) rnorm(x, mean=0, sd=1.5).
#' Default is \code{1}.
#' Please note that the fold change should be on \code{\link[base]{log2}} scale!
#' Default is \code{1}, i.e. a log2 fold change of 1 or fold change of 2.
#' Please note that the log fold change should be on \code{\link[base]{log2}} scale!
#' @param p.G Numeric vector indicating the proportion of replicates per group
#' that express the phenotypic fold change. Default is \code{1}, this means all show the expressiond difference.
#' that express the phenotypic fold change. Default is \code{1}, this means all show the expression difference.
#' For example, if \code{0.5} and \code{n1 = 10} and \code{n2 = 8},
#' then only 5 replicates in group 1 and 4 replicates in group 2 express the phenotypic fold change.
#' @param p.B Numeric vector between 0 and 1 representing the percentage of genes
Expand All @@ -49,7 +50,7 @@
#' e.g. function(x) rnorm(x, mean=0, sd=1.5).
#' Note that the simulations of only two batches is implemented.
#' Default is \code{NULL}, i.e. no batch effect.
#' Please note that the fold change should be on \code{\link[base]{log2}} scale!
#' Please note that the log fold change should be on \code{\link[base]{log2}} scale!
#' @param bPattern Character vector for batch effect pattern if \code{p.B} is non-null.
#' Possible options include:
#' "uncorrelated", "orthogonal" and " correlated".
Expand Down Expand Up @@ -118,7 +119,7 @@
#' RNAseq = 'singlecell', Protocol = 'Read',
#' Distribution = 'ZINB', Normalisation = "scran",
#' GeneFilter = 0.1, SampleFilter = 3,
#' sigma = 1.96, NCores = NULL, verbose = TRUE)
#' sigma = 1.96, NCores = NULL, verbose = TRUE)
#' # estimate spike parameters
#' data("SmartSeq2_SpikeIns_Read_Counts")
#' data("SmartSeq2_SpikeInfo")
Expand Down
37 changes: 22 additions & 15 deletions R/utils_sim.R
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,7 @@

# define NB params
means = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$means
means = means[means>0]
sizes = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$size
meansizefit = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meansizefit

Expand All @@ -143,18 +144,19 @@
true.means = means[index]

# estimate size parameter associated with true mean values
lmu = log2(true.means+1)
lmu = log2(true.means)
predsize.mean = suppressWarnings(approx(meansizefit$x,
meansizefit$y,
xout = lmu, rule=2)$y)
predsize.sd = suppressWarnings(approx(meansizefit$x,
meansizefit$sd,
xout = lmu, rule=2)$y)
sizevec = truncnorm::rtruncnorm(n = length(lmu),
lsizevec = truncnorm::rtruncnorm(n = length(lmu),
a = min(log2(sizes), na.rm = TRUE),
b = max(log2(sizes), na.rm = TRUE),
mean = predsize.mean,
sd = predsize.sd)
sizevec = 2^lsizevec

# size factor
if (simOptions$SimSetup$LibSize == "equal") {
Expand All @@ -180,14 +182,15 @@
ind = !apply(modelmatrix, 2, function(x) { all(x == 1) })
mod = cbind(modelmatrix[, ind])
beta = cbind(coef.dat[, ind])
mumat = log2(effective.means+1) + beta %*% t(mod)
mumat[mumat < 0] = min(log2(effective.means+1))
coef.mat = 2^beta %*% t(mod)
mumat = effective.means * coef.mat
# mumat[mumat < 0] = min(log2(effective.means+1))

# result count matrix
counts = matrix(
stats::rnbinom(nsamples * ngenes,
mu = 2^mumat-1,
size = 2^sizevec),
mu = mumat,
size = sizevec),
ncol = nsamples,
nrow = ngenes,
dimnames = list(paste0(rownames(mumat),"_", seq_len(ngenes)),
Expand All @@ -200,8 +203,8 @@
return(list(counts=counts,
sf=all.facs,
mus=true.means,
disps=1/(2^sizevec),
sizes=2^sizevec,
disps=1/sizevec,
sizes=sizevec,
drops=dropout))
}

Expand All @@ -221,6 +224,7 @@

# define ZINB params
pos.means = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$pos.means
pos.means = pos.means[pos.means>0]
pos.sizes = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$pos.size
meansizefit = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meansizefit

Expand All @@ -235,18 +239,19 @@
true.means = pos.means[index]

# estimate size parameter associated with true mean values
lmu = log2(true.means+1)
lmu = log2(true.means)
predsize.mean = suppressWarnings(approx(meansizefit$x,
meansizefit$y,
xout = lmu, rule=2)$y)
predsize.sd = suppressWarnings(approx(meansizefit$x,
meansizefit$sd,
xout = lmu, rule=2)$y)
sizevec = truncnorm::rtruncnorm(n = length(lmu),
lsizevec = truncnorm::rtruncnorm(n = length(lmu),
a = min(log2(pos.sizes), na.rm = TRUE),
b = max(log2(pos.sizes), na.rm = TRUE),
mean = predsize.mean,
sd = predsize.sd)
sizevec = 2^lsizevec

# size factor
if (simOptions$SimSetup$LibSize == "equal") {
Expand All @@ -272,8 +277,9 @@
ind = !apply(modelmatrix, 2, function(x) { all(x == 1) })
mod = cbind(modelmatrix[, ind])
beta = cbind(coef.dat[, ind])
mumat = log2(effective.means+1) + beta %*% t(mod)
mumat[mumat < 0] = min(log2(effective.means+1))
coef.mat = 2^beta %*% t(mod)
mumat = effective.means * coef.mat
# mumat[mumat < 0] = min(log2(effective.means+1))

meang0fit.nonamplified = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meang0fit.nonamplified
meang0fit.amplified = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meang0fit.amplified
Expand Down Expand Up @@ -368,8 +374,8 @@
# make the count matrix
counts.nb = matrix(
stats::rnbinom(nsamples * ngenes,
mu = 2^mumat-1,
size = 2^sizevec),
mu = mumat,
size = sizevec),
ncol = nsamples,
nrow = ngenes)

Expand Down Expand Up @@ -403,7 +409,8 @@
return(list(counts=counts,
sf=all.facs,
mus=true.means,
disps=sizevec,
disps=1/sizevec,
sizes=sizevec,
drops=dropout))

}
Expand Down
2 changes: 1 addition & 1 deletion README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -128,7 +128,7 @@ Note that the error "maximal number of DLLs reached..." might occur due to the l
Starting with R version 3.4.0, one can set the environmental variable 'R_MAX_NUM_DLLS' to a higher number. See `?Startup()` for more information. I recommend to increase the maximum number of DLLs that can be loaded to 500. The environmental variable R\_MAX\_NUM\_DLLS can be set in R\_HOME/etc/Renviron prior to starting R. For that locate the Renviron file and add the following line: R\_MAX\_NUM\_DLLS=xy where xy is the number of DLLs.
On my Ubuntu machine, the Renviron file is in /usr/lib/R/etc/ and I can set it to 500.

In addition, the user limits for open files (unix: ulimit) might have to be set to a higher number to accomodate the increase in DLLs. Please check out the help pages for [MACs](https://gist.github.com/tombigel/d503800a282fcadbee14b537735d202c) and [Linux](https://glassonionblog.wordpress.com/2013/01/27/increase-ulimit-and-file-descriptors-limit/) for guidance.
In addition, the user limits for open files (unix: ulimit) might have to be set to a higher number to accommodate the increase in DLLs. Please check out the help pages for [MACs](https://gist.github.com/tombigel/d503800a282fcadbee14b537735d202c) and [Linux](https://glassonionblog.wordpress.com/2013/01/27/increase-ulimit-and-file-descriptors-limit/) for guidance.

## :scroll: Citation

Expand Down
Loading

0 comments on commit efe3506

Please sign in to comment.