update matrix simulation + change lfc wording

bvieth · Nov 30, 2020 · efe3506 · efe3506
1 parent 2a63261
commit efe3506
Show file tree

Hide file tree

Showing 45 changed files with 2,432 additions and 2,177 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -2,29 +2,30 @@ Package: powsimR
 Type: Package
 Title: Power Simulations for RNA-sequencing
 Description: Recent development of very sensitive RNA-seq protocols, such as Smart-seq2 and CEL-seq allows
-             transcriptional profiling at single-cell resolution and droplet devices make single cell transcriptomics
-             high-throughput, allowing to characterize thousands or even millions of single cells. In powsimR, we have
-             implemented a flexible tool to assess power and sample size requirements for differential expression (DE)
-             analysis of single cell and bulk RNA-seq experiments. For our read count simulations, we (1) reliably
-             model the mean, dispersion and dropout distributions as well as the relationship between those factors
-             from the data. (2) Simulate read counts from the empirical mean-variance and dropout relations, while
-             offering flexible choices of the number of differentially expressed genes, effect sizes and DE testing
-             method. (3) Finally, we evaluate the power over various sample sizes. The number of replicates required
-             to achieve the desired statistical power is mainly determined by technical noise and biological
-             variability and both are considerably larger if the biological replicates are single cells. powsimR can
-             not only estimate sample sizes necessary to achieve a certain power, but also informs about the power to
-             detect DE in a data set at hand. We believe that this type of posterior analysis will become more and
-             more important, if results from different studies are to be compared. Often enough researchers are left
-             to wonder why there is a lack of overlap in DE-genes across similar experiments. PowsimR will allow the
-             researcher to distinguish between actual discrepancies and incongruities due to lack of power.
-Version: 1.2.3
+             transcriptional profiling at single-cell resolution and droplet devices make single cell
+             transcriptomics high-throughput, allowing to characterize thousands or even millions of single cells.
+             In powsimR, we have implemented a flexible tool to assess power and sample size requirements for
+             differential expression (DE) analysis of single cell and bulk RNA-seq experiments. For our read count
+             simulations, we (1) reliably model the mean, dispersion and dropout distributions as well as the
+             relationship between those factors from the data. (2) Simulate read counts from the empirical
+             mean-variance and dropout relations, while offering flexible choices of the number of differentially
+             expressed genes, effect sizes and DE testing method. (3) Finally, we evaluate the power over various
+             sample sizes. The number of replicates required to achieve the desired statistical power is mainly
+             determined by technical noise and biological variability and both are considerably larger if the
+             biological replicates are single cells. powsimR can not only estimate sample sizes necessary to achieve
+             a certain power, but also informs about the power to detect DE in a data set at hand. We believe that
+             this type of posterior analysis will become more and more important, if results from different studies
+             are to be compared. Often enough researchers are left to wonder why there is a lack of overlap in
+             DE-genes across similar experiments. PowsimR will allow the researcher to distinguish between actual
+             discrepancies and incongruities due to lack of power.
+Version: 1.2.4
 Imports: bayNorm, baySeq, BiocGenerics, BiocParallel, broom, BPSC, cobs, cowplot, data.table, DECENT, DESeq2,
              doParallel, dplyr, DrImpute, EBSeq, edgeR, fastICA, fitdistrplus, foreach, future, ggplot2, ggpubr,
              ggstance, grDevices, grid, Hmisc, iCOBRA, IHW, kernlab, limma, Linnorm, magrittr, MASS, MAST, Matrix,
-             matrixStats, methods, minpack.lm, moments, monocle, msir, NBPSeq, NOISeq, nonnest2, parallel, penalized,
-             plotrix, plyr, pscl, qvalue, reshape2, rlang, Rmagic, ROTS, rsvd, Rtsne, RUVSeq, S4Vectors, SAVER,
-             scales, scater, scDD, scde, SCnorm, scone, scran, sctransform, Seurat, SingleCellExperiment, stats,
-             SummarizedExperiment, tibble, tidyr, truncnorm, utils, VGAM, ZIM, zinbwave, zingeR, zoo
+             matrixStats, methods, minpack.lm, moments, monocle, msir, NBPSeq, NOISeq, nonnest2, parallel,
+             penalized, plotrix, plyr, pscl, qvalue, reshape2, rlang, Rmagic, ROTS, rsvd, Rtsne, RUVSeq, S4Vectors,
+             SAVER, scales, scater, scDD, scde, SCnorm, scone, scran, sctransform, Seurat, SingleCellExperiment,
+             stats, SummarizedExperiment, tibble, tidyr, truncnorm, utils, VGAM, ZIM, zinbwave, zingeR, zoo
 Remotes: nghiavtr/BPSC, cz-ye/DECENT, mohuangx/SAVER, rhondabacher/SCnorm, statOmics/zingeR, bioc::bayNorm,
              bioc::baySeq, bioc::BiocGenerics, bioc::BiocParallel, bioc::DESeq2, bioc::EBSeq, bioc::edgeR,
              bioc::iCOBRA, bioc::IHW, bioc::limma, bioc::Linnorm, bioc::MAST, bioc::monocle, bioc::NOISeq,
@@ -38,7 +39,7 @@ VignetteBuilder: knitr
 License: GPL
 NeedsCompilation: no
 Author: Beate Vieth
-Date: 2020-06-30
+Date: 2020-11-30
 BugReports: https://github.com/bvieth/powsimR/issues
 URL: https://github.com/bvieth/powsimR
 Maintainer: Beate Vieth <[email protected]>

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,6 @@
+# Version 1.2.4 (2020-11-30)
+
+* Change description of log fold change.  Change settings in simulation to matrix multiplication.
 
 # Version 1.2.3 (2020-6-30)
 

diff --git a/R/Setup.R b/R/Setup.R
@@ -5,10 +5,10 @@
 #' @title Setup options for RNA-seq count simulations
 #' @description This function generates the settings needed for \code{\link{simulateDE}}.
 #' Firstly a set of differential expressed gene IDs with
-#' associated fold changes for a given number of genes, simulations and
+#' associated log fold changes for a given number of genes, simulations and
 #' fraction of DE genes is generated. There are also a number of options relating
 #'  to count simulations such as downsampling.
-#'  Secondly, the estimated parameters for count simulations of genes (and spikes) are added.
+#'  Secondly, the estimated parameters for count simulations of genes (and spikes) as well as samples are added.
 #' @usage Setup(ngenes=NULL, nsims=25,
 #' p.DE = 0.1, pLFC = 1, p.G = 1,
 #' p.B=NULL, bLFC=NULL, bPattern="uncorrelated",
@@ -24,17 +24,18 @@
 #' @param p.DE Numeric vector between 0 and 1 representing
 #' the percentage of genes being differentially expressed due to phenotype,
 #' i.e. biological signal. Default is \code{0.1}.
+#' If no differential expression is desired, then set \code{p.DE=0}.
 #' @param pLFC The log phenotypic fold change for DE genes. This can be:
 #' (1) a constant, e.g. 2;
 #' (2) a vector of values with length being number of DE genes.
 #' If the input is a vector and the length is not the number of DE genes,
 #' it will be sampled with replacement to generate log fold changes;
 #' (3) a function that takes an integer n, and generates a vector of length n,
 #' e.g. function(x) rnorm(x, mean=0, sd=1.5).
-#' Default is \code{1}.
-#' Please note that the fold change should be on \code{\link[base]{log2}} scale!
+#' Default is \code{1}, i.e. a log2 fold change of 1 or fold change of 2.
+#' Please note that the log fold change should be on \code{\link[base]{log2}} scale!
 #' @param p.G Numeric vector indicating the proportion of replicates per group
-#' that express the phenotypic fold change. Default is \code{1}, this means all show the expressiond difference.
+#' that express the phenotypic fold change. Default is \code{1}, this means all show the expression difference.
 #' For example, if \code{0.5} and \code{n1 = 10} and \code{n2 = 8},
 #' then only 5 replicates in group 1 and 4 replicates in group 2 express the phenotypic fold change.
 #' @param p.B Numeric vector between 0 and 1 representing the percentage of genes
@@ -49,7 +50,7 @@
 #'  e.g. function(x) rnorm(x, mean=0, sd=1.5).
 #' Note that the simulations of only two batches is implemented.
 #' Default is \code{NULL}, i.e. no batch effect.
-#' Please note that the fold change should be on \code{\link[base]{log2}} scale!
+#' Please note that the log fold change should be on \code{\link[base]{log2}} scale!
 #' @param bPattern Character vector for batch effect pattern if \code{p.B} is non-null.
 #' Possible options include:
 #' "uncorrelated", "orthogonal" and " correlated".
@@ -118,7 +119,7 @@
 #'                                RNAseq = 'singlecell', Protocol = 'Read',
 #'                                Distribution = 'ZINB', Normalisation = "scran",
 #'                                GeneFilter = 0.1, SampleFilter = 3,
-#'                               sigma = 1.96, NCores = NULL, verbose = TRUE)
+#'                                sigma = 1.96, NCores = NULL, verbose = TRUE)
 #' # estimate spike parameters
 #' data("SmartSeq2_SpikeIns_Read_Counts")
 #' data("SmartSeq2_SpikeInfo")

diff --git a/R/utils_sim.R b/R/utils_sim.R
@@ -129,6 +129,7 @@
 
   # define NB params
   means = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$means
+  means = means[means>0]
   sizes = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$size
   meansizefit = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meansizefit
 
@@ -143,18 +144,19 @@
   true.means = means[index]
 
   # estimate size parameter associated with true mean values
-  lmu = log2(true.means+1)
+  lmu = log2(true.means)
   predsize.mean = suppressWarnings(approx(meansizefit$x,
                                           meansizefit$y,
                                           xout = lmu, rule=2)$y)
   predsize.sd = suppressWarnings(approx(meansizefit$x,
                                         meansizefit$sd,
                                         xout = lmu, rule=2)$y)
-  sizevec = truncnorm::rtruncnorm(n = length(lmu),
+  lsizevec = truncnorm::rtruncnorm(n = length(lmu),
                                   a = min(log2(sizes), na.rm = TRUE),
                                   b = max(log2(sizes), na.rm = TRUE),
                                   mean = predsize.mean,
                                   sd = predsize.sd)
+  sizevec = 2^lsizevec
 
   # size factor
   if (simOptions$SimSetup$LibSize == "equal") {
@@ -180,14 +182,15 @@
   ind = !apply(modelmatrix, 2, function(x) { all(x == 1) })
   mod = cbind(modelmatrix[, ind])
   beta = cbind(coef.dat[, ind])
-  mumat = log2(effective.means+1) + beta %*% t(mod)
-  mumat[mumat < 0] = min(log2(effective.means+1))
+  coef.mat = 2^beta %*% t(mod)
+  mumat = effective.means * coef.mat
+  # mumat[mumat < 0] = min(log2(effective.means+1))
 
   # result count matrix
   counts = matrix(
     stats::rnbinom(nsamples * ngenes,
-                   mu = 2^mumat-1,
-                   size = 2^sizevec),
+                   mu = mumat,
+                   size = sizevec),
     ncol = nsamples,
     nrow = ngenes,
     dimnames = list(paste0(rownames(mumat),"_", seq_len(ngenes)),
@@ -200,8 +203,8 @@
   return(list(counts=counts,
               sf=all.facs,
               mus=true.means,
-              disps=1/(2^sizevec),
-              sizes=2^sizevec,
+              disps=1/sizevec,
+              sizes=sizevec,
               drops=dropout))
 }
 
@@ -221,6 +224,7 @@
 
   # define ZINB params
   pos.means = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$pos.means
+  pos.means = pos.means[pos.means>0]
   pos.sizes = simOptions$estParamRes$Parameters[[simOptions$DESetup$Draw$MoM]]$pos.size
   meansizefit = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meansizefit
 
@@ -235,18 +239,19 @@
   true.means = pos.means[index]
 
   # estimate size parameter associated with true mean values
-  lmu = log2(true.means+1)
+  lmu = log2(true.means)
   predsize.mean = suppressWarnings(approx(meansizefit$x,
                                           meansizefit$y,
                                           xout = lmu, rule=2)$y)
   predsize.sd = suppressWarnings(approx(meansizefit$x,
                                         meansizefit$sd,
                                         xout = lmu, rule=2)$y)
-  sizevec = truncnorm::rtruncnorm(n = length(lmu),
+  lsizevec = truncnorm::rtruncnorm(n = length(lmu),
                                   a = min(log2(pos.sizes), na.rm = TRUE),
                                   b = max(log2(pos.sizes), na.rm = TRUE),
                                   mean = predsize.mean,
                                   sd = predsize.sd)
+  sizevec = 2^lsizevec
 
   # size factor
   if (simOptions$SimSetup$LibSize == "equal") {
@@ -272,8 +277,9 @@
   ind = !apply(modelmatrix, 2, function(x) { all(x == 1) })
   mod = cbind(modelmatrix[, ind])
   beta = cbind(coef.dat[, ind])
-  mumat = log2(effective.means+1) + beta %*% t(mod)
-  mumat[mumat < 0] = min(log2(effective.means+1))
+  coef.mat = 2^beta %*% t(mod)
+  mumat = effective.means * coef.mat
+  # mumat[mumat < 0] = min(log2(effective.means+1))
 
   meang0fit.nonamplified = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meang0fit.nonamplified
   meang0fit.amplified = simOptions$estParamRes$Fit[[simOptions$DESetup$Draw$Fit]]$meang0fit.amplified
@@ -368,8 +374,8 @@
   # make the count matrix
     counts.nb = matrix(
       stats::rnbinom(nsamples * ngenes,
-                     mu = 2^mumat-1,
-                     size = 2^sizevec),
+                     mu = mumat,
+                     size = sizevec),
       ncol = nsamples,
       nrow = ngenes)
 
@@ -403,7 +409,8 @@
   return(list(counts=counts,
               sf=all.facs,
               mus=true.means,
-              disps=sizevec,
+              disps=1/sizevec,
+              sizes=sizevec,
               drops=dropout))
 
 }

diff --git a/README.Rmd b/README.Rmd
@@ -128,7 +128,7 @@ Note that the error "maximal number of DLLs reached..." might occur due to the l
 Starting with R version 3.4.0, one can set the environmental variable 'R_MAX_NUM_DLLS' to a higher number. See `?Startup()` for more information. I recommend to increase the maximum number of DLLs that can be loaded to 500. The environmental variable R\_MAX\_NUM\_DLLS can be set in R\_HOME/etc/Renviron prior to starting R. For that locate the Renviron file and add the following line: R\_MAX\_NUM\_DLLS=xy where xy is the number of DLLs.
 On my Ubuntu machine, the Renviron file is in /usr/lib/R/etc/ and I can set it to 500.
 
-In addition, the user limits for open files (unix: ulimit) might have to be set to a higher number to accomodate the increase in DLLs. Please check out the help pages for [MACs](https://gist.github.com/tombigel/d503800a282fcadbee14b537735d202c) and [Linux](https://glassonionblog.wordpress.com/2013/01/27/increase-ulimit-and-file-descriptors-limit/) for guidance.
+In addition, the user limits for open files (unix: ulimit) might have to be set to a higher number to accommodate the increase in DLLs. Please check out the help pages for [MACs](https://gist.github.com/tombigel/d503800a282fcadbee14b537735d202c) and [Linux](https://glassonionblog.wordpress.com/2013/01/27/increase-ulimit-and-file-descriptors-limit/) for guidance.
 
 ## :scroll: Citation