set changes original data (as.data.table on DESeq/GRanges objects) #3230

Closed
NikdAK opened this issue Dec 17, 2018 · 2 comments

@NikdAK

NikdAK commented Dec 17, 2018

I ran into a real pitfall (bug) while tracking down wrong results in my analysis.
Modifying a data.table object by reference also alters the original data it was created from with as.data.table. This happens for objects like DESeqResults and GRanges.

According to the vignette:

as.data.table methods returns a copy of original data

But this is apparently not true.
Here is a minimal example:

library(data.table)
library(DESeq2)
dds <- makeExampleDESeqDataSet(betaSD=1) #some data
dds <- dds[1:3] #minimalize the example

rownames(dds) <- c("C","B","A") #random order of gene names
dds <- DESeq(dds)
res <- results(dds)

DT <- as.data.table(res) #this should create a copy
DT[, name := rownames(res)] #add the gene names as a column for clarity


DT[,.(name, baseMean,padj)] #print reduced versions
#   name   baseMean       padj
# 1:    C  11.594847 0.01511568
# 2:    B 605.910995 0.99999893
# 3:    A   3.010566 0.99999893

res[,c("baseMean", "padj")]
# DataFrame with 3 rows and 2 columns
#          baseMean               padj
#         <numeric>          <numeric>
# C   11.59484650746 0.0151156828069631
# B 605.910995247964  0.999998925569034
# A 3.01056578221606  0.999998925569034

Now using a set function on DT also changes the values in the original res object.

setkey(DT, "name")

DT[,.(name, baseMean,padj)]
#    name   baseMean       padj
# 1:    A   3.010566 0.99999893
# 2:    B 605.910995 0.99999893
# 3:    C  11.594847 0.01511568

res[,c("baseMean", "padj")]
# DataFrame with 3 rows and 2 columns
#           baseMean               padj
#          <numeric>          <numeric>
# C 3.01056578221606  0.999998925569034
# B 605.910995247964  0.999998925569034
# A   11.59484650746 0.0151156828069631

Note that the values for genes A and C are swapped. This is probably because setkey sorts the column values in place, while the rownames of res stay in their original order.
As a result, any analysis that uses the original res will be completely wrong.

I am aware that there are fixes for this like:

DT <- copy(as.data.table(res))
DT <- data.table(as.data.frame(res))

but my main concern is that this behaviour is not obvious and is very dangerous for downstream work.

This has been reported before in some form, but nothing has changed:
data.table issue
GRanges copy

I love using data.table, it is simply amazing.
Hopefully you can address this issue.

Thank you!

@jangorecki
Member

Thank you for reporting. In future reports, please include the calls that attach the required libraries in your minimal reproducible example. From a brief investigation, it looks like the as.data.frame method, when run on the res object, does not copy the columns, which we had assumed it would.

sapply(res, address)==sapply(as.data.frame(res), address)
#      baseMean log2FoldChange          lfcSE           stat         pvalue           padj 
#          TRUE           TRUE           TRUE           TRUE           TRUE           TRUE 

Will fix
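
The address comparison above can be wrapped into a small check to run before calling any set* function on a converted object. This is only a sketch: `shares_memory` is a hypothetical helper, not part of data.table, and the sharing is simulated with `setDT()` on a bare list so that the example runs without DESeq2. It assumes that `as.list()` on a data.frame and `setDT()` on a list both reuse the column vectors, which is what I would expect but may vary between versions.

```r
library(data.table)

# Hypothetical helper (not part of data.table): for every column name that
# `src` and `dt` have in common, report whether both objects point at the
# same vector in memory.
shares_memory <- function(src, dt) {
  common <- intersect(names(src), names(dt))
  vapply(common,
         function(nm) address(src[[nm]]) == address(dt[[nm]]),
         logical(1))
}

# Stand-in for `res`: a plain data.frame with gene names in arbitrary order.
df <- data.frame(name = c("C", "B", "A"),
                 baseMean = c(11.6, 605.9, 3.0))

# Simulate the situation from this issue: a data.table whose columns were
# never copied from the source object.
shared <- setDT(as.list(df))

# Workaround from the report: copy() allocates new column vectors.
safe <- copy(shared)

print(shares_memory(df, shared))
print(shares_memory(df, safe))
```

If any entry of the first result is TRUE, by-reference operations such as setkey or := on existing columns could also modify the source object, so taking a `copy()` first seems the safer choice.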

@jangorecki
Copy link
Member

Re-opening, as the PR with the fix is not yet merged.
