Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setDT on DF will make data.table class as leading, closes #1078 #1089

Closed
wants to merge 1 commit into from

Conversation

jangorecki
Copy link
Member

It simply put data.table class at first in class vector.
Previous behavior was a bit different, putting data.table just before data.frame, so this should be kept in mind after that change.

@gsee
Copy link

gsee commented Mar 20, 2015

Why is this more desirable? If you want data.table methods to override methods for classes other than data.frame, then why not just remove all the other classes except data.table and data.frame. Otherwise, an object will look like it is a "myClass" object, but e.g. print.myClass will not get dispatched.

@jangorecki
Copy link
Member Author

@gsee yes I would prefer to remove tbl_df and tbl classes on setDT, but this would requires some bigger impact analysis than current PR. Lets see what Arun and Matt thinks of setDT to always results c("data.table","data.frame") class object. For me it is the best option.

@arunsrinivasan
Copy link
Member

I don't think we should be masking other classes (and retaining them). I agree with @gsee. It's due to another FR long back, #64, that we preserve class, and I think this is the right behaviour (even though the FR is about retaining other classes that inherits from data.table).

I don't understand why dplyr class inheritance for data.frames and data.tables is the way it is.

require(data.table)
require(dplyr)
dt = data.table(x=1:5, y=6:10)
tbl = tbl_dt(dt)
dt
#    x  y
# 1: 1  6
# 2: 2  7
# 3: 3  8
# 4: 4  9
# 5: 5 10
class(dt)
# [1] "data.table" "data.frame"
class(tbl)
# [1] "tbl_dt"     "tbl"        "data.table" "data.frame"

tbl already inherits from data.frame or data.table. Then why does it again have a class tbl_df or tbl_dtinheriting from it?

I've no idea how to fix this without removing the old feature. If anyone else has, feel free to issue a PR or share it here.

@gsee
Copy link

gsee commented Mar 21, 2015

Interesting. I think I like the current behavior of setDT() because if I created an object that is really a data.frame underneath before I knew about the data.table package and now I want to turn it into a data.table, it won't remove my class and all the functionality written for that class.

However, I'm not sure if I like the behavior from FR #64.

Suppose I need to create a normalization layer -- I'm getting data from lots of different providers and each of them gives it to me in a different format. So, I create an object that has exactly the information that is common among all of them that I need for my business case. I'll class that object as c("myBusinessClass", "data.table", "data.frame"). I make functions that fetch the data from providers and normalize it; i.e. extract the information required and put it into myBusinessClass objects. Now, I can write functions that take a myBusinessClass object as input. The first thing they'll do is something to the effect of stopifnot(inherits(x, "myBusinessClass")). But, if the object inherits "myBusinessClass", I expect that it has all the columns that every "myBusinessClass" object must have.

FR #64 broke that because a subsetted "myBusinessClass" object is still a "myBusinessClass" object. In order to work around FR #64, I'd need to write more code... maybe an is.myBusinessClass function that checks to see if it inherts "myBusinessClass" AND that it contains all the columns required for a "myBusinessClass" object. I suppose if it's important for my function's input to have specific columns, it would be a good idea for the function to check for that explicity anyway..

@gsee
Copy link

gsee commented Mar 21, 2015

I guess the behavior that came out of FR #64 is consistent with base R.

> class(BOD) <- c("myClass", "data.frame")
> class(BOD[, 1, drop=FALSE])
[1] "myClass"    "data.frame"

@jangorecki
Copy link
Member Author

Using custom class for DT was my question too! in #839
And I think it might (not must!) be worth to keep the class.
But also the setDT should (or even must) deal with the cases as the #1078 which currently requires setDT(as.data.frame(x)).

@gsee
Copy link

gsee commented Mar 21, 2015

I think dplyr is doing the wrong thing here. I'm not familiar with dplyr, but it almost seems like you actually want as.tbl(as.data.table(as.data.frame(df2))).

To be most consistent with base R, I think as.data.table should remove classes other than data.table and data.frame and subsetting an object should not remove classes.

Removing classes on coercion is consistent with base R

class(BOD) <- c("myClass", "data.frame")
class(as.data.frame(BOD))
#[1] "data.frame"

@arunsrinivasan
Copy link
Member

@gsee I find it strange to have class definitions based on some columns being present or absent.

I agree that as.data.table() and setDT() stripping off all other classes while converting to data.tables, and retain if input object is already a data.table might be the fix here.

@jangorecki do you think you can implement this in a squashed PR (for both as.data.table() and setDT())? You might have to fix some tests as well, not sure.

@jangorecki
Copy link
Member Author

I've updated my branch, I see it went pretty bad cause I've squashed a commit which was pulling recent changes from data.table master :)
I will fix it together with the fix for 527 test, which gone broken after my change.
How we want to address that?

@arunsrinivasan
Copy link
Member

@jangorecki it's cases like those that I was afraid of. I find it quite useful in retaining classes. I'd suggest putting this on hold for now.

@jangorecki
Copy link
Member Author

Closing PR as it breaks current tests. Still there is issue #1078 so we can address it later somehow.
Current easiest and way is to: setattr(df,"class",c("data.table","data.frame"))

@jangorecki jangorecki closed this Apr 12, 2015
@jangorecki jangorecki deleted the setDT_tbl_df branch August 15, 2015 12:31
@jangorecki jangorecki restored the setDT_tbl_df branch August 15, 2015 12:31
@jangorecki jangorecki deleted the setDT_tbl_df branch August 15, 2015 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants