Setlevels #4197
Codecov Report
@@ Coverage Diff @@
## master #4197 +/- ##
==========================================
+ Coverage 99.61% 99.61% +<.01%
==========================================
Files 72 72
Lines 13876 13952 +76
==========================================
+ Hits 13822 13898 +76
Misses 54 54
man/setattr.Rd
Outdated
@@ -24,6 +26,8 @@ setnames(x,old,new,skip_absent=FALSE)

\code{setattr} is a more general function that allows setting of any attribute to an object \emph{by reference}.

\code{setlevels} is a function to set the levels of factor vector.
I feel this isn't needed as `Details`. If anything in Details, we should note the sort of consistency checks that are being run that make it advantageous vs `setattr`.
inst/tests/tests.Rraw
Outdated
@@ -16770,6 +16770,8 @@ test(2132.2, fifelse(TRUE, 1, s2), error = "S4 class objects (except nanot
test(2132.3, fcase(TRUE, s1, FALSE, s2), error = "S4 class objects (except nanotime) are not supported. Please see https://github.com/Rdatatable/data.table/issues/4131.")
rm(s1, s2, class2132)

# setlevels, #2219
test(2133.1, setlevels(as.factor(c("A", "A", "B", "B", "B", "C")), c("X", "Y", "Z")), as.factor(c("X", "X", "Y", "Y", "Y", "Z")))
The issue was written specifically about how `setlevels` improves vs `setattr` because of the consistency checks; can we add some tests of these here?
Now that I look in more detail at `setattrib` and `setlevels`, I am a bit confused by these functions. Maybe you can shed some light on the following code:
x = as.factor(c("A","A","B","B","B","C"))
print(setlevels(x, c("X","X","Y","Z")))
# X X X X X Y
# Levels: X Y Z
That is not correct, is it? `setattr` behaves the same, as it is built on top of `setlevels`.
Do I need to implement some checks at C level? I am happy to do it but I am a bit confused now. I think I went a bit too fast!
Indeed that looks wrong!
I think my understanding of the motivation for `setlevels` was to error for that example, since `X` is duplicated. If not an error, `unique(value)` should be used.
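The duplicate check described in this comment can be sketched in plain C. This is only an illustration under stated assumptions, not the PR's code: `first_duplicate` is a hypothetical helper, it works on ordinary C strings rather than R's `SEXP` character vectors, and it uses a simple quadratic scan instead of any internal data.table routine. The return convention mirrors R's `anyDuplicated` (1-based position of the first repeat, 0 if none).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of data.table): returns the 1-based
   position of the first element that repeats an earlier one, or 0 when
   all elements are distinct. */
size_t first_duplicate(const char *const *levels, size_t n) {
  assert(levels != NULL || n == 0);
  for (size_t i = 1; i < n; ++i) {
    for (size_t j = 0; j < i; ++j) {
      /* compare string contents; R's C code can compare CHARSXP pointers instead */
      if (strcmp(levels[i], levels[j]) == 0) return i + 1;
    }
  }
  return 0;
}
```

A caller would treat a non-zero result as the error case discussed here, for example when the proposed levels are `"X", "X", "Y", "Z"`.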
Well that is not a problem, I can make it error :) . Anything else?
Just to be clear, that is what I have with `setattr` as well:
print(setattr(x, "levels", c("X","X","Y","Z")))
# [1] X X X X X Y
# Levels: X Y Z
I can correct everything if you tell me exactly what each function is supposed to do. Thanks
Made some changes, let me know if anything else is needed. Thanks.
R/data.table.R
Outdated
@@ -2457,6 +2457,14 @@ setattr = function(x,name,value) {
  invisible(x)
}

setlevels = function(x, value) {
  if (anyDuplicated(value) == 0) {
I think we have an internal any_duplicated (or completely different name) which is specialized for character strings, and faster than base. If so, this anyDuplicated call should be moved down to C to use the internal one.
Excited to see setlevels() finally be exported :-)
I completely re-wrote `setlevels`; please let me know what you think.
Removing the WIP flag, unless you think it needs further changes?
How about making the |
I like your proposal but I am worried that we will break the |
please add tests for cases when
- there are factor levels for which data are not present
factor("a", levels=c("a","b"))
- factor levels are not ordered
factor("a", levels=c("b","a"))
- there are NAs in input
factor(c("a",NA), levels=c("b","a"))
- there are NAs in input and in factor levels
factor(c("a",NA), levels=c("b","a",NA), exclude=NULL)
- no NAs in input but in factor levels
factor(c("a"), levels=c("b","a",NA), exclude=NULL)
- as well as tests for setting those NA levels
for (int64_t i=0; i<nx; ++i) {
  if (STRING_ELT(xchar, i) == STRING_ELT(old_lvl, j)) {
    SET_STRING_ELT(xchar, i, STRING_ELT(new_lvl, j));
    goto label;
Couldn't we avoid `goto` in favor of breaking the loop?
We could, yes. I just found it cleaner to use `goto`. `break` uses `goto` under the hood, I think. If you are not comfortable with `goto`, I can change it. No issue. I use `goto` mainly to jump over the error message. I would need to create a variable to avoid it otherwise.
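The alternative raised in this exchange, `break` plus an extra variable in place of `goto`, might look like the sketch below. It is a stand-alone illustration using plain C strings, not the PR's implementation: `rename_levels` and its signature are hypothetical, it compares string contents with `strcmp` rather than comparing `STRING_ELT` pointers, and it returns an index where the PR calls `error()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the loop under review: for each old_lvl[j],
   replace the first matching entry of x with new_lvl[j].
   Returns -1 on success, or the index j of the first old level that
   does not occur in x. */
int rename_levels(const char **x, size_t nx,
                  const char *const *old_lvl,
                  const char *const *new_lvl, size_t nold) {
  assert(x != NULL && old_lvl != NULL && new_lvl != NULL);
  for (size_t j = 0; j < nold; ++j) {
    bool found = false;              /* the extra variable that goto avoids */
    for (size_t i = 0; i < nx; ++i) {
      if (strcmp(x[i], old_lvl[j]) == 0) {
        x[i] = new_lvl[j];
        found = true;
        break;                       /* leaves the inner loop only */
      }
    }
    if (!found) return (int)j;       /* the PR raises an R error at this point */
  }
  return -1;
}
```

The trade-off is the one stated in the comment: `break` alone cannot skip the error path after the inner loop, so a flag such as `found` is needed, whereas `goto label` jumps past the error call directly.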
FWIW in our source we're only using `goto` in `fread.c` so far. Not sure if Matt has a stylistic judgment on it.
error("Element '%s' of 'old' does not exist in 'x'.", CHAR(STRING_ELT(old_lvl, j)));
label:;
}
SEXP ans = PROTECT(duplicate(x)); nprotect++;
@mattdowle do we need `copyAsPlain` here? `x` is INTSXP, so eventually it could be a compact 1:n altrep sequence.
Thanks @jangorecki for the feedback. Adding back the WIP flag as I won't be able to look at the above in the coming days.
It would be much easier to read those tests if each had a comment at the end of the line saying what it is meant to test. Just mentioning for the future, no need to add now.
If those new tests have been added based on the expected output, and not the current output of `setlevels`, then it should be good.
Closes #2219