Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setlevels #4197

Closed
wants to merge 14 commits into from
Closed

Setlevels #4197

wants to merge 14 commits into from

Conversation

2005m
Copy link
Contributor

@2005m 2005m commented Jan 24, 2020

Closes #2219

@codecov
Copy link

codecov bot commented Jan 24, 2020

Codecov Report

Merging #4197 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4197      +/-   ##
==========================================
+ Coverage   99.61%   99.61%   +<.01%     
==========================================
  Files          72       72              
  Lines       13876    13952      +76     
==========================================
+ Hits        13822    13898      +76     
  Misses         54       54
Impacted Files Coverage Δ
R/wrappers.R 100% <ø> (ø) ⬆️
R/merge.R 100% <ø> (ø) ⬆️
src/subset.c 100% <ø> (ø) ⬆️
R/as.data.table.R 100% <ø> (ø) ⬆️
src/assign.c 100% <ø> (ø) ⬆️
src/init.c 100% <ø> (ø) ⬆️
src/fwriteR.c 100% <100%> (ø) ⬆️
src/dogroups.c 100% <100%> (ø) ⬆️
src/frank.c 100% <100%> (ø) ⬆️
R/frank.R 100% <100%> (ø) ⬆️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c661b4...55e6111. Read the comment docs.

@2005m 2005m added this to the 1.12.9 milestone Jan 24, 2020
@2005m 2005m requested review from jangorecki and mattdowle January 24, 2020 16:00
man/setattr.Rd Outdated
@@ -24,6 +26,8 @@ setnames(x,old,new,skip_absent=FALSE)

\code{setattr} is a more general function that allows setting of any attribute to an object \emph{by reference}.

\code{setlevels} is a function to set the levels of factor vector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this isn't needed as Details. If anything in Details, we should note the sort of consistency checks that are being run that make it advantageous vs setattr

@@ -16770,6 +16770,8 @@ test(2132.2, fifelse(TRUE, 1, s2), error = "S4 class objects (except nanot
test(2132.3, fcase(TRUE, s1, FALSE, s2), error = "S4 class objects (except nanotime) are not supported. Please see https://github.com/Rdatatable/data.table/issues/4131.")
rm(s1, s2, class2132)

# setlevels, #2219
test(2133.1, setlevels(as.factor(c("A", "A", "B", "B", "B", "C")), c("X", "Y", "Z")), as.factor(c("X", "X", "Y", "Y", "Y", "Z")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue was written specifically about how setlevels improves vs setattr because of the consistency checks; can we add some tests of these here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I look in more details at setattrib and setlevels I am a bit confused by these functions. Maybe you can shed some light on the following code :

x = as.factor(c("A","A","B","B","B","C"))
print(setlevels(x, c("X","X","Y","Z")))
# X X X X X Y
# Levels: X Y Z

That is not correct, is it? setattr behaves the same as it is built on top of setlevels.

Do I need to implement some checks at C level? I am happy to do it but I am bit confused now. I think I went a bit too fast!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed that looks wrong!

I think my understanding of the motivation for setlevels was to error for that example, since X is duplicated. If not an error, unique(value) should be used.

Copy link
Contributor Author

@2005m 2005m Jan 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that is not a problem, I can make it error :) . Anything else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, that is what I have with setattr as well:

print(setattr(x, "levels", c("X","X","Y","Z")))
# [1] X X X X X Y
# Levels: X Y Z

I can correct everything if you tell me exactly what each function is suppose to do. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some changes, let me know if anything else is needed. Thanks.

@2005m 2005m added the WIP label Jan 24, 2020
@2005m 2005m removed the WIP label Jan 26, 2020
R/data.table.R Outdated
@@ -2457,6 +2457,14 @@ setattr = function(x,name,value) {
invisible(x)
}

setlevels = function(x, value) {
if (anyDuplicated(value) == 0) {
Copy link
Member

@mattdowle mattdowle Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have an internal any_duplicated (or completely different name) which is specialized for character strings, and faster than base. If so, this anyDuplicated call should be moved down to C to use the internal one.

Excited to see setlevels() finally be exported :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely re-wrote setlevels please let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the WIP except if you think it needs further changes?

@mattdowle
Copy link
Member

mattdowle commented Feb 18, 2020

How about making the setlevels interface just like setnames; e.g. with old and new? That way, the common task of renaming one or two levels could be facilitated too. setlevels could reuse the existing code in setnames, either by calling setnames, or maybe the internals of setnames would need to be extracted to a helper function which both setlevels and setnames would call.

@mattdowle mattdowle added the WIP label Feb 18, 2020
@2005m
Copy link
Contributor Author

2005m commented Feb 18, 2020

I like your proposal but I am worried that we will break the setattr function which rely on Csetlevels. I think I am going to write a seperate function for you and rename the current one. That will be cleaner.

@2005m 2005m removed the WIP label Feb 28, 2020
Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add tests for cases when

  • there are factor levels for which data are not present factor("a", levels=c("a","b"))
  • factor levels are not ordered factor("a", levels=c("b","a"))
  • there are NAs in input factor(c("a",NA), levels=c("b","a"))
  • there are NAs in input and in factor levels factor(c("a",NA), levels=c("b","a",NA), exclude=NULL)
  • no NAs in input but in factor levels factor(c("a"), levels=c("b","a",NA), exclude=NULL)
  • as well for setting those NAs levels

for (int64_t i=0; i<nx; ++i) {
if (STRING_ELT(xchar, i) == STRING_ELT(old_lvl, j)) {
SET_STRING_ELT(xchar, i, STRING_ELT(new_lvl, j));
goto label;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't we avoid goto in favor of breaking the loop?

Copy link
Contributor Author

@2005m 2005m Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could yes. I just fund it cleaner to use goto. break use goto under the hood I think. If you are not confortable with goto i can change it. No issue. I use goto to jump the error message mainly. I would need to create a variable to avoid it otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW in our source we're only using goto in fread.c so far. not sure if Matt has a stylistic judgment on it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error("Element '%s' of 'old' does not exist in 'x'.", CHAR(STRING_ELT(old_lvl, j)));
label:;
}
SEXP ans = PROTECT(duplicate(x)); nprotect++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattdowle do we need copyAsPlain here? x is INTSXP so eventually could be a compact 1:n altrep sequence

@2005m
Copy link
Contributor Author

2005m commented Feb 29, 2020

Thanks @jangorecki for the feedback. Adding back the WIP flag as I won't be able to look at the above in the coming days.

@2005m 2005m added the WIP label Feb 29, 2020
Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be much easier to read those tests if they would have a comment at the end line what they mean to test. Just mentioning for future, no need to add now.
If those new tests has been added based on the expected output, and not the current output of setlevels, then it should be good.

@2005m 2005m removed the WIP label Mar 3, 2020
@2005m 2005m closed this Apr 8, 2020
@2005m 2005m deleted the setlevels branch April 8, 2020 13:20
@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose setlevels() and document
4 participants