-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[c-api][python-package][R-package] expose feature num bin #5048
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this PR!
LGTM for the Python part and just one minor suggestion for the test.
Also, could you please open a new issue or rename the old original one requesting to add the same functionality into the R-package?
I'll include the changes for the R package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Very nice test cases.
Co-authored-by: Nikita Titov <[email protected]>
Yeah, please check #4829 as example. New function should be registered here: LightGBM/R-package/src/lightgbm_R.cpp Lines 925 to 926 in 3032b64
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm travelling right now but would really like the chance to review this, now that changes to the R package's public API have been included.
Could you please either reserve the R changes for a separate PR, or otherwise wait 2 days to merge this? I'll be able to provide a review the day after tomorrow.
Thank you @jameslamb, I'd really appreciate your review. I don't think we're in a rush to merge this, should I change this to draft or something similar to avoid merging it? |
I was going to say that adding "WIP" to the title will trigger a blocking CI job, but I guess that job doesn't block merging (I can still see the merge button available). So changing this to draft would be fine. As I kind of implied in my comment, it would also have been fine to split the R changes into a separate PR if you wanted. Not a big deal either way though, I'm back and able to review this morning! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! The changes on the R side look good to me, and I appreciate you making the interface there 1-based to match R's conventions. LightGBM's R package has historically been inconsistent about that (see, for example, discussion in #4970).
I left a few very minor suggestions for the R package.
I also think there's one other thing we should talk about now, on this PR. I believe it'd be useful to have a similar interface, but where you could get the number of bins by feature name.
I think that exposing that would prevent the need for users to write code like this to achieve the same thing when using named data objects like data frames:
feat_indx = [
i, _
for feat_name in df.columns
if feat_name = "my_cool_feature"
][0]
ds.feature_num_bin(feat_indx)
I see three options for that:
- Don't do this at all.
- Allow the input to
feature_num_bin()
in the Python and R packages to acceptUnion[int, str]
- assume that a string passed to this method should be matched exactly in
feature_names
- assume that a string passed to this method should be matched exactly in
- Keep
feature_num_bin()
in the Python and R packages as-is and later add a newDataset
method likefeature_num_bin_by_feature_name()
- this accepts a string, matches it to
feature_names
to get the index (iffeature_names
have been defined), then callsfeature_num_bin()
with that index.
- this accepts a string, matches it to
I only bring this up here in this review because if we want to eventually do Option 2, then I'd recommend changing the argument name to feature
instead of feature_idx
.
I have a mild preference for Option 3, as I think having explicit methods without union types makes it easier to reason about how the code is going to work. But this project already uses union types in so many other places, I'm not opposed to Option 2.
What do you think? also tagging @StrikerRUS @shiyu1994 @guolinke for other opinions.
I like option 2 and I think in the R package it's very straightforward to implement because the feature names get stored as column names. However for the Python package it would involve getting the feature names each time. Would it be in the scope of this PR to add a buffer or something so that they're only computed once? |
Sorry, I'm not proposing also accepting feature names as part of this PR. I don't want to delay this PR by adding on more things. Just thought we should make a decision about how we think we might support the use of feature names, since it could impact the choice of argument name for the new methods introduced in this PR. |
@jameslamb Thanks a lot for the suggestion of adding access by feature name! I'm for option 2 for the consistency with some other APIs, e.g. LightGBM/python-package/lightgbm/plotting.py Lines 184 to 187 in 5631366
LightGBM/python-package/lightgbm/basic.py Lines 3741 to 3744 in 5631366
|
Ok works for me @jmoralez and @StrikerRUS ! Let's go with option 2. @jmoralez for this PR, could you just change the keyword argument name to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me! Thanks very much for the changes and excellent tests.
@jameslamb Sorry for missing the message. I also agree with option 2. |
No problem! Thanks for taking the time to come back and comment! |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This adds the
LGBM_DatasetGetFeatureNumBin
function to the C API to be able to get the number of constructed bins for a feature. I also added this to the Python API as the Dataset methodfeature_num_bin
and added a test for it.Closes #3406.