
[REVIEW] Improvements in feature sampling #4278

Merged

Conversation

vinaydes
Contributor

With this PR, the feature sampling overhead is greatly reduced, especially for wide (thousands of features) datasets. The PR requires some structural changes in RAFT and is therefore marked as WIP.

@vinaydes vinaydes requested a review from a team as a code owner October 12, 2021 07:56
@teju85
Member

teju85 commented Oct 12, 2021

Any idea if we could also add support for feature subsampling with weights in the same PR? Or is it better to keep it separate in another PR?

@vinaydes
Contributor Author

> Any idea if we could also add support for feature subsampling with weights in the same PR? Or is it better to keep it separate in another PR?

Weighted sampling should be possible. I'll see if I can manage to add it to the same PR.
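
Not part of this PR, but for illustration: one standard way to do weighted sampling without replacement is the Efraimidis-Spirakis key trick, which composes naturally with a sort-based sampler. A minimal Python sketch with hypothetical per-feature weights:

```python
import numpy as np

def weighted_feature_sample(weights: np.ndarray, k: int, rng: np.random.Generator):
    """Sample k distinct indices with probability proportional to `weights`
    (Efraimidis-Spirakis): draw u ~ U(0,1) per item and rank by u ** (1/w)."""
    u = rng.random(len(weights))
    keys = u ** (1.0 / weights)       # weights must be > 0
    return np.argsort(keys)[-k:]      # k indices with the largest keys

rng = np.random.default_rng(0)
w = rng.random(10_000) + 0.1          # hypothetical per-feature weights
print(weighted_feature_sample(w, 100, rng))
```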

@venkywonka
Contributor

No regressions in gbm-bench, in either accuracy or perf 🙌🏻. Since most of the datasets in gbm-bench have a small number of columns, speedups are not expected (except epsilon, whose 2000 columns show a slight improvement in perf).

[accuracy comparison with branch-21.12: sampling-comparison-with-main accuracy]

[perf comparison with branch-21.12: sampling-comparison-with-main time]

@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@venkywonka
Contributor

rerun tests

@venkywonka
Contributor

The previously pushed changes include two feature-sampling strategies, chosen based on whether the sampling problem size fits within the available static shared memory and register budget. The default strategy is a sorting-based sampling (this kernel). Another strategy, a batchwise adaptation of Algorithm L of reservoir sampling (this kernel), is used as a fallback.

The former strategy is more performant than the latter (by about 1.5x on target datasets with wide columns, ~100000).
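
For intuition, here is a minimal CPU-side Python sketch of the two sampling ideas (sort-based selection without replacement, and Algorithm L reservoir sampling), assuming k features are sampled out of n. This only illustrates the algorithms; the actual implementations are the CUDA kernels linked above.

```python
import math
import numpy as np

def sort_based_sample(n_features: int, k: int, rng: np.random.Generator):
    """Sort-based idea: give every feature a random key, sort, keep the k smallest."""
    keys = rng.random(n_features)
    return np.argsort(keys)[:k]               # k distinct feature indices

def reservoir_sample_algo_l(n_features: int, k: int, rng: np.random.Generator):
    """Algorithm L: keep a k-slot reservoir and skip ahead geometrically."""
    reservoir = list(range(k))                # seed with the first k indices
    w = math.exp(math.log(rng.random()) / k)
    i = k - 1
    while True:
        # jump over a random number of items instead of visiting each one
        i += int(math.log(rng.random()) / math.log(1.0 - w)) + 1
        if i >= n_features:
            break
        reservoir[rng.integers(k)] = i        # evict a random slot
        w *= math.exp(math.log(rng.random()) / k)
    return reservoir

rng = np.random.default_rng(42)
print(sort_based_sample(100_000, 316, rng))       # ~sqrt(100000) features
print(reservoir_sample_algo_l(100_000, 316, rng))
```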

benchmarks on GBM datasets

  • No regressions in accuracy
  • There is only a slight improvement in performance for the gbm-bench datasets, since their column counts are small and the improvement in the feature-sampling portion therefore does not significantly affect end-to-end times.

max_features: "sqrt", n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 0.5, bootstrap: true
max_features: 0.7, n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 0.5, bootstrap: true
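
For context, the two configurations above map roughly onto cuML's Python API as in the sketch below (hedged: dataset loading and the gbm-bench harness are omitted, and the parameter names are assumed to match current cuML releases):

```python
from cuml.ensemble import RandomForestClassifier

common = dict(n_estimators=1000, n_bins=128, n_streams=4,
              max_samples=0.5, bootstrap=True)

rf_sqrt = RandomForestClassifier(max_features="sqrt", **common)  # configuration 1
rf_frac = RandomForestClassifier(max_features=0.7, **common)     # configuration 2

# rf_sqrt.fit(X_train, y_train); rf_sqrt.score(X_test, y_test)
```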

benchmark on a representative synthetic dataset

  • This feature-sampling improvement is more pronounced when feature sampling becomes the bottleneck, i.e. for very wide datasets (~100000 cols).
  • Below is a benchmark on such a representative synthetic regression dataset with 1000 rows and 100000 cols.

max_features: "sqrt", n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 1.0, bootstrap: true

The above benchmarks have been rerun after the latest commit and modified in-place
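
The wide synthetic benchmark above can be reproduced roughly along these lines (a sketch only: the exact data generation and timing harness behind the screenshots are not shown, and n_informative is a hypothetical choice):

```python
from cuml.datasets import make_regression
from cuml.ensemble import RandomForestRegressor

# Stand-in for the benchmark's data: 1000 rows x 100000 cols
X, y = make_regression(n_samples=1000, n_features=100_000,
                       n_informative=100, random_state=0)

rf = RandomForestRegressor(
    n_estimators=1000,      # n_trees
    max_features="sqrt",
    n_bins=128,
    n_streams=4,
    max_samples=1.0,
    bootstrap=True,
)
rf.fit(X, y)
```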

@vinaydes vinaydes changed the title [WIP] Improvements in feature sampling [REVIEW] Improvements in feature sampling Jun 16, 2022
@venkywonka
Contributor

rerun tests

@github-actions github-actions bot added the Cython / Python (Cython or Python issue) label Jun 29, 2022
@venkywonka
Contributor

@teju85 could you give a final review? ✌🏻

@teju85
Member

teju85 commented Jun 29, 2022

Not sure if I'll get time to review this PR soon. Maybe @vinaydes or @tfeher ?

@vinaydes
Contributor Author

I have already had a look at the code and I am okay with the changes. I am not approving since part of the code was written by me and I have been involved in the PR from the start. @tfeher Let us know if we need to explain the changes for you to review.

Contributor

@tfeher tfeher left a comment

Thanks @venkywonka and @vinaydes for the PR! In general it looks good; there are only a few minor issues.

@vinaydes
Contributor Author

@dantegd I have addressed the changes @tfeher had asked for. This PR can now be merged.

@vinaydes
Contributor Author

vinaydes commented Aug 1, 2022

@tfeher I think you need to re-review or accept the changes, because merging is blocked on that.

Contributor

@tfeher tfeher left a comment

Thanks @vinaydes for addressing the issues!

I think my request for improving the docstring was not clear, so I have added a suggestion to illustrate what I meant.

On one hand this is a nitpick and it should not hold up this PR, therefore I am approving it.
On the other hand, if someone picks up rapidsai/raft#767, then such information could be very useful.

@vinaydes
Contributor Author

vinaydes commented Aug 1, 2022

I'll take a look at CI failures.

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@33c0170).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08    #4278   +/-   ##
===============================================
  Coverage                ?   78.02%           
===============================================
  Files                   ?      180           
  Lines                   ?    11385           
  Branches                ?        0           
===============================================
  Hits                    ?     8883           
  Misses                  ?     2502           
  Partials                ?        0           
Flag       Coverage Δ
dask       46.21% <0.00%> (?)
non-dask   67.27% <0.00%> (?)

Flags with carried forward coverage won't be shown.


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33c0170...b5751f1.

@dantegd
Member

dantegd commented Aug 3, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 3b3b891 into rapidsai:branch-22.08 Aug 3, 2022
@venkywonka
Contributor

Yaaaay ❤️🥳

jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
With this PR, the feature sampling overhead is greatly reduced, especially for wide (thousands of features) datasets. The PR requires some structural changes in RAFT and is therefore marked as WIP.

Authors:
  - Vinay Deshpande (https://github.com/vinaydes)
  - Ray Douglass (https://github.com/raydouglass)
  - Andy Adinets (https://github.com/canonizer)
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Jiwei Liu (https://github.com/daxiongshu)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Christopher Akiki (https://github.com/cakiki)
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4278
Labels
  • 3 - Ready for Review (Ready for review by team)
  • CUDA/C++
  • Cython / Python (Cython or Python issue)
  • improvement (Improvement / enhancement to an existing function)
  • non-breaking (Non-breaking change)