-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] num_rows_per_file
doesn't work with small values
#45393
Comments
8 tasks
@bveeramani assuming this is a won't-fix, I'll close this issue |
actually this should be fixed |
how should we fix this? Should it be best effort with +/- |
8 tasks
anson627
pushed a commit
to anson627/ray
that referenced
this issue
Jan 31, 2025
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes ray-project#45393 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Signed-off-by: Anson Qian <[email protected]>
anson627
pushed a commit
to anson627/ray
that referenced
this issue
Jan 31, 2025
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes ray-project#45393 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Signed-off-by: Anson Qian <[email protected]>
srinathk10
pushed a commit
that referenced
this issue
Feb 2, 2025
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes #45393 ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>
anyadontfly
pushed a commit
to anyadontfly/ray
that referenced
this issue
Feb 13, 2025
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes ray-project#45393 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Signed-off-by: Puyuan Yao <[email protected]>
park12sj
pushed a commit
to park12sj/ray
that referenced
this issue
Mar 18, 2025
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes ray-project#45393 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
What happened + What you expected to happen
My dataset contains 28k rows. I tried writing this to Parquet with
num_rows_per_file=700
. I expected several Parquet files each with 700 rows, but instead got a single file with all the rows.Versions / Dependencies
4d37e55
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.
The text was updated successfully, but these errors were encountered: