Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] num_rows_per_file doesn't work with small values #45393

Closed
bveeramani opened this issue May 16, 2024 · 3 comments · Fixed by #49978
Closed

[Data] num_rows_per_file doesn't work with small values #45393

bveeramani opened this issue May 16, 2024 · 3 comments · Fixed by #49978
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P1 Issue that should be fixed within a few weeks

Comments

@bveeramani
Copy link
Member

What happened + What you expected to happen

My dataset contains 28k rows. I tried writing this to Parquet with num_rows_per_file=700. I expected several Parquet files each with 700 rows, but instead got a single file with all the rows.

Versions / Dependencies

4d37e55

Reproduction script

import os

import ray

ray.data.range(100, override_num_blocks=1).write_parquet(
    "/tmp/results", num_rows_per_file=10
)


print(os.listdir("/tmp/results"))

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@bveeramani bveeramani added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels May 16, 2024
@raulchen
Copy link
Contributor

raulchen commented Jun 3, 2024

@bveeramani assuming this is a won't-fix, I'll close this issue

@raulchen raulchen closed this as completed Jun 3, 2024
@raulchen
Copy link
Contributor

raulchen commented Jun 3, 2024

actually this should be fixed

@raulchen raulchen reopened this Jun 3, 2024
@raulchen raulchen added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 3, 2024
@richardliaw
Copy link
Contributor

how should we fix this?

Should it be best effort with +/- O(num_blocks) extra files, and ask user to repartition explicitly for more precision?

anson627 pushed a commit to anson627/ray that referenced this issue Jan 31, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.

Closes ray-project#45393

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: Anson Qian <[email protected]>
anson627 pushed a commit to anson627/ray that referenced this issue Jan 31, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.

Closes ray-project#45393

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: Anson Qian <[email protected]>
srinathk10 pushed a commit that referenced this issue Feb 2, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.


Closes #45393 

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
anyadontfly pushed a commit to anyadontfly/ray that referenced this issue Feb 13, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.

Closes ray-project#45393

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: Puyuan Yao <[email protected]>
park12sj pushed a commit to park12sj/ray that referenced this issue Mar 18, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.


Closes ray-project#45393 

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P1 Issue that should be fixed within a few weeks
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants