[Data] support batch_format for Sort and Aggregate #48287

Merged

scottjlee merged 8 commits into ray-project:master from xingyu-long:issue_46748

Nov 13, 2024

Contributor

xingyu-long commented Oct 27, 2024 (edited by scottjlee)

Why are these changes needed?

When calling xxx.map_groups(..., batch_format="..."), we may invoke the sort function, which creates empty blocks that still use pyarrow by default. When we then invoke another sort on top of it, we hit AttributeError: 'DataFrame' object has no attribute 'num_rows', because we use the type of the first block even though the blocks may have different types. See #46748 for more details.
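The failure mode described above can be illustrated with a toy model. This is only a sketch, not Ray's code: `ArrowLikeTable`, `PandasLikeFrame`, and both counting functions are hypothetical names invented here, and the per-block variant is shown purely for contrast (the PR itself addresses the problem by making the batch format consistent rather than by checking each block).

```python
class ArrowLikeTable:
    """Hypothetical stand-in for an Arrow table: exposes `num_rows`."""

    def __init__(self, rows):
        self._rows = list(rows)

    @property
    def num_rows(self):
        return len(self._rows)


class PandasLikeFrame:
    """Hypothetical stand-in for a DataFrame: has `len()` but no `num_rows`."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        return len(self._rows)


def count_rows_by_first_block(blocks):
    # Buggy shape: the accessor is chosen once, from the first block only.
    if isinstance(blocks[0], ArrowLikeTable):
        return sum(block.num_rows for block in blocks)
    return sum(len(block) for block in blocks)


def count_rows_per_block(blocks):
    # Contrast: inspect each block's own type before touching it.
    total = 0
    for block in blocks:
        if isinstance(block, ArrowLikeTable):
            total += block.num_rows
        else:
            total += len(block)
    return total


# An empty Arrow-style block followed by a pandas-style block.
mixed_blocks = [ArrowLikeTable([]), PandasLikeFrame([{"id": 1}, {"id": 2}])]

try:
    count_rows_by_first_block(mixed_blocks)
    first_block_error = None
except AttributeError as exc:
    # The Arrow-style accessor was applied to the pandas-style block.
    first_block_error = exc
```

In this toy setup, the empty first block decides which accessor is used for every block, so the second block fails with an `AttributeError` that mentions `num_rows`.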

Related issue number

Close #46748

Checks

- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [x] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

xingyu-long requested review from scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners

October 27, 2024 22:02

xingyu-long force-pushed the issue_46748 branch from a2b2bc3 to 0045e27 Compare

October 27, 2024 22:03

Contributor Author

xingyu-long commented Oct 27, 2024

Hi @scottjlee, could you take a look when you have time? Thanks a lot!

scottjlee reviewed

Contributor

scottjlee left a comment

Great start. I have a suggestion for how we can set the batch_format without explicitly setting it at the Dataset level.

python/ray/data/grouped_data.py (outdated, resolved)

python/ray/data/dataset.py (outdated, resolved)

scottjlee reviewed

python/ray/data/tests/test_execution_optimizer.py (outdated, resolved)

xingyu-long requested a review from scottjlee

October 29, 2024 05:04

scottjlee reviewed

python/ray/data/_internal/logical/rules/inherit_batch_format.py (outdated, resolved)

python/ray/data/_internal/logical/rules/inherit_batch_format.py (outdated, resolved)

xingyu-long force-pushed the issue_46748 branch from 6fc8b80 to bc0caf0 Compare

November 3, 2024 18:36

xingyu-long requested a review from scottjlee

November 3, 2024 18:39

xingyu-long changed the title ~~[Data] support batch_format for Sort~~ [Data] support batch_format for Sort and Aggregate

xingyu-long force-pushed the issue_46748 branch from bc0caf0 to e3f14e7 Compare

November 3, 2024 18:40

scottjlee reviewed

python/ray/data/_internal/planner/exchange/sort_task_spec.py (outdated, resolved)

python/ray/data/_internal/logical/rules/inherit_batch_format.py (outdated, resolved)

xingyu-long requested review from alexeykudinkin and srinathk10 as code owners

November 8, 2024 03:02

xingyu-long force-pushed the issue_46748 branch from 81097ac to f39498b Compare

November 8, 2024 03:04

xingyu-long added 5 commits

November 7, 2024 19:07


          [Data] support batch_format for Sort

b353f4f

Signed-off-by: Xingyu Long <[email protected]>


          Remove batch_format at dataset level to address Scott's comments

86af289

Signed-off-by: Xingyu Long <[email protected]>


          Add inherit_batch_format rule

21eb19d

Signed-off-by: Xingyu Long <[email protected]>


          Update tests to verify inherit_batch_format rule

9e35188

Signed-off-by: Xingyu Long <[email protected]>


          address the comments

e4f5009

Signed-off-by: Xingyu Long <[email protected]>

xingyu-long force-pushed the issue_46748 branch from f39498b to e4f5009 Compare

November 8, 2024 03:08

xingyu-long requested a review from scottjlee

November 8, 2024 03:08

scottjlee approved these changes

Contributor

scottjlee left a comment

Thanks for your contribution @xingyu-long !

python/ray/data/_internal/logical/rules/inherit_batch_format.py (outdated, resolved)

scottjlee self-assigned this

scottjlee added the go label

alexeykudinkin reviewed

python/ray/data/_internal/logical/rules/inherit_batch_format.py (outdated, resolved)


          Use AbstractAllToAll instead of limiting to Sort and Aggregate

afeb03c

Signed-off-by: Xingyu Long <[email protected]>

xingyu-long force-pushed the issue_46748 branch from 20a24af to afeb03c Compare

November 9, 2024 00:24

xingyu-long requested review from alexeykudinkin and scottjlee

November 9, 2024 00:25

scottjlee approved these changes

alexeykudinkin reviewed

python/ray/data/_internal/logical/rules/inherit_batch_format.py

Comment on lines +22 to +25

+                      for node in op.post_order_iter():
+                          nodes.appendleft(node)
+                      while len(nodes) > 0:

Contributor

alexeykudinkin Nov 12, 2024

Let's combine these 2 loops

Contributor Author

xingyu-long Nov 13, 2024

Looking through the codebase, I was trying to follow the same coding style that is used elsewhere for this. It doesn't seem worth combining them? Thanks!

Contributor

alexeykudinkin Nov 19, 2024

We can't just blindly follow the patterns, these need to make sense, right?

What's the motivation for first collecting the nodes into a queue, instead of traversing the iterator directly?
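To make the question concrete, here is a minimal sketch that compares the two traversal shapes on a toy plan. Everything in it is an assumption for illustration: `Op`, `nearest_upstream_format`, and `build_plan` are hypothetical and are not Ray's classes, the quoted snippet does not show how the deque is drained (draining from the right with `pop()` is assumed here), and the upstream walk is only one plausible way to inherit a batch format.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Op:
    """Hypothetical stand-in for a logical operator; not Ray's class."""

    name: str
    inputs: List["Op"] = field(default_factory=list)
    batch_format: Optional[str] = None
    is_all_to_all: bool = False

    def post_order_iter(self) -> Iterator["Op"]:
        # Yield upstream operators first, then this operator.
        for dep in self.inputs:
            yield from dep.post_order_iter()
        yield self


def nearest_upstream_format(op: Op) -> Optional[str]:
    # Walk the first-input chain until an operator declares a batch format.
    current = op.inputs[0] if op.inputs else None
    while current is not None:
        if current.batch_format is not None:
            return current.batch_format
        current = current.inputs[0] if current.inputs else None
    return None


def inherit_with_queue(root: Op) -> List[str]:
    # Two-loop shape: collect nodes into a deque, then drain it.
    nodes = deque()
    for node in root.post_order_iter():
        nodes.appendleft(node)
    visited = []
    while len(nodes) > 0:
        current = nodes.pop()  # assumption: drained from the right
        visited.append(current.name)
        if current.is_all_to_all and current.batch_format is None:
            current.batch_format = nearest_upstream_format(current)
    return visited


def inherit_direct(root: Op) -> List[str]:
    # Single-loop shape: consume the post-order iterator directly.
    visited = []
    for current in root.post_order_iter():
        visited.append(current.name)
        if current.is_all_to_all and current.batch_format is None:
            current.batch_format = nearest_upstream_format(current)
    return visited


def build_plan() -> Op:
    # Read -> MapBatches(batch_format="pandas") -> Sort -> Aggregate
    read = Op("Read")
    map_batches = Op("MapBatches", [read], batch_format="pandas")
    sort = Op("Sort", [map_batches], is_all_to_all=True)
    return Op("Aggregate", [sort], is_all_to_all=True)


queue_plan, direct_plan = build_plan(), build_plan()
queue_order = inherit_with_queue(queue_plan)
direct_order = inherit_direct(direct_plan)
```

Under these assumptions, `appendleft` followed by `pop()` replays the nodes in the original post-order, so both shapes visit the operators in the same order and leave the same batch formats on the all-to-all operators. If the real code drains the deque from the left instead, the visit order would be reversed and the comparison would not hold.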

python/ray/data/_internal/logical/rules/inherit_batch_format.py (resolved)

python/ray/data/tests/test_execution_optimizer.py (resolved)

python/ray/data/tests/test_execution_optimizer.py (resolved)


          add comments for test cases

b92c5f6

Signed-off-by: Xingyu Long <[email protected]>

scottjlee approved these changes


          Merge branch 'master' into issue_46748

76d1dc1

scottjlee enabled auto-merge (squash)

November 13, 2024 18:23

scottjlee merged commit 3f195b4 into ray-project:master

6 checks passed

JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request


          [Data] support batch_format for Sort and Aggregate (ray-project#48287)

a461f7a

## Why are these changes needed?
When calling `xxx.map_groups(..., batch_format="...")`, we may invoke the
sort function, which creates empty blocks that still use pyarrow by
default. When we then invoke another sort on top of it, we hit
`AttributeError: 'DataFrame' object has no attribute 'num_rows'`
because we use the type of the first block, even though the blocks may
have different types. See ray-project#46748 for more details.

## Related issue number

Close ray-project#46748

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under
the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Xingyu Long <[email protected]>
Co-authored-by: Scott Lee <[email protected]>

mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request


          [Data] support batch_format for Sort and Aggregate (ray-project#48287)

2bb5a25

Signed-off-by: Xingyu Long <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Signed-off-by: mohitjain2504 <[email protected]>

dentiny pushed a commit to dentiny/ray that referenced this pull request


          [Data] support batch_format for Sort and Aggregate (ray-project#48287)

37e5d4e

Signed-off-by: Xingyu Long <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Signed-off-by: hjiang <[email protected]>

Reviewers

alexeykudinkin left review comments

scottjlee approved these changes

bveeramani: awaiting requested review

raulchen: awaiting requested review

stephanie-wang: awaiting requested review

omatthew98: awaiting requested review

srinathk10: awaiting requested review

Labels

go