[docs/data] Add Blocks and Streaming Execution to Quickstart #50022
Signed-off-by: Richard Liaw <[email protected]>
The following figure visualizes a dataset with three blocks, each holding 1000 rows.
Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
(which is usually the driver) and stores the blocks as objects in Ray's shared-memory
Maybe link to the driver definition in the glossary? A new Ray user may not know exactly what it is.
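To make the block layout in the quoted passage concrete, here is a minimal pure-Python sketch that partitions 3000 rows into three blocks of 1000 rows each. It is a toy model, not Ray Data's implementation: the `make_blocks` helper and the dict-shaped rows are assumptions made for illustration, and the blocks here sit in a local list rather than in Ray's object store.

```python
from typing import Dict, List

Row = Dict[str, int]
Block = List[Row]


def make_blocks(num_rows: int, rows_per_block: int) -> List[Block]:
    """Partition rows 0..num_rows-1 into contiguous blocks (hypothetical helper)."""
    rows = [{"id": i} for i in range(num_rows)]
    return [
        rows[start:start + rows_per_block]
        for start in range(0, num_rows, rows_per_block)
    ]


# Three blocks of 1000 rows each, mirroring the figure described in the docs.
blocks = make_blocks(num_rows=3000, rows_per_block=1000)
block_sizes = [len(block) for block in blocks]
print(len(blocks), block_sizes)
```

In this model the "dataset" is just the list of blocks; the point is only that rows are grouped into independent units that can be processed separately.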
Key benefits of streaming execution include:

* **Memory Efficient**: Processes data in chunks rather than loading everything into memory
Nitpick, but best to stick to "blocks" and not introduce a new term.
The streaming model allows Ray Data to efficiently handle datasets much larger than memory while maintaining high performance through parallel execution.

.. note::
    Operations like :meth:`ds.sort() <ray.data.Dataset.sort>` and :meth:`ds.groupby() <ray.data.Dataset.groupby>` require materializing data, which may impact memory usage for very large datasets.
nitpick: best to refer to methods that are decorated as consumption API (e.g. `ds.split` or `ds.split_at_indices`) as opposed to all-to-all operations like `groupby`, which theoretically don't require fully materializing the data
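The distinction under discussion can be illustrated with a toy model built on Python generators. This is only a sketch under stated assumptions: `read_blocks`, `map_blocks`, and `sort_all` are hypothetical helpers, not Ray Data APIs, and the sort below gathers every row simply because that is the easiest way to write it. As the comment above notes, a real all-to-all operation need not hold the whole dataset in memory at once.

```python
from typing import Iterable, Iterator, List

Block = List[int]


def read_blocks() -> Iterator[Block]:
    """Yield blocks one at a time (stand-in for a streaming read)."""
    yield [5, 3]
    yield [9, 1]
    yield [4, 8]


def map_blocks(blocks: Iterable[Block]) -> Iterator[Block]:
    """Per-block transform: each output block depends on one input block only."""
    for block in blocks:
        yield [x * 10 for x in block]


def sort_all(blocks: Iterable[Block]) -> List[int]:
    """Global sort in this toy model: gathers every row before emitting output."""
    all_rows = [x for block in blocks for x in block]
    return sorted(all_rows)


# The first mapped block is available after reading only one input block.
first_mapped = next(map_blocks(read_blocks()))

# The sorted result needs every block to have been read and mapped.
sorted_rows = sort_all(map_blocks(read_blocks()))
```

The per-block transform can hand results downstream as soon as one block is ready, while the toy sort cannot emit its first row until it has seen the last block.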
@@ -3,21 +3,52 @@
Quickstart
==========

- Learn about :class:`Dataset <ray.data.Dataset>` and the capabilities it provides.
+ This page introduces the :class:`Dataset <ray.data.Dataset>` concept and the capabilities it provides.
nit: perhaps best to state that this is a guide to getting up and running with Ray Data, rather than a page "about the Dataset concept"
LGTM.
I'd prefer to move key concepts to a dedicated "Key concepts" page for consistency with other libraries and so that the "Quickstart" just illustrates basic usage, but will defer to @angelinalg
This guide provides a lightweight introduction to:

* :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`
According to the style guide the Data docs adhere to, use sentence case for headings and titles: https://developers.google.com/style/capitalization#capitalization-in-titles-and-headings
- * :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`
+ * :ref:`Key Concepts: Datasets and blocks <data_key_concepts>`

* :ref:`Loading data <loading_key_concept>`
* :ref:`Transforming data <transforming_key_concept>`
* :ref:`Consuming data <consuming_key_concept>`
* :ref:`Saving data <saving_key_concept>`
* :ref:`Streaming Execution Model <streaming_execution_model>`
- * :ref:`Streaming Execution Model <streaming_execution_model>`
+ * :ref:`Streaming execution model <streaming_execution_model>`

* **Memory Efficient**: Processes data in chunks rather than loading everything into memory
* **Pipeline Parallelism**: Different stages of the pipeline can execute concurrently
* **Automatic Memory Management**: Ray Data automatically spills data to disk if memory pressure is high
* **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline
https://developers.google.com/style/contractions
- * **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline
+ * **Lazy Evaluation**: Transformations aren't executed until an action triggers the pipeline
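For readers of this thread, the lazy-evaluation and block-at-a-time behavior listed in the bullets can be sketched with plain Python generators. This is an analogy under stated assumptions, not Ray Data's executor: `read_blocks`, `map_blocks`, and the `executed` log are hypothetical names, and the stages here interleave on a single thread rather than running concurrently.

```python
from typing import Callable, Iterable, Iterator, List

Block = List[int]
executed: List[str] = []  # records which steps actually ran, and in what order


def read_blocks(num_blocks: int, rows_per_block: int) -> Iterator[Block]:
    """Produce blocks on demand (stand-in for a lazy read)."""
    for b in range(num_blocks):
        executed.append(f"read {b}")
        start = b * rows_per_block
        yield list(range(start, start + rows_per_block))


def map_blocks(fn: Callable[[int], int], blocks: Iterable[Block]) -> Iterator[Block]:
    """Apply fn to every row, one block at a time."""
    for i, block in enumerate(blocks):
        executed.append(f"map {i}")
        yield [fn(x) for x in block]


# Building the pipeline runs nothing: generators only describe the work.
pipeline = map_blocks(lambda x: x * 2, read_blocks(num_blocks=3, rows_per_block=2))
steps_before_consumption = list(executed)

# Consuming the pipeline is the "action" that triggers execution.
total = 0
for block in pipeline:
    total += sum(block)
```

After the loop, `executed` alternates between read and map steps, which shows that each block flows through the whole pipeline before the next one is read, so only one block is in flight at a time in this model.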

.. _streaming_execution_model:

Streaming Execution Model
- Streaming Execution Model
+ Streaming execution model

OK closing
## Why are these changes needed?

Refresher for #50022, but on a separate page and a bit more holistic. It's not tightly integrated into the other pages yet but I will do a revision of quickstart/overview/data.rst pages.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Why are these changes needed?
Adds:
This update makes the quickstart guide more beginner-friendly by explaining
fundamental concepts upfront and highlighting Ray Data's efficient streaming
execution model.
Related issue number
Checks
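- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.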