[docs/data] Add Blocks and Streaming Execution to Quickstart #50022

Closed

richardliaw
Contributor

Why are these changes needed?

Adds:

  • An introduction to the Dataset and Block concepts, with explanations and a visual diagram
  • A new section explaining Ray Data's streaming execution model

This update makes the quickstart guide more beginner-friendly by explaining
fundamental concepts upfront and highlighting Ray Data's efficient streaming
execution model.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Richard Liaw <[email protected]>
@richardliaw requested a review from a team as a code owner on January 23, 2025 01:18

The following figure visualizes a dataset with three blocks, each holding 1000 rows.
Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
(which is usually the driver) and stores the blocks as objects in Ray's shared-memory
Contributor

Maybe link to the driver definition in the glossary? New Ray users may not know exactly what it is.
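
To make the dataset-and-blocks description in the excerpt above concrete, here is a minimal plain-Python sketch of partitioning 3000 rows into three blocks of 1000 rows each. It doesn't use Ray; `split_into_blocks`, `Row`, and `Block` are hypothetical names chosen for illustration, and real Ray Data blocks are stored as objects managed by Ray rather than as Python lists.

```python
from typing import Dict, List

# Hypothetical stand-ins for illustration only; not Ray Data types.
Row = Dict[str, int]
Block = List[Row]


def split_into_blocks(rows: List[Row], rows_per_block: int) -> List[Block]:
    """Partition rows into contiguous blocks of at most rows_per_block rows."""
    if rows_per_block <= 0:
        raise ValueError("rows_per_block must be positive")
    return [rows[i : i + rows_per_block] for i in range(0, len(rows), rows_per_block)]


rows = [{"id": i} for i in range(3000)]
blocks = split_into_blocks(rows, rows_per_block=1000)

# Each block holds a disjoint, contiguous subset of the rows.
block_sizes = [len(block) for block in blocks]
print(block_sizes)
```

The point of the sketch is only that a dataset can be treated as a list of independent blocks, which is what lets each block be processed separately.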


Key benefits of streaming execution include:

* **Memory Efficient**: Processes data in chunks rather than loading everything into memory
Contributor

nitpick but best stick to blocks and not introduce a new term

The streaming model allows Ray Data to efficiently handle datasets much larger than memory while maintaining high performance through parallel execution.

.. note::
   Operations like :meth:`ds.sort() <ray.data.Dataset.sort>` and :meth:`ds.groupby() <ray.data.Dataset.groupby>` require materializing data, which may impact memory usage for very large datasets.
Contributor

Nitpick: it's best to refer to methods that are decorated as consumption APIs (e.g. ds.split or ds.split_at_indices) rather than all-to-all operations like groupby, which theoretically don't require fully materializing the data.
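
The reviewer's point that a group-by doesn't necessarily need every row in memory can be illustrated with a small plain-Python sketch. It covers only a per-key count, it is not how Ray Data implements `groupby`, and `streaming_group_count` is a hypothetical helper. The running state grows with the number of distinct keys, not with the number of rows.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical block type for illustration only; not a Ray Data type.
Block = List[Dict[str, str]]


def streaming_group_count(blocks: Iterable[Block], key: str) -> Counter:
    """Count rows per key value while holding only one block at a time."""
    totals: Counter = Counter()
    for block in blocks:
        # Fold this block into the running totals, then let it go.
        totals.update(row[key] for row in block)
    return totals


blocks = [
    [{"label": "a"}, {"label": "b"}],
    [{"label": "a"}],
    [{"label": "c"}, {"label": "a"}],
]
counts = streaming_group_count(iter(blocks), "label")
print(dict(counts))
```

Aggregations that need all rows of a group at once would not fit this pattern, so the sketch supports the reviewer's "theoretically" rather than a general claim.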

@@ -3,21 +3,52 @@
Quickstart
==========

-Learn about :class:`Dataset <ray.data.Dataset>` and the capabilities it provides.
+This page introduces the :class:`Dataset <ray.data.Dataset>` concept and the capabilities it provides.
Contributor

nit: but perhaps best to state this is a guide to get up and running with Ray Data instead of "about the Dataset concept"

Member

@bveeramani left a comment

LGTM.

I'd prefer to move key concepts to a dedicated "Key concepts" page for consistency with other libraries and so that the "Quickstart" just illustrates basic usage, but will defer to @angelinalg


This guide provides a lightweight introduction to:

* :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`
Member

According to the style guide the Data docs adhere to, use sentence case for headings and titles: https://developers.google.com/style/capitalization#capitalization-in-titles-and-headings

Suggested change
-* :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`
+* :ref:`Key Concepts: Datasets and blocks <data_key_concepts>`

* :ref:`Loading data <loading_key_concept>`
* :ref:`Transforming data <transforming_key_concept>`
* :ref:`Consuming data <consuming_key_concept>`
* :ref:`Saving data <saving_key_concept>`
* :ref:`Streaming Execution Model <streaming_execution_model>`
Member

Suggested change
-* :ref:`Streaming Execution Model <streaming_execution_model>`
+* :ref:`Streaming execution model <streaming_execution_model>`

* **Memory Efficient**: Processes data in chunks rather than loading everything into memory
* **Pipeline Parallelism**: Different stages of the pipeline can execute concurrently
* **Automatic Memory Management**: Ray Data automatically spills data to disk if memory pressure is high
* **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline
Member

https://developers.google.com/style/contractions

Suggested change
-* **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline
+* **Lazy Evaluation**: Transformations aren't executed until an action triggers the pipeline
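
The lazy-evaluation and block-at-a-time behavior described in the bullets above can be sketched with plain Python generators. This is an analogy under stated assumptions, not Ray Data's executor: `read_blocks`, `map_blocks`, and `log` are hypothetical names, and the sketch runs in a single thread, so it shows stages interleaving per block rather than true pipeline parallelism.

```python
from typing import Callable, Iterator, List

Block = List[int]
log: List[str] = []  # records the order in which work actually happens


def read_blocks(num_blocks: int, rows_per_block: int) -> Iterator[Block]:
    """Produce blocks one at a time instead of loading everything up front."""
    for b in range(num_blocks):
        log.append(f"read block {b}")
        start = b * rows_per_block
        yield list(range(start, start + rows_per_block))


def map_blocks(blocks: Iterator[Block], fn: Callable[[int], int]) -> Iterator[Block]:
    """Apply fn to every row, one block at a time."""
    for b, block in enumerate(blocks):
        log.append(f"map block {b}")
        yield [fn(row) for row in block]


# Building the pipeline does no work: nothing has been read or mapped yet.
pipeline = map_blocks(read_blocks(3, 4), lambda x: x * 2)
work_before_consumption = list(log)

# Consuming one block pulls exactly one block through both stages.
first_block = next(pipeline)
work_after_first_block = list(log)
```

Only one block is in flight at a time here, which is the memory-efficiency idea in miniature; block 1 isn't read until the consumer asks for it.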


.. _streaming_execution_model:

Streaming Execution Model
Member

Suggested change
-Streaming Execution Model
+Streaming execution model

@richardliaw
Contributor Author

OK closing

@richardliaw mentioned this pull request on Jan 29, 2025
8 tasks
richardliaw added a commit that referenced this pull request Feb 5, 2025
## Why are these changes needed?

Refresher for #50022, but on a
separate page and a bit more holistic.

It's not tightly integrated into the other pages yet but I will do a
revision of quickstart/overview/data.rst pages.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
4 participants