[docs/data] Add Blocks and Streaming Execution to Quickstart #50022

richardliaw · 2025-01-23T01:18:30Z

Why are these changes needed?

Adds:

- An introduction to the Dataset and Block concepts, with clear explanations and a visual diagram
- A new section explaining Ray Data's streaming execution model

This update makes the quickstart guide more beginner-friendly by explaining
fundamental concepts upfront and highlighting Ray Data's efficient streaming
execution model.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Richard Liaw <[email protected]>

scottsun94 · 2025-01-23T01:53:08Z

doc/source/data/quickstart.rst

+
+The following figure visualizes a dataset with three blocks, each holding 1000 rows.
+Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
+(which is usually the driver) and stores the blocks as objects in Ray's shared-memory


Maybe link to the driver definition in the glossary? New Ray users may not know exactly what it is.
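
For readers following this thread, the layout in the quoted paragraph (a dataset with three blocks of 1000 rows each) can be pictured with a minimal pure-Python sketch. This is only an illustration: `Block` and `split_into_blocks` are hypothetical names, not Ray Data APIs, and real blocks live in Ray's object store rather than in a local list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a block: a contiguous slice of rows."""
    rows: List[dict]


def split_into_blocks(rows: List[dict], rows_per_block: int) -> List[Block]:
    """Partition rows into consecutive blocks of at most rows_per_block rows."""
    return [
        Block(rows[i:i + rows_per_block])
        for i in range(0, len(rows), rows_per_block)
    ]


# 3000 rows split into three blocks of 1000 rows each, as in the figure.
rows = [{"id": i} for i in range(3000)]
blocks = split_into_blocks(rows, rows_per_block=1000)
block_sizes = [len(block.rows) for block in blocks]
print(block_sizes)
```

In this toy model the "dataset" is just the list of blocks; the quoted text describes the real `Dataset` as living on the process that triggers execution while the blocks are stored as objects in shared memory.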

marwan116 · 2025-01-23T02:37:12Z

doc/source/data/quickstart.rst

+
+Key benefits of streaming execution include:
+
+* **Memory Efficient**: Processes data in chunks rather than loading everything into memory


nitpick, but best to stick to "blocks" and not introduce a new term ("chunks")

marwan116 · 2025-01-23T02:41:10Z

doc/source/data/quickstart.rst

+The streaming model allows Ray Data to efficiently handle datasets much larger than memory while maintaining high performance through parallel execution.
+
+.. note::
+   Operations like :meth:`ds.sort() <ray.data.Dataset.sort>` and :meth:`ds.groupby() <ray.data.Dataset.groupby>` require materializing data, which may impact memory usage for very large datasets.


nitpick: best to refer to methods that are decorated as consumption APIs (e.g. ds.split or ds.split_at_indices), as opposed to all-to-all operations like groupby, which theoretically don't require fully materializing the data
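
The difference between streaming blocks through a pipeline and materializing them can be sketched in plain Python with generators. This is an assumption-laden toy, not Ray Data's executor: `read_blocks`, `map_blocks`, and `materialize` are hypothetical helpers, and the sketch makes no claim about which Ray Data methods materialize.

```python
from typing import Callable, Iterable, Iterator, List

Block = List[int]  # hypothetical: a block is just a list of rows here


def read_blocks(num_blocks: int, rows_per_block: int) -> Iterator[Block]:
    """Yield blocks one at a time instead of building them all up front."""
    for b in range(num_blocks):
        start = b * rows_per_block
        yield list(range(start, start + rows_per_block))


def map_blocks(fn: Callable[[int], int], blocks: Iterable[Block]) -> Iterator[Block]:
    """Apply fn to every row, emitting each output block as soon as it is ready."""
    for block in blocks:
        yield [fn(row) for row in block]


def materialize(blocks: Iterable[Block]) -> List[Block]:
    """Pull every block into memory at once."""
    return list(blocks)


# Streaming: only one block is held at a time while the total accumulates.
total = 0
for block in map_blocks(lambda x: x * 2, read_blocks(3, 1000)):
    total += sum(block)

# Materializing: all three output blocks are held in memory together.
all_blocks = materialize(map_blocks(lambda x: x * 2, read_blocks(3, 1000)))
```

The streaming loop never holds more than one block, while `materialize` holds all of them; that is the memory trade-off the note under review is trying to flag.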

marwan116 · 2025-01-23T02:45:57Z

doc/source/data/quickstart.rst

@@ -3,21 +3,52 @@
 Quickstart
 ==========

-Learn about :class:`Dataset <ray.data.Dataset>` and the capabilities it provides.
+This page introduces the :class:`Dataset <ray.data.Dataset>` concept and the capabilities it provides.


nit: perhaps best to state that this is a guide to getting up and running with Ray Data instead of "about the Dataset concept"

bveeramani

LGTM.

I'd prefer to move key concepts to a dedicated "Key concepts" page for consistency with other libraries and so that the "Quickstart" just illustrates basic usage, but will defer to @angelinalg

bveeramani · 2025-01-23T07:17:09Z

doc/source/data/quickstart.rst


 This guide provides a lightweight introduction to:

+* :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`


According to the style guide the Data docs adhere to, use sentence case for headings and titles: https://developers.google.com/style/capitalization#capitalization-in-titles-and-headings

Suggested change

* :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`

* :ref:`Key Concepts: Datasets and blocks <data_key_concepts>`

bveeramani · 2025-01-23T07:17:17Z

doc/source/data/quickstart.rst

 * :ref:`Loading data <loading_key_concept>`
 * :ref:`Transforming data <transforming_key_concept>`
 * :ref:`Consuming data <consuming_key_concept>`
 * :ref:`Saving data <saving_key_concept>`
+* :ref:`Streaming Execution Model <streaming_execution_model>`


Suggested change

* :ref:`Streaming Execution Model <streaming_execution_model>`

* :ref:`Streaming execution model <streaming_execution_model>`

bveeramani · 2025-01-23T07:21:20Z

doc/source/data/quickstart.rst

+* **Memory Efficient**: Processes data in chunks rather than loading everything into memory
+* **Pipeline Parallelism**: Different stages of the pipeline can execute concurrently
+* **Automatic Memory Management**: Ray Data automatically spills data to disk if memory pressure is high
+* **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline


https://developers.google.com/style/contractions

Suggested change

* **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline

* **Lazy Evaluation**: Transformations aren't executed until an action triggers the pipeline
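
The "Lazy Evaluation" bullet can be illustrated with a small pure-Python sketch. `LazyPipeline` is a hypothetical class written for this thread, not part of Ray Data; it only shows the general idea that recording a transformation does no work until a consuming call runs the pipeline.

```python
from typing import Callable, Iterable, List


class LazyPipeline:
    """Hypothetical lazy pipeline: map() records work, take_all() runs it."""

    def __init__(self, source: Iterable[int]):
        self._source = source
        self._transforms: List[Callable[[int], int]] = []

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Return a new pipeline with fn appended; nothing is executed here.
        new = LazyPipeline(self._source)
        new._transforms = self._transforms + [fn]
        return new

    def take_all(self) -> List[int]:
        # The "action": apply every recorded transform to every row.
        out = []
        for row in self._source:
            for fn in self._transforms:
                row = fn(row)
            out.append(row)
        return out


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # record that the transform actually ran
    return x * 2


pipeline = LazyPipeline(range(5)).map(double)
calls_before = len(calls)  # map() alone has not invoked double
result = pipeline.take_all()
calls_after = len(calls)  # the action invoked double once per row
```

Comparing `calls_before` with `calls_after` shows that the transform only runs once the action is called.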

bveeramani · 2025-01-23T07:21:31Z

doc/source/data/quickstart.rst

+
+.. _streaming_execution_model:
+
+Streaming Execution Model


Suggested change

Streaming Execution Model

Streaming execution model

richardliaw · 2025-01-29T16:19:37Z

OK closing

## Why are these changes needed?

Refresher for #50022, but on a separate page and a bit more holistic. It's not tightly integrated into the other pages yet but I will do a revision of quickstart/overview/data.rst pages.

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>

richardliaw added 2 commits January 22, 2025 17:16

add-quickstart-data

6f44f47

Signed-off-by: Richard Liaw <[email protected]>

fixsmall

82039ee

Signed-off-by: Richard Liaw <[email protected]>

richardliaw requested a review from a team as a code owner January 23, 2025 01:18

scottsun94 reviewed Jan 23, 2025

marwan116 reviewed Jan 23, 2025

bveeramani approved these changes Jan 23, 2025

richardliaw closed this Jan 29, 2025

richardliaw mentioned this pull request Jan 29, 2025

[data/docs] Key Concepts Page #50129

Merged



