[docs/data] Add Blocks and Streaming Execution to Quickstart #50022
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -3,21 +3,52 @@ | |||||
Quickstart | ||||||
========== | ||||||
|
||||||
Learn about :class:`Dataset <ray.data.Dataset>` and the capabilities it provides. | ||||||
This page introduces the :class:`Dataset <ray.data.Dataset>` concept and the capabilities it provides. | ||||||
|
||||||
This guide provides a lightweight introduction to: | ||||||
|
||||||
* :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>` | ||||||
According to the style guide the Data docs adhere to, use sentence case for headings and titles: https://developers.google.com/style/capitalization#capitalization-in-titles-and-headings
Suggested change
|
||||||
* :ref:`Loading data <loading_key_concept>` | ||||||
* :ref:`Transforming data <transforming_key_concept>` | ||||||
* :ref:`Consuming data <consuming_key_concept>` | ||||||
* :ref:`Saving data <saving_key_concept>` | ||||||
* :ref:`Streaming Execution Model <streaming_execution_model>` | ||||||
|
||||||
|
||||||
Datasets | ||||||
-------- | ||||||
.. _data_key_concepts: | ||||||
|
||||||
Key Concepts: Datasets and Blocks | ||||||
--------------------------------- | ||||||
|
||||||
There are two main concepts in Ray Data: | ||||||
|
||||||
* :class:`Dataset <ray.data.Dataset>` | ||||||
* :ref:`Blocks <blocks_key_concept>` | ||||||
|
||||||
:class:`Dataset <ray.data.Dataset>` is the main user-facing Python API. It represents a | ||||||
distributed data collection and defines data loading and processing operations. You | ||||||
typically use the API in this way: | ||||||
|
||||||
1. Create a Dataset from external storage or in-memory data. | ||||||
2. Apply transformations to the data. | ||||||
3. Write the outputs to external storage or feed the outputs to training workers. | ||||||
|
||||||
The Dataset API is lazy, meaning that operations are not executed until you call an action | ||||||
like :meth:`~ray.data.Dataset.show`. This allows Ray Data to optimize the execution plan | ||||||
and execute operations in parallel. | ||||||
|
||||||
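The lazy behavior described above can be illustrated with a plain-Python sketch. `LazyPlan` is a hypothetical stand-in invented for this illustration and is not part of Ray Data. It only shows the general idea that transformations are recorded first and run when an action is called.

```python
from typing import Callable, Iterable, List, Optional


class LazyPlan:
    """Hypothetical stand-in for a lazy dataset; not Ray Data's implementation."""

    def __init__(self, source: Iterable[int], ops: Optional[List[Callable]] = None):
        self._source = source
        self._ops = list(ops or [])

    def map(self, fn: Callable) -> "LazyPlan":
        # Record the operation and return a new plan; nothing runs yet.
        return LazyPlan(self._source, self._ops + [fn])

    def take(self, n: int) -> list:
        # The action: only now are the recorded operations applied.
        if n <= 0:
            return []
        out = []
        for item in self._source:
            for fn in self._ops:
                item = fn(item)
            out.append(item)
            if len(out) == n:
                break
        return out


calls = []


def double(x: int) -> int:
    calls.append(x)  # track when the function actually runs
    return x * 2


plan = LazyPlan(range(1_000)).map(double)
calls_before_action = list(calls)  # still empty: building the plan did no work
first_three = plan.take(3)         # the action triggers execution
```

In this sketch only the three rows that the action needs are processed. Ray Data's real planner and executor are considerably more involved.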
A *block* is the basic unit of data that Ray Data operates on. A block is a contiguous | ||||||
subset of rows from a dataset. | ||||||
|
||||||
The following figure visualizes a dataset with three blocks, each holding 1000 rows. | ||||||
Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution | ||||||
(which is usually the driver) and stores the blocks as objects in Ray's shared-memory | ||||||
Maybe link to the driver definition in the glossary? A new Ray user may not know exactly what it is. |
||||||
:ref:`object store <objects-in-ray>`. | ||||||
|
||||||
.. image:: images/dataset-arch.svg | ||||||
|
||||||
.. | ||||||
https://docs.google.com/drawings/d/1PmbDvHRfVthme9XD7EYM-LIHPXtHdOfjCbc1SCsM64k/edit | ||||||
|
||||||
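As a rough illustration of the block concept, the following plain-Python sketch splits 3000 rows into three contiguous blocks of 1000 rows each, matching the figure. The helper `split_into_blocks` is hypothetical and written only for this example. Ray Data chooses block boundaries itself and stores blocks in the object store rather than in Python lists.

```python
def split_into_blocks(rows, num_blocks):
    """Split rows into contiguous blocks of near-equal size (illustrative only)."""
    base, extra = divmod(len(rows), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks take one additional row each.
        size = base + (1 if i < extra else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks


rows = [{"id": i} for i in range(3_000)]
blocks = split_into_blocks(rows, 3)
block_sizes = [len(block) for block in blocks]
first_row_of_second_block = blocks[1][0]
```

Each block is a contiguous slice of the rows, so the second block starts where the first one ends.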
Ray Data's main abstraction is a :class:`Dataset <ray.data.Dataset>`, which | ||||||
is a distributed data collection. Datasets are designed for machine learning, and they | ||||||
can represent data collections that exceed a single machine's memory. | ||||||
|
||||||
.. _loading_key_concept: | ||||||
|
||||||
|
@@ -137,3 +168,46 @@ or remote filesystems. | |||||
|
||||||
|
||||||
To learn more about saving dataset contents, see :ref:`Saving data <saving-data>`. | ||||||
|
||||||
.. _streaming_execution_model: | ||||||
|
||||||
Streaming Execution Model | ||||||
|
||||||
------------------------- | ||||||
|
||||||
Ray Data uses a streaming execution model to efficiently process large datasets. | ||||||
Rather than materializing the entire dataset in memory at once, | ||||||
Ray Data can process data in a streaming fashion through a pipeline of operations. | ||||||
|
||||||
Here's how it works: | ||||||
|
||||||
.. code-block:: python | ||||||
|
||||||
import ray | ||||||
|
||||||
# Create a dataset with 1K rows | ||||||
ds = ray.data.range(1_000) | ||||||
|
||||||
# Define a pipeline of operations | ||||||
# Each row from ray.data.range() is a dict such as {"id": 0} | ||||||
ds = ds.map(lambda row: {"id": row["id"] * 2}) | ||||||
ds = ds.filter(lambda row: row["id"] % 4 == 0) | ||||||
|
||||||
# Data starts flowing when you call an action like show() | ||||||
ds.show(5) | ||||||
|
||||||
Key benefits of streaming execution include: | ||||||
|
||||||
* **Memory Efficient**: Processes data in chunks rather than loading everything into memory | ||||||
Nitpick: best to stick to "blocks" and not introduce a new term. |
||||||
* **Pipeline Parallelism**: Different stages of the pipeline can execute concurrently | ||||||
* **Automatic Memory Management**: Ray Data automatically spills data to disk if memory pressure is high | ||||||
* **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline | ||||||
https://developers.google.com/style/contractions
Suggested change
|
||||||
|
||||||
The streaming model allows Ray Data to efficiently handle datasets much larger than memory while maintaining high performance through parallel execution. | ||||||
|
||||||
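The streaming idea can be approximated with Python generators. This is a minimal sketch under stated assumptions: the helpers `read_blocks`, `count_blocks`, `map_blocks`, and `filter_blocks` are hypothetical names made up for this example, and generators process one block at a time in a single thread, whereas Ray Data's executor runs pipeline stages concurrently across a cluster.

```python
from typing import Callable, Iterator, List

Block = List[dict]

stats = {"blocks_read": 0}


def read_blocks(num_rows: int, block_size: int) -> Iterator[Block]:
    # Produce one block at a time instead of building the whole dataset.
    for start in range(0, num_rows, block_size):
        stop = min(start + block_size, num_rows)
        yield [{"id": i} for i in range(start, stop)]


def count_blocks(blocks: Iterator[Block]) -> Iterator[Block]:
    # Count how many blocks have been pulled from the source so far.
    for block in blocks:
        stats["blocks_read"] += 1
        yield block


def map_blocks(blocks: Iterator[Block], fn: Callable) -> Iterator[Block]:
    for block in blocks:
        yield [fn(row) for row in block]


def filter_blocks(blocks: Iterator[Block], predicate: Callable) -> Iterator[Block]:
    for block in blocks:
        yield [row for row in block if predicate(row)]


source = count_blocks(read_blocks(1_000, 100))
doubled = map_blocks(source, lambda row: {"id": row["id"] * 2})
pipeline = filter_blocks(doubled, lambda row: row["id"] % 4 == 0)

# Pulling one output block reads only one input block from the source.
first_block = next(pipeline)
blocks_read_after_first = stats["blocks_read"]
```

Only the blocks needed to produce the requested output exist in memory at any point, which is the property that lets a streaming pipeline handle inputs larger than memory.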
.. note:: | ||||||
Operations like :meth:`ds.sort() <ray.data.Dataset.sort>` and :meth:`ds.groupby() <ray.data.Dataset.groupby>` require materializing data, which may impact memory usage for very large datasets. | ||||||
nitpick: best to refer to methods that are decorated as consumption API (e.g. |
||||||
|
||||||
|
||||||
Deep Dive into Ray Data | ||||||
----------------------- | ||||||
|
||||||
To learn more about Ray Data, see :ref:`Ray Data Internals <data_internals>`. |
Nit: perhaps best to state that this is a guide to getting up and running with Ray Data, rather than a page "about the Dataset concept".