
[docs/data] Add Blocks and Streaming Execution to Quickstart #50022

Closed
86 changes: 80 additions & 6 deletions doc/source/data/quickstart.rst
@@ -3,21 +3,52 @@
Quickstart
==========

Learn about :class:`Dataset <ray.data.Dataset>` and the capabilities it provides.
This page introduces the :class:`Dataset <ray.data.Dataset>` concept and the capabilities it provides.
Contributor comment: nit: but perhaps best to state this is a guide to get up and running with Ray Data instead of "about the Dataset concept"


This guide provides a lightweight introduction to:

* :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`
Member comment: According to the style guide the Data docs adhere to, use sentence case for headings and titles: https://developers.google.com/style/capitalization#capitalization-in-titles-and-headings

Suggested change:
- * :ref:`Key Concepts: Datasets and Blocks <data_key_concepts>`
+ * :ref:`Key Concepts: Datasets and blocks <data_key_concepts>`

* :ref:`Loading data <loading_key_concept>`
* :ref:`Transforming data <transforming_key_concept>`
* :ref:`Consuming data <consuming_key_concept>`
* :ref:`Saving data <saving_key_concept>`
* :ref:`Streaming Execution Model <streaming_execution_model>`
Member comment (suggested change):
- * :ref:`Streaming Execution Model <streaming_execution_model>`
+ * :ref:`Streaming execution model <streaming_execution_model>`


Datasets
--------
.. _data_key_concepts:

Key Concepts: Datasets and Blocks
---------------------------------

There are two main concepts in Ray Data:

* :class:`Dataset <ray.data.Dataset>`
* :ref:`Blocks <blocks_key_concept>`

:class:`Dataset <ray.data.Dataset>` is the main user-facing Python API. It represents a
distributed data collection and defines data loading and processing operations. You
typically use the API in this way:

1. Create a Dataset from external storage or in-memory data.
2. Apply transformations to the data.
3. Write the outputs to external storage or feed the outputs to training workers.

The Dataset API is lazy, meaning that operations are not executed until you call an action
like :meth:`~ray.data.Dataset.show`. This allows Ray Data to optimize the execution plan
and execute operations in parallel.
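
A minimal sketch of this pattern (the ``value`` column is illustrative, not part of the API):

.. code-block:: python

import ray

# 1. Create a Dataset from in-memory items (external storage works too).
ds = ray.data.from_items([{"value": i} for i in range(8)])

# 2. Apply a transformation; this only builds the lazy execution plan.
ds = ds.map(lambda row: {"value": row["value"] ** 2})

# 3. Consume the outputs; execution is triggered here.
ds.show(3)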

A *block* is the basic unit of data that Ray Data operates on. A block is a contiguous
subset of rows from a dataset.

The following figure visualizes a dataset with three blocks, each holding 1000 rows.
Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
(which is usually the driver) and stores the blocks as objects in Ray's shared-memory
Contributor comment: maybe link to the driver definition in the glossary? A new Ray user may not know exactly what it is

:ref:`object store <objects-in-ray>`.

.. image:: images/dataset-arch.svg

..
https://docs.google.com/drawings/d/1PmbDvHRfVthme9XD7EYM-LIHPXtHdOfjCbc1SCsM64k/edit
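
To see blocks concretely, you can materialize a dataset and count its blocks. A small sketch (``num_blocks()`` is available on the materialized dataset in recent Ray releases):

.. code-block:: python

import ray

ds = ray.data.range(3_000)

# Materializing executes the plan and pins the blocks in the object store.
materialized = ds.materialize()

# Each block is a contiguous subset of rows; the count depends on parallelism.
print(materialized.num_blocks())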

Ray Data's main abstraction is a :class:`Dataset <ray.data.Dataset>`, which
is a distributed data collection. Datasets are designed for machine learning, and they
can represent data collections that exceed a single machine's memory.

.. _loading_key_concept:

@@ -137,3 +168,46 @@ or remote filesystems.


To learn more about saving dataset contents, see :ref:`Saving data <saving-data>`.

.. _streaming_execution_model:

Streaming Execution Model
Member comment (suggested change):
- Streaming Execution Model
+ Streaming execution model

-------------------------

Ray Data uses a streaming execution model to process large datasets efficiently.
Rather than materializing the entire dataset in memory at once,
Ray Data streams blocks of data through a pipeline of operations.

Here's how it works:

.. code-block:: python

import ray

# Create a dataset with 1K rows; each row is a dict like {"id": 0}
ds = ray.data.range(1_000)

# Define a pipeline of operations (lazy; nothing executes yet)
ds = ds.map(lambda row: {"id": row["id"] * 2})
ds = ds.filter(lambda row: row["id"] % 4 == 0)

# Data starts flowing when you call an action like show()
ds.show(5)

Key benefits of streaming execution include:

* **Memory Efficient**: Processes data in chunks rather than loading everything into memory
Contributor comment: nitpick, but best to stick to "blocks" and not introduce a new term

* **Pipeline Parallelism**: Different stages of the pipeline can execute concurrently
* **Automatic Memory Management**: Ray Data automatically spills data to disk if memory pressure is high
* **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline
Member comment: https://developers.google.com/style/contractions

Suggested change:
- * **Lazy Evaluation**: Transformations are not executed until an action triggers the pipeline
+ * **Lazy Evaluation**: Transformations aren't executed until an action triggers the pipeline


The streaming model allows Ray Data to efficiently handle datasets much larger than memory while maintaining high performance through parallel execution.
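
For instance, streaming consumption with :meth:`~ray.data.Dataset.iter_batches` processes one batch at a time, so the full dataset never needs to fit in memory at once. A sketch (the batch size is arbitrary):

.. code-block:: python

import ray

ds = ray.data.range(100_000).map(lambda row: {"id": row["id"] + 1})

# Batches are yielded as upstream operators produce blocks, so memory
# usage stays bounded even for datasets larger than cluster memory.
for batch in ds.iter_batches(batch_size=1024):
    pass  # process each batch as it streams through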

.. note::
Operations like :meth:`ds.sort() <ray.data.Dataset.sort>` and :meth:`ds.groupby() <ray.data.Dataset.groupby>` require materializing data, which may impact memory usage for very large datasets.
Contributor comment: nitpick: best to refer to methods that are decorated as consumption APIs (e.g. ds.split or ds.split_at_indices) as opposed to all-to-all operations like groupby, which theoretically don't require fully materializing the data



Deep Dive into Ray Data
-----------------------

To learn more about Ray Data, see :ref:`Ray Data Internals <data_internals>`.
2 changes: 1 addition & 1 deletion python/ray/data/dataset.py
@@ -2127,7 +2127,7 @@ def normalize_variety(group: pd.DataFrame) -> pd.DataFrame:

Args:
key: A column name or list of column names.
If this is ``None``, place all rows in a single group.
If this is ``None``, place all rows in a single group.

Returns:
A lazy :class:`~ray.data.grouped_data.GroupedData`.