Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Ballista crate README content #878

Merged
merged 4 commits into from
Aug 15, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 1 addition & 2 deletions ballista/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,7 @@ redundancy in the case of a scheduler failing.

# Getting Started

Fully working examples are available. Refer to the [Ballista Examples README](../ballista-examples/README.md) for
more information.
Refer to the core [Ballista crate README](rust/client/README.md) for the Getting Started guide.

## Distributed Scheduler Overview

Expand Down
100 changes: 98 additions & 2 deletions ballista/rust/client/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,102 @@
under the License.
-->

# Ballista - Rust
# Ballista: Distributed Scheduler for Apache Arrow DataFusion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you considered using #![doc = include_str!("../README.md")] here to keep both versions in sync? we could annotate the example code with rust,no_run to make sure it's skipped in doc test and also renders properly on github.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had no idea about this feature ... thank you for pointing this out. I will update the PR tomorrow with this.


This crate contains the Ballista client library. For an example usage, please refer [here](../benchmarks/tpch/README.md).
Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
Java) to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
data transfer between processes.
- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.

Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
redundancy in the case of a scheduler failing.

## Starting a cluster

There are numerous ways to start a Ballista cluster, including support for Docker and
Kubernetes. For full documentation, refer to the
[DataFusion User Guide](https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide)

A simple way to start a local cluster for testing purposes is to use cargo to install
the scheduler and executor crates.

```bash
cargo install ballista-scheduler
cargo install ballista-executor
```

With these crates installed, it is now possible to start a scheduler process.

```bash
RUST_LOG=info ballista-scheduler
```

The scheduler will bind to port 50050 by default.

Next, start an executor processes in a new terminal session with the specified concurrency
level.

```bash
RUST_LOG=info ballista-executor -c 4
```

The executor will bind to port 50051 by default. Additional executors can be started by
manually specifying a bind port. For example:

```bash
RUST_LOG=info ballista-executor --bind-port 50052 -c 4
```

## Executing a query

Ballista provides a `BallistaContext` as a starting point for creating queries. DataFrames can be created
by invoking the `read_csv`, `read_parquet`, and `sql` methods.

The following example runs a simple aggregate SQL query against a CSV file from the
[New York Taxi and Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
data set.

```rust,no_run
use ballista::prelude::*;
use datafusion::arrow::util::pretty;
use datafusion::prelude::CsvReadOptions;

#[tokio::main]
async fn main() -> Result<()> {
// create configuration
let config = BallistaConfig::builder()
.set("ballista.shuffle.partitions", "4")
.build()?;

// connect to Ballista scheduler
let ctx = BallistaContext::remote("localhost", 50050, &config);

// register csv file with the execution context
ctx.register_csv(
"tripdata",
"/path/to/yellow_tripdata_2020-01.csv",
CsvReadOptions::new(),
)?;

// execute the query
let df = ctx.sql(
"SELECT passenger_count, MIN(fare_amount), MAX(fare_amount), AVG(fare_amount), SUM(fare_amount)
FROM tripdata
GROUP BY passenger_count
ORDER BY passenger_count",
)?;

// collect the results and print them to stdout
let results = df.collect().await?;
pretty::print_batches(&results)?;
Ok(())
}
```
98 changes: 1 addition & 97 deletions ballista/rust/client/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,103 +15,7 @@
// specific language governing permissions and limitations
// under the License.

//! Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
//! DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
//! Java) to be supported as first-class citizens without paying a penalty for serialization costs.
//!
//! The foundational technologies in Ballista are:
//!
//! - [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
//! - [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
//! data transfer between processes.
//! - [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
//! - [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
//!
//! Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
//! case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
//! redundancy in the case of a scheduler failing.
//!
//! ## Starting a cluster
//!
//! There are numerous ways to start a Ballista cluster, including support for Docker and
//! Kubernetes. For full documentation, refer to the
//! [DataFusion User Guide](https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide)
//!
//! A simple way to start a local cluster for testing purposes is to use cargo to install
//! the scheduler and executor crates.
//!
//! ```bash
//! cargo install ballista-scheduler
//! cargo install ballista-executor
//! ```
//!
//! With these crates installed, it is now possible to start a scheduler process.
//!
//! ```bash
//! RUST_LOG=info ballista-scheduler
//! ```
//!
//! The scheduler will bind to port 50050 by default.
//!
//! Next, start an executor processes in a new terminal session with the specified concurrency
//! level.
//!
//! ```bash
//! RUST_LOG=info ballista-executor -c 4
//! ```
//!
//! The executor will bind to port 50051 by default. Additional executors can be started by
//! manually specifying a bind port. For example:
//!
//! ```bash
//! RUST_LOG=info ballista-executor --bind-port 50052 -c 4
//! ```
//!
//! ## Executing a query
//!
//! Ballista provides a `BallistaContext` as a starting point for creating queries. DataFrames can be created
//! by invoking the `read_csv`, `read_parquet`, and `sql` methods.
//!
//! The following example runs a simple aggregate SQL query against a CSV file from the
//! [New York Taxi and Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
//! data set.
//!
//! ```no_run
//! use ballista::prelude::*;
//! use datafusion::arrow::util::pretty;
//! use datafusion::prelude::CsvReadOptions;
//!
//! #[tokio::main]
//! async fn main() -> Result<()> {
//! // create configuration
//! let config = BallistaConfig::builder()
//! .set("ballista.shuffle.partitions", "4")
//! .build()?;
//!
//! // connect to Ballista scheduler
//! let ctx = BallistaContext::remote("localhost", 50050, &config);
//!
//! // register csv file with the execution context
//! ctx.register_csv(
//! "tripdata",
//! "/path/to/yellow_tripdata_2020-01.csv",
//! CsvReadOptions::new(),
//! )?;
//!
//! // execute the query
//! let df = ctx.sql(
//! "SELECT passenger_count, MIN(fare_amount), MAX(fare_amount), AVG(fare_amount), SUM(fare_amount)
//! FROM tripdata
//! GROUP BY passenger_count
//! ORDER BY passenger_count",
//! )?;
//!
//! // collect the results and print them to stdout
//! let results = df.collect().await?;
//! pretty::print_batches(&results)?;
//! Ok(())
//! }
//! ```
#![doc = include_str!("../README.md")]

pub mod columnar_batch;
pub mod context;
Expand Down
6 changes: 4 additions & 2 deletions ballista/rust/core/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@
under the License.
-->

# Ballista - Rust
# Ballista Core Library

This crate contains the core Ballista types.
This crate contains the Ballista core library which is used as a dependency by the `ballista`,
`ballista-scheduler`, and `ballista-executor` crates. Refer to <https://crates.io/crates/ballista> for
general Ballista documentation.
7 changes: 1 addition & 6 deletions ballista/rust/core/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,7 @@
// specific language governing permissions and limitations
// under the License.

//! Ballista Core Library
//!
//! This crate contains the Ballista core library which is used as a dependency by the ballista,
//! ballista-scheduler, and ballista-executor crates. Refer to <https://crates.io/crates/ballista> for
//! general Ballista documentation.

#![doc = include_str!("../README.md")]
#![allow(unused_imports)]
pub const BALLISTA_VERSION: &str = env!("CARGO_PKG_VERSION");

Expand Down
15 changes: 3 additions & 12 deletions ballista/rust/executor/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,16 +17,7 @@
under the License.
-->

# Ballista Executor - Rust
# Ballista Executor Process

This crate contains the Ballista Executor. It can be used both as a library or as a binary.

## Run

```bash
RUST_LOG=info cargo run --release
...
[2021-02-11T05:30:13Z INFO executor] Running with config: ExecutorConfig { host: "localhost", port: 50051, work_dir: "/var/folders/y8/fc61kyjd4n53tn444n72rjrm0000gn/T/.tmpv1LjN0", concurrent_tasks: 4 }
```

By default, the executor will bind to `localhost` and listen on port `50051`.
This crate contains the Ballista executor process. Refer to <https://crates.io/crates/ballista> for
documentation.
5 changes: 1 addition & 4 deletions ballista/rust/executor/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,7 @@
// specific language governing permissions and limitations
// under the License.

//! Ballista Executor Process
//!
//! This crate contains the Ballista executor process. Refer to <https://crates.io/crates/ballista> for
//! documentation.
#![doc = include_str!("../README.md")]

pub mod collect;
pub mod execution_loop;
Expand Down
37 changes: 3 additions & 34 deletions ballista/rust/scheduler/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,38 +17,7 @@
under the License.
-->

# Ballista Scheduler
# Ballista Scheduler Process

This crate contains the Ballista Scheduler. It can be used both as a library or as a binary.

## Run

```bash
$ RUST_LOG=info cargo run --release
...
[2021-02-11T05:29:30Z INFO scheduler] Ballista v0.4.2-SNAPSHOT Scheduler listening on 0.0.0.0:50050
[2021-02-11T05:30:13Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "6d10f5d2-c8c3-4e0f-afdb-1f6ec9171321", host: "localhost", port: 50051 }
```

By default, the scheduler will bind to `localhost` and listen on port `50051`.

## Connecting to Scheduler

Scheduler supports REST model also using content negotiation.
For e.x if you want to get list of executors connected to the scheduler,
you can do (assuming you use default config)

```bash
curl --request GET \
--url http://localhost:50050/executors \
--header 'Accept: application/json'
```

## Scheduler UI

A basic ui for the scheduler is in `ui/scheduler` of the ballista repo.
It can be started using the following [yarn](https://yarnpkg.com/) command

```bash
yarn && yarn start
```
This crate contains the Ballista scheduler process. Refer to <https://crates.io/crates/ballista> for
documentation.
5 changes: 1 addition & 4 deletions ballista/rust/scheduler/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,7 @@
// specific language governing permissions and limitations
// under the License.

//! Ballista Scheduler Process
//!
//! This crate contains the Ballista scheduler process. Refer to <https://crates.io/crates/ballista> for
//! documentation.
#![doc = include_str!("../README.md")]

pub mod api;
pub mod planner;
Expand Down