From f0aede0e789cb420bb684d52986a1bbe55323eff Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 1 Apr 2023 11:17:13 -0400 Subject: [PATCH 1/2] Move content from README.md to docs site --- README.md | 172 +----------------- docs/source/contributor-guide/architecture.md | 26 +++ docs/source/contributor-guide/index.md | 46 ++--- docs/source/index.rst | 17 +- docs/source/user-guide/comparison.md | 33 ++++ docs/source/user-guide/integration.md | 35 ++++ docs/source/user-guide/introduction.md | 2 +- docs/source/user-guide/users.md | 67 +++++++ 8 files changed, 204 insertions(+), 194 deletions(-) create mode 100644 docs/source/contributor-guide/architecture.md create mode 100644 docs/source/user-guide/comparison.md create mode 100644 docs/source/user-guide/integration.md create mode 100644 docs/source/user-guide/users.md diff --git a/README.md b/README.md index 953f08bd451a..c9ca835695fd 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,8 @@ # DataFusion +[![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master) + logo DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in @@ -27,176 +29,8 @@ in-memory format. DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. -[![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master) - -## Features - -- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html) -- Blazingly fast, vectorized, multi-threaded, streaming execution engine. -- Native support for Parquet, CSV, JSON, and Avro file formats. 
Support - for custom file formats and non file datasources via the `TableProvider` trait. -- Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL, - other query languages, custom plan and execution nodes, optimizer passes, and more. -- Streaming, asynchronous IO directly from popular object stores, including AWS S3, - Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the - `ObjectStore` trait. -- [Excellent Documentation](https://docs.rs/datafusion/latest) and a - [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html). -- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations, - automatic join reordering, expression coercion, and more. -- Permissive Apache 2.0 License, Apache Software Foundation governance -- Written in [Rust](https://www.rust-lang.org/), a modern system language with development - productivity similar to Java or Golang, the performance of C++, and - [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted). -- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion - with other projects, and to pass plans across language boundaries. - -## Use Cases - -DataFusion can be used without modification as an embedded SQL -engine or can be customized and used as a foundation for -building new systems. Here are some examples of systems built using DataFusion: - -- Specialized Analytical Database systems such as [CeresDB] and more general Apache Spark like system such a [Ballista]. 
-- New query language engines such as [prql-query] and accelerators such as [VegaFusion] -- Research platform for new Database Systems, such as [Flock] -- SQL support to another library, such as [dask sql] -- Streaming data platforms such as [Synnada] -- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv] -- A faster Spark runtime replacement [Blaze] - -By using DataFusion, the projects are freed to focus on their specific -features, and avoid reimplementing general (but still necessary) -features such as an expression representation, standard optimizations, -execution plans, file format support, etc. - -## Why DataFusion? - -- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast. -- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem -- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case -- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems. - -## Comparisons with other projects - -When compared to similar systems, DataFusion typically is: - -1. Targeted at developers, rather than end users / data scientists. -2. Designed to be embedded, rather than a complete file based SQL system. -3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual. -4. Implemented in `Rust`, rather than `C/C++` - -Here is a comparison with similar projects that may help understand -when DataFusion might be be suitable and unsuitable for your needs: - -- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database. - Like DataFusion, it supports very fast execution, both from its custom file format - and directly from parquet files. 
Unlike DataFusion, it is written in C/C++ and it - is primarily used directly by users as a serverless database and query system rather - than as a library for building such database systems. - -- [Polars](http://pola.rs): Polars is one of the fastest DataFrame - libraries at the time of writing. Like DataFusion, it is also - written in Rust and uses the Apache Arrow memory model, but unlike - DataFusion it does not provide SQL nor as many extension points. - -- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) - is an execution engine. Like DataFusion, Velox aims to - provide a reusable foundation for building database-like systems. Unlike DataFusion, - it is written in C/C++ and does not include a SQL frontend or planning /optimization - framework. - -- [Databend](https://github.com/datafuselabs/databend) is a complete - database system. Like DataFusion it is also written in Rust and - utilizes the Apache Arrow memory model, but unlike DataFusion it - targets end-users rather than developers of other database systems. - -## DataFusion Community Extensions - -There are a number of community projects that extend DataFusion or -provide integrations with other systems. 
- -### Language Bindings - -- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c) -- [datafusion-python](https://github.com/apache/arrow-datafusion-python) -- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) -- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java) - -### Integrations - -- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable) -- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue) - -## Known Uses - -Here are some of the projects known to use DataFusion: - -- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine -- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core -- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database -- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) -- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database -- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust) -- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python -- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion -- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake -- [Flock](https://github.com/flock-lab/flock) -- [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database -- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database -- [Kamu](https://github.com/kamu-data/kamu-cli/) Planet-scale streaming data pipeline -- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform -- [qv](https://github.com/timvw/qv) Quickly view your data -- [ROAPI](https://github.com/roapi/roapi) -- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical 
database -- [Synnada](https://synnada.ai/) Streaming-first framework for data products -- [Tensorbase](https://github.com/tensorbase/tensorbase) -- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar -- [ZincObserve](https://github.com/zinclabs/zincobserve) Distributed cloud native observability platform - -[ballista]: https://github.com/apache/arrow-ballista -[blaze]: https://github.com/blaze-init/blaze -[ceresdb]: https://github.com/CeresDB/ceresdb -[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust -[cnosdb]: https://github.com/cnosdb/cnosdb -[cube store]: https://github.com/cube-js/cube.js/tree/master/rust -[dask sql]: https://github.com/dask-contrib/dask-sql -[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui -[delta-rs]: https://github.com/delta-io/delta-rs -[flock]: https://github.com/flock-lab/flock -[kamu]: https://github.com/kamu-data/kamu-cli -[greptime db]: https://github.com/GreptimeTeam/greptimedb -[influxdb iox]: https://github.com/influxdata/influxdb_iox -[parseable]: https://github.com/parseablehq/parseable -[prql-query]: https://github.com/prql/prql-query -[qv]: https://github.com/timvw/qv -[roapi]: https://github.com/roapi/roapi -[seafowl]: https://github.com/splitgraph/seafowl -[synnada]: https://synnada.ai/ -[tensorbase]: https://github.com/tensorbase/tensorbase -[vegafusion]: https://vegafusion.io/ -[zincobserve]: https://github.com/zinclabs/zincobserve "if you know of another project, please submit a PR to add a link!" +See the Project Website at https://arrow.apache.org/datafusion/ for more details. ## Examples Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion. 
- -## Roadmap - -Please see [Roadmap](docs/source/contributor-guide/roadmap.md) for information of where the project is headed. - -## Architecture Overview - -There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together. - -- (July 2022): DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165) -- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) -- (February 2021): How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) - -## User Guide - -Please see [User Guide](https://arrow.apache.org/datafusion/) for more information about DataFusion. - -## Contributor Guide - -Please see [Contributor Guide](docs/source/contributor-guide/index.md) for information about contributing to DataFusion. diff --git a/docs/source/contributor-guide/architecture.md b/docs/source/contributor-guide/architecture.md new file mode 100644 index 000000000000..3150060ff304 --- /dev/null +++ b/docs/source/contributor-guide/architecture.md @@ -0,0 +1,26 @@ + + +# Architecture + +There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together. 
+ +- (July 2022): DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165) +- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) +- (February 2021): How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) diff --git a/docs/source/contributor-guide/index.md b/docs/source/contributor-guide/index.md index d7172329c251..df1709979b49 100644 --- a/docs/source/contributor-guide/index.md +++ b/docs/source/contributor-guide/index.md @@ -31,7 +31,9 @@ You can find a curated [good-first-issue](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) list to help you get started. -# Pull Requests +# Developer's guide + +## Pull Requests We welcome pull requests (PRs) from anyone from the community. @@ -39,8 +41,6 @@ DataFusion is a very active fast-moving project and we try to review and merge P Review bandwidth is currently our most limited resource, and we highly encourage reviews by the broader community. If you are waiting for your PR to be reviewed, consider helping review other PRs that are waiting. 
Such review both helps the reviewer to learn the codebase and become more expert, as well as helps identify issues in the PR (such as lack of test coverage), that can be addressed and make future reviews faster and more efficient. -## Merging PRs - Since we are a worldwide community, we have contributors in many timezones who review and comment. To ensure anyone who wishes has an opportunity to review a PR, our committers try to ensure that at least 24 hours passes between when a "major" PR is approved and when it is merged. A "major" PR means there is a substantial change in design or a change in the API. Committers apply their best judgment to determine what constitutes a substantial change. A "minor" PR might be merged without a 24 hour delay, again subject to the judgment of the committer. Examples of potential "minor" PRs are: @@ -50,11 +50,11 @@ A "major" PR means there is a substantial change in design or a change in the AP 3. Non-controversial build-related changes (clippy, version upgrades etc.) 4. Smaller non-controversial feature additions -# Developer's guide +## Getting Started This section describes how you can get started at developing DataFusion. -## Windows setup +### Windows setup ```shell wget https://az792536.vo.msecnd.net/vms/VMBuild_20190311/VirtualBox/MSEdge/MSEdge.Win10.VirtualBox.zip @@ -63,7 +63,7 @@ git-bash.exe cargo build ``` -## Protoc Installation +### Protoc Installation Compiling DataFusion from sources requires an installed version of the protobuf compiler, `protoc`. @@ -85,7 +85,7 @@ libprotoc 3.12.4 Alternatively a binary release can be downloaded from the [Release Page](https://github.com/protocolbuffers/protobuf/releases) or [built from source](https://github.com/protocolbuffers/protobuf/blob/main/src/README.md). 
-## Bootstrap environment +### Bootstrap environment DataFusion is written in Rust and it uses a standard rust toolkit: @@ -110,7 +110,7 @@ or run them all at once: - [dev/rust_lint.sh](../../../dev/rust_lint.sh) -## Test Organization +### Test Organization DataFusion has several levels of tests in its [Test Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html) @@ -118,13 +118,13 @@ and tries to follow [Testing Organization](https://doc.rust-lang.org/book/ch11-0 This section highlights the most important test modules that exist -### Unit tests +#### Unit tests Tests for the code in an individual module are defined in the same source file with a `test` module, following Rust convention -### Rust Integration Tests +#### Rust Integration Tests -There are several tests of the public interface of the DataFusion library in the [tests](../../../datafusion/core/tests) directory. +There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests) directory. You can run these tests individually using a command such as @@ -132,18 +132,18 @@ You can run these tests individually using a command such as cargo test -p datafusion --tests sql_integration ``` -One very important test is the [sql_integration](../../../datafusion/core/tests/sql_integration.rs) test which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setups. +One very important test is the [sql_integration](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sql_integration.rs) test which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setups. -### sqllogictests Tests +#### sqllogictests Tests -The [sqllogictests](../../../datafusion/core/tests/sqllogictests) also validate DataFusion SQL against an assortment of data setups. 
+The [sqllogictests](https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests) also validate DataFusion SQL against an assortment of data setups. Data Driven tests have many benefits including being easier to write and maintain. We are in the process of [migrating sql_integration tests](https://github.com/apache/arrow-datafusion/issues/4460) and encourage you to add new tests using sqllogictests if possible. -## Benchmarks +### Benchmarks -### Criterion Benchmarks +#### Criterion Benchmarks [Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within DataFusion. @@ -153,7 +153,7 @@ Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust- cargo bench --bench BENCHMARK_NAME ``` -A full list of benchmarks can be found [here](../../../datafusion/core/benches). +A full list of benchmarks can be found [here](https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/benches). _[cargo-criterion](https://github.com/bheisler/cargo-criterion) may also be used for more advanced reporting._ @@ -171,13 +171,15 @@ If the environment variable `PARQUET_FILE` is set, the benchmark will run querie The benchmark will automatically remove any generated parquet file on exit, however, if interrupted (e.g. by CTRL+C) it will not. This can be useful for analysing the particular file after the fact, or preserving it to use with `PARQUET_FILE` in subsequent runs. -### Upstream Benchmark Suites +#### Upstream Benchmark Suites -Instructions and tooling for running upstream benchmark suites against DataFusion can be found in [benchmarks](../../../benchmarks). 
+Instructions and tooling for running upstream benchmark suites against DataFusion can be found in [benchmarks](https://github.com/apache/arrow-datafusion/tree/main/benchmarks). These are valuable for comparative evaluation against alternative Arrow implementations and query engines. -## How to add a new scalar function +## HOWTOs + +### How to add a new scalar function Below is a checklist of what you need to do to add a new scalar function to DataFusion: @@ -197,7 +199,7 @@ Below is a checklist of what you need to do to add a new scalar function to Data - In [expr/src/expr_fn.rs](../../../datafusion/expr/src/expr_fn.rs), add: - a new entry of the `unary_scalar_expr!` macro for the new function. -## How to add a new aggregate function +### How to add a new aggregate function Below is a checklist of what you need to do to add a new aggregate function to DataFusion: @@ -215,7 +217,7 @@ Below is a checklist of what you need to do to add a new aggregate function to D - tests to the function. - In [tests/sql](../../../datafusion/core/tests/sql), add a new test where the function is called through SQL against well known data and returns the expected result. -## How to display plans graphically +### How to display plans graphically The query plans represented by `LogicalPlan` nodes can be graphically rendered using [Graphviz](https://www.graphviz.org/). diff --git a/docs/source/index.rst b/docs/source/index.rst index 57290d5a26a1..07d261c5c414 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -15,12 +15,21 @@ .. specific language governing permissions and limitations .. under the License. +.. image:: _static/images/DataFusion-Logo-Background-White.png + :alt: DataFusion Logo + ======================= Apache Arrow DataFusion ======================= -Table of Contents -================= +DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in +`Rust `_, using the `Apache Arrow `_ +in-memory format. 
+ +DataFusion offers SQL and Dataframe APIs, excellent +`performance `_, built-in support for +CSV, Parquet, JSON, and Avro, extensive customization, and a great +community. .. _toc.guide: @@ -30,6 +39,9 @@ Table of Contents user-guide/introduction user-guide/example-usage + user-guide/users + user-guide/comparison + user-guide/integration user-guide/library user-guide/cli user-guide/dataframe @@ -47,6 +59,7 @@ Table of Contents contributor-guide/index contributor-guide/communication + contributor-guide/architecture contributor-guide/roadmap contributor-guide/quarterly_roadmap contributor-guide/specification/index diff --git a/docs/source/user-guide/comparison.md b/docs/source/user-guide/comparison.md new file mode 100644 index 000000000000..3fb9c3b6a345 --- /dev/null +++ b/docs/source/user-guide/comparison.md @@ -0,0 +1,33 @@ +# Comparisons to Other Projects + +When compared to similar systems, DataFusion typically is: + +1. Targeted at developers, rather than end users / data scientists. +2. Designed to be embedded, rather than a complete file based SQL system. +3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual. +4. Implemented in `Rust`, rather than `C/C++` + +Here is a comparison with similar projects that may help understand +when DataFusion might be suitable and unsuitable for your needs: + +- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database. + Like DataFusion, it supports very fast execution, both from its custom file format + and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it + is primarily used directly by users as a serverless database and query system rather + than as a library for building such database systems. + +- [Polars](http://pola.rs): Polars is one of the fastest DataFrame + libraries at the time of writing. 
Like DataFusion, it is also + written in Rust and uses the Apache Arrow memory model, but unlike + DataFusion it does not provide SQL nor as many extension points. + +- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) + is an execution engine. Like DataFusion, Velox aims to + provide a reusable foundation for building database-like systems. Unlike DataFusion, + it is written in C/C++ and does not include a SQL frontend or planning /optimization + framework. + +- [Databend](https://github.com/datafuselabs/databend) is a complete + database system. Like DataFusion it is also written in Rust and + utilizes the Apache Arrow memory model, but unlike DataFusion it + targets end-users rather than developers of other database systems. diff --git a/docs/source/user-guide/integration.md b/docs/source/user-guide/integration.md new file mode 100644 index 000000000000..bffa6b189390 --- /dev/null +++ b/docs/source/user-guide/integration.md @@ -0,0 +1,35 @@ + + +# Integrations and Extensions + +There are a number of community projects that extend DataFusion or +provide integrations with other systems. + +## Language Bindings + +- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c) +- [datafusion-python](https://github.com/apache/arrow-datafusion-python) +- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) +- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java) + +## Integrations + +- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable) +- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue) diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md index 55fc59b32047..f906eac78c13 100644 --- a/docs/source/user-guide/introduction.md +++ b/docs/source/user-guide/introduction.md @@ -17,7 +17,7 @@ under the License. 
--> -# Introduction +# Features and Use Cases DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in [Rust](http://rustlang.org), diff --git a/docs/source/user-guide/users.md b/docs/source/user-guide/users.md new file mode 100644 index 000000000000..0d259c8de3e2 --- /dev/null +++ b/docs/source/user-guide/users.md @@ -0,0 +1,67 @@ + + +# Known Users + +Here are some of the projects known to use DataFusion: + +- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine +- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core +- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database +- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) +- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database +- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust) +- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python +- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion +- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake +- [Flock](https://github.com/flock-lab/flock) +- [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database +- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database +- [Kamu](https://github.com/kamu-data/kamu-cli/) Planet-scale streaming data pipeline +- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform +- [qv](https://github.com/timvw/qv) Quickly view your data +- [ROAPI](https://github.com/roapi/roapi) +- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical database +- [Synnada](https://synnada.ai/) Streaming-first framework for data products +- [Tensorbase](https://github.com/tensorbase/tensorbase) +- 
[VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar +- [ZincObserve](https://github.com/zinclabs/zincobserve) Distributed cloud native observability platform + +[ballista]: https://github.com/apache/arrow-ballista +[blaze]: https://github.com/blaze-init/blaze +[ceresdb]: https://github.com/CeresDB/ceresdb +[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust +[cnosdb]: https://github.com/cnosdb/cnosdb +[cube store]: https://github.com/cube-js/cube.js/tree/master/rust +[dask sql]: https://github.com/dask-contrib/dask-sql +[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui +[delta-rs]: https://github.com/delta-io/delta-rs +[flock]: https://github.com/flock-lab/flock +[kamu]: https://github.com/kamu-data/kamu-cli +[greptime db]: https://github.com/GreptimeTeam/greptimedb +[influxdb iox]: https://github.com/influxdata/influxdb_iox +[parseable]: https://github.com/parseablehq/parseable +[prql-query]: https://github.com/prql/prql-query +[qv]: https://github.com/timvw/qv +[roapi]: https://github.com/roapi/roapi +[seafowl]: https://github.com/splitgraph/seafowl +[synnada]: https://synnada.ai/ +[tensorbase]: https://github.com/tensorbase/tensorbase +[vegafusion]: https://vegafusion.io/ +[zincobserve]: https://github.com/zinclabs/zincobserve "if you know of another project, please submit a PR to add a link!" From 3f5af7c4457610ec94683da3033702b53a6a56cf Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 1 Apr 2023 11:33:46 -0400 Subject: [PATCH 2/2] RAT --- docs/source/index.rst | 1 + docs/source/user-guide/comparison.md | 19 +++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/docs/source/index.rst b/docs/source/index.rst index 07d261c5c414..09071a751139 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -63,5 +63,6 @@ community. 
contributor-guide/roadmap contributor-guide/quarterly_roadmap contributor-guide/specification/index + Github Issue tracker Code of conduct diff --git a/docs/source/user-guide/comparison.md b/docs/source/user-guide/comparison.md index 3fb9c3b6a345..2cb13f326afb 100644 --- a/docs/source/user-guide/comparison.md +++ b/docs/source/user-guide/comparison.md @@ -1,3 +1,22 @@ + + # Comparisons to Other Projects When compared to similar systems, DataFusion typically is: