We welcome and encourage contributions of all kinds, such as:

2. Documentation improvements
3. Code (PR or PR Review)

In addition to submitting new PRs, we have a healthy tradition of community members helping review each other's PRs. Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.

DataFusion has several levels of tests and tries to follow the [Testing Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) guidance in The Book.

This section highlights the most important test modules that exist.

### Unit tests

Tests for the code in an individual module are defined in the same source file within a `tests` module, following the Rust convention.
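
For example, a unit test for a private helper might look like the following minimal sketch (the module and function names are illustrative):

```rust
// Illustrative only: a private helper and its co-located unit test.
fn add_one(x: i64) -> i64 {
    x + 1
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn add_one_increments() {
        assert_eq!(add_one(41), 42);
    }
}
```
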
### Rust Integration Tests

There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/arrow-datafusion/blob/master/datafusion/tests) directory.

You can run these tests individually using a command such as

```shell
cargo test -p datafusion --test sql_integration
```

One very important test is the [sql_integration](https://github.com/apache/arrow-datafusion/blob/master/datafusion/tests/sql_integration.rs) test, which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setups.

### SQL / Postgres Integration Tests

The [integration-tests](https://github.com/apache/arrow-datafusion/blob/master/datafusion/integration-tests) directory contains a harness that runs certain queries against both Postgres and DataFusion and compares the results.

### Criterion Benchmarks

[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code paths. In particular, the Criterion benchmarks help both to guide optimisation efforts and to prevent performance regressions within DataFusion.

Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust-lang.org/cargo/commands/cargo-bench.html), and a given benchmark can be run with

```
cargo bench --bench BENCHMARK_NAME
```

A full list of benchmarks can be found [here](./datafusion/benches).

_[cargo-criterion](https://github.com/bheisler/cargo-criterion) may also be used for more advanced reporting._

#### Parquet SQL Benchmarks

The parquet SQL benchmarks can be run with

```
cargo bench --bench parquet_query_sql
```

These benchmarks randomly generate a parquet file and then benchmark queries sourced from [parquet_query_sql.sql](./datafusion/core/benches/parquet_query_sql.sql) against it. This can therefore be a quick way to add coverage of particular query and/or data paths.

If the environment variable `PARQUET_FILE` is set, the benchmark will run queries against this file instead of a randomly generated one. This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset.

The benchmark will automatically remove any generated parquet file on exit; however, if interrupted (e.g. by CTRL+C) it will not. This can be useful for analysing the particular file after the fact, or for preserving it to use with `PARQUET_FILE` in subsequent runs.
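
As an illustration, pinning the benchmark to a fixed dataset across runs might look like the following (the file path is hypothetical):

```shell
# Benchmark against a pre-existing parquet file instead of a randomly
# generated one, so repeated runs see identical source data.
PARQUET_FILE=/tmp/my_data.parquet cargo bench --bench parquet_query_sql
```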

### Upstream Benchmark Suites

Instructions and tooling for running upstream benchmark suites against DataFusion and/or Ballista can be found in [benchmarks](./benchmarks).

These are valuable for comparative evaluation against alternative Arrow implementations and query engines.

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add the actual implementation of the function (see the sketch after this list):
  - [here](datafusion/physical-expr/src/string_expressions.rs) for string functions
  - [here](datafusion/physical-expr/src/math_expressions.rs) for math functions
  - [here](datafusion/physical-expr/src/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/physical-expr/src) for other functions
- In [core/src/physical_plan](datafusion/core/src/physical_plan/functions.rs), add:
  - a new variant to `BuiltinScalarFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_physical_expr`/`create_physical_fun` mapping the built-in to the implementation
  - tests for the function
- In [core/tests/sql](datafusion/core/tests/sql), add a new test where the function is called through SQL against well-known data and returns the expected result.
- In [core/src/logical_plan/expr](datafusion/core/src/logical_plan/expr.rs), add:
  - a new entry of the `unary_scalar_expr!` macro for the new function
- In [core/src/logical_plan/mod](datafusion/core/src/logical_plan/mod.rs), add:
  - a new entry in the `pub use expr::{}` set
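
To make the first step concrete, here is a minimal sketch of the shape such a kernel can take. It is illustrative only: the function name `my_upper` is hypothetical, and the real kernels in `string_expressions.rs` use DataFusion's `Result`/error types and handle more input shapes.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray};

/// Illustrative kernel only: upper-cases every element of a string array.
/// Real DataFusion kernels return DataFusion's `Result` and validate inputs
/// instead of panicking.
fn my_upper(args: &[ArrayRef]) -> ArrayRef {
    let input = args[0]
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("my_upper expects a single StringArray argument");

    // Nulls are preserved: `iter()` yields an `Option<&str>` per row.
    let output: StringArray = input
        .iter()
        .map(|value| value.map(str::to_uppercase))
        .collect();

    Arc::new(output)
}
```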

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the actual implementation of an `Accumulator` and `AggregateExpr` (see the sketch after this list):
  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
  - a new variant to `BuiltinAggregateFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
  - tests for the function
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well-known data and returns the expected result.
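
To illustrate the accumulator concept independently of the exact trait, the sketch below shows the running state behind an AVG-style aggregate. The struct and method names are hypothetical; DataFusion's real `Accumulator` trait works in terms of Arrow arrays and `ScalarValue`s, and serializes its state so partial aggregates can be merged across partitions.

```rust
/// Illustrative only: the running state behind an AVG-style aggregate.
#[derive(Debug, Default)]
struct AvgState {
    sum: f64,
    count: u64,
}

impl AvgState {
    /// Fold one non-null input value into the running state.
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    /// Merge a partial state produced by another partition.
    fn merge(&mut self, other: &AvgState) {
        self.sum += other.sum;
        self.count += other.count;
    }

    /// Produce the final value, or `None` if no rows were seen.
    fn evaluate(&self) -> Option<f64> {
        (self.count > 0).then(|| self.sum / self.count as f64)
    }
}
```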

## How to display plans graphically

The query plans represented by `LogicalPlan` nodes can be graphically rendered using [Graphviz](http://www.graphviz.org/).

To do so, save the output of the `display_graphviz` function to a file:

```rust
use std::fs::File;
use std::io::Write;

// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;
```

Then, use the `dot` command line tool to render it into a file that can be displayed. For example, the following command creates a `/tmp/plan.pdf` file:

```bash
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```

### Configuration

The scheduler and executor processes can be configured using toml files, environment variables, and command-line arguments. The specification for config options can be found in the config spec files in the repository; those files fully define Ballista's configuration. If there is a discrepancy between this documentation and those files, assume the files are correct.

To get a list of command-line arguments, run the binary with `--help`.

There is an example config file at [ballista/executor/examples/example_executor_config.toml](ballista/executor/examples/example_executor_config.toml).

The order of precedence for arguments is: default config file < environment variables < specified config file < command line arguments.

The executor and scheduler will look for the default config file at `/etc/ballista/[executor|scheduler].toml`. To specify a config file, use the `--config-file` argument.

Environment variables are prefixed by `BALLISTA_EXECUTOR` or `BALLISTA_SCHEDULER` for the executor and scheduler respectively. Hyphens in command-line arguments become underscores. For example, the `--scheduler-host` argument for the executor becomes `BALLISTA_EXECUTOR_SCHEDULER_HOST`.
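
As a concrete illustration of these mechanisms, the same option can be supplied in several ways. The commands below are a sketch: they assume the executor binary is invoked as `ballista-executor`, and the config file path is hypothetical.

```shell
# Via environment variable
BALLISTA_EXECUTOR_SCHEDULER_HOST=localhost ballista-executor

# Via an explicitly specified config file (takes precedence over env vars)
ballista-executor --config-file /path/to/executor.toml

# Via a command-line argument (highest precedence)
ballista-executor --scheduler-host localhost
```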

### Python Environment

Refer to the instructions in the Python Bindings [README](./python/README.md).

### Javascript Environment

Refer to the instructions in the Scheduler Web UI [README](./ballista/scheduler/ui/README.md).

## Integration Tests

The integration tests can be executed by running the following command from the root of the repository.

```bash
./dev/integration-tests.sh
```

## Specification

We formalize DataFusion semantics and behaviors through specification documents. These specifications are useful as references to help resolve ambiguities during development or code reviews.

You are also welcome to propose changes to existing specifications or create new specifications as you see fit.

Here is the list of currently active specifications:

- [Output field name semantic](https://arrow.apache.org/datafusion/specification/output-field-name-semantic.html)
0 commit comments