[BREAKING] new design of select, transform and combine #2214

Merged
merged 32 commits into from
May 5, 2020
Changes from 15 commits
32 commits
55031d7
implement AbstractDataFrame functionality
bkamins Apr 27, 2020
55bde12
preparation in grouping, rename to _mutate in non-grouping
bkamins Apr 27, 2020
2f81c63
tentative rework of _combine that should be able to support select an…
bkamins Apr 27, 2020
fd951c5
continue grouping
bkamins Apr 28, 2020
eb9ace9
implement select, transform, select! and transform! for GroupedDataFr…
bkamins Apr 28, 2020
6908ee8
update DataFrame constructor
bkamins Apr 28, 2020
7b644dd
fix handling of aggregates
bkamins Apr 28, 2020
2753235
code cleanup
bkamins Apr 28, 2020
2a03190
improve canonical check + start rewriting tests
bkamins Apr 28, 2020
7b86eb8
allow changing sort order of groups in cannonical test
bkamins Apr 28, 2020
cb94903
make old tests pass
bkamins Apr 29, 2020
908d489
Merge branch 'master' into improve_selection
bkamins Apr 29, 2020
384c0b1
change error thrown on Julia 1.0
bkamins Apr 29, 2020
ea574c4
done tests of combine
bkamins Apr 29, 2020
8977017
finish tests and documentation
bkamins Apr 29, 2020
d51f3f8
updates after review comments
bkamins Apr 30, 2020
ef461e6
Apply suggestions from code review
bkamins May 1, 2020
245714d
fixes after code review
bkamins May 1, 2020
2bd31ff
add deprecated map tests
bkamins May 1, 2020
9d1b20d
fix error types in select
bkamins May 1, 2020
0f3d309
avoid computing idx, starts and ends in combine if regroup=true
bkamins May 1, 2020
1d69fa3
performance improvements
bkamins May 1, 2020
5713194
@simd did not improve the performance here
bkamins May 1, 2020
1f34d55
Update docs/src/man/split_apply_combine.md
bkamins May 1, 2020
2201789
add an example of passing function as a first argument to combine
bkamins May 1, 2020
2aa9170
change regroup to ungroup
bkamins May 2, 2020
cf4736c
Merge branch 'master' into improve_selection
bkamins May 5, 2020
333cca2
Apply suggestions from code review
bkamins May 5, 2020
334aba0
Merge remote-tracking branch 'origin/improve_selection' into improve_…
bkamins May 5, 2020
10b9474
update docs
bkamins May 5, 2020
792b57d
improve description of what gets returned in combine and select
bkamins May 5, 2020
f34873c
fix repeated code
bkamins May 5, 2020
30 changes: 28 additions & 2 deletions docs/src/man/getting_started.md
@@ -792,7 +792,9 @@ julia> mean(df.A)
2.5
```

We can also apply a function to each column of a `DataFrame` using `select`. For example:
We can also apply a function to each column of a `DataFrame` using `select`.
`select` always returns the same number of rows in the result as the source
data frame. For example:
```jldoctest dataframe
julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
4×2 DataFrame
@@ -805,13 +807,37 @@ julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
│ 4 │ 4 │ 1.0 │

julia> select(df, names(df) .=> sum)
1×2 DataFrame
4×2 DataFrame
│ Row │ A_sum │ B_sum │
│ │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 10 │ 10.0 │
│ 2 │ 10 │ 10.0 │
│ 3 │ 10 │ 10.0 │
│ 4 │ 10 │ 10.0 │

julia> select(df, names(df) .=> sum, names(df) .=> prod)
4×4 DataFrame
│ Row │ A_sum │ B_sum │ A_prod │ B_prod │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼────────┼─────────┤
│ 1 │ 10 │ 10.0 │ 24 │ 24.0 │
│ 2 │ 10 │ 10.0 │ 24 │ 24.0 │
│ 3 │ 10 │ 10.0 │ 24 │ 24.0 │
│ 4 │ 10 │ 10.0 │ 24 │ 24.0 │
```

If instead you prefer to get a result collapsed to the number of rows returned
by the applied functions, use the `combine` function:
```
julia> combine(df, names(df) .=> sum)
1×2 DataFrame
│ Row │ A_sum │ B_sum │
│ │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 10 │ 10.0 │

julia> combine(df, names(df) .=> sum, names(df) .=> prod)
1×4 DataFrame
│ Row │ A_sum │ B_sum │ A_prod │ B_prod │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
148 changes: 128 additions & 20 deletions docs/src/man/split_apply_combine.md
@@ -6,10 +6,24 @@ framework for handling this sort of computation is described in the paper
"[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)",
written by Hadley Wickham.

The DataFrames package supports the split-apply-combine strategy through the `by`
function, which is a shorthand for `groupby` followed by `map` and/or `combine`.
`by` takes in three arguments: (1) a `DataFrame`, (2) one or more columns to split
the `DataFrame` on, and (3) a specification of one or more functions to apply to
The DataFrames package supports the split-apply-combine strategy through the
`combine`, `select`/`select!` and `transform`/`transform!` functions.

In order to perform operations by groups you first need to create a `GroupedDataFrame`
object from your data frame using the `groupby` function, which takes two arguments:
(1) a data frame to be grouped, and (2) a set of columns to group by.

The differences between the above functions are the following:
* `select`: return a data frame with the number and order of rows exactly the same
as the source, preserve only columns that have been calculated;
* `transform`: return a data frame with the number and order of rows exactly the same
as the source, preserve all columns from the source and columns that have been calculated;
* `select!`: is an in-place version of `select`;
* `transform!`: is an in-place version of `transform`;
* `combine`: does not put restrictions on the number of rows returned; the order of rows
is specified by the order of groups in the `GroupedDataFrame`.
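
The differences listed above can be seen on a small grouped data frame. The following is a minimal sketch rather than part of the documented examples: the data frame `df` and its columns `g` and `x` are hypothetical, and the column name `x_mean` assumes the default `source_function` naming shown elsewhere in this document.

```julia
using DataFrames, Statistics

# Hypothetical toy data (not from the manual): two groups of unequal size
df = DataFrame(g = ["a", "b", "a", "b", "a"], x = [1, 2, 3, 4, 8])
gdf = groupby(df, :g)

# select: same number and order of rows as the source;
# only the grouping column and the computed column are kept
s = select(gdf, :x => mean)

# transform: same rows as the source; source columns are kept
# and the computed column is added
t = transform(gdf, :x => mean)

# combine: no restriction on the number of rows; here `mean` returns
# a single value per group, so the result has one row per group
c = combine(gdf, :x => mean)

println(size(s))  # (5, 2)
println(size(t))  # (5, 3)
println(size(c))  # (2, 2)
```

In `s` and `t` the per-group mean is repeated for every row of its group, whereas `c` holds it once per group.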

All these functions take a specification of one or more functions to apply to
each subset of the `DataFrame`. This specification can be of the following forms:
1. standard column selectors (integers, symbols, vectors of integers, vectors of symbols,
`All`, `:`, `Between`, `Not` and regular expressions)
@@ -27,19 +41,20 @@ each subset of the `DataFrame`. This specification can be of the following forms
number of columns are processed (in which case `SubDataFrame` avoids excessive
compilation)

All forms except 1 and 6 can be also passed as the first argument to `map`.

As a special rule that applies to `cols => function` syntax, if `cols` is wrapped
in an `AsTable` object then a `NamedTuple` containing columns selected by `cols` is
passed to `function`.

In all of these cases, `function` can return either a single row or multiple rows.
`function` can always generate a single column by returning a single value or a vector.
Additionally, if `by` is passed exactly one `function` and `target_col` is not specified,
Additionally, if `combine` is passed exactly one `function` as the first argument
and `target_col` is not specified,
`function` can return multiple columns in the form of an `AbstractDataFrame`,
`AbstractMatrix`, `NamedTuple` or `DataFrameRow`.

Here are the rules specifying the shape of the resulting `DataFrame`:
Here are the rules specifying the shape of the resulting `DataFrame` in `combine`
(in `select`/`select!` and `transform`/`transform!` the result has the number
and order of rows equal to the source):
- a single value produces a single row and column per group
- a named tuple or `DataFrameRow` produces a single row and one column per field
- a vector produces a single column with one row per entry
@@ -87,7 +102,51 @@ julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/
│ 149 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │
│ 150 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │

julia> by(iris, :Species, :PetalLength => mean)
julia> gdf = groupby(iris, :Species)
GroupedDataFrame with 3 groups based on key: Species
First Group (50 rows): Species = "Iris-setosa"
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ Iris-setosa │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ Iris-setosa │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ Iris-setosa │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ Iris-setosa │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ Iris-setosa │
│ 7 │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ Iris-setosa │
│ 43 │ 4.4 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │
│ 44 │ 5.0 │ 3.5 │ 1.6 │ 0.6 │ Iris-setosa │
│ 45 │ 5.1 │ 3.8 │ 1.9 │ 0.4 │ Iris-setosa │
│ 46 │ 4.8 │ 3.0 │ 1.4 │ 0.3 │ Iris-setosa │
│ 47 │ 5.1 │ 3.8 │ 1.6 │ 0.2 │ Iris-setosa │
│ 48 │ 4.6 │ 3.2 │ 1.4 │ 0.2 │ Iris-setosa │
│ 49 │ 5.3 │ 3.7 │ 1.5 │ 0.2 │ Iris-setosa │
│ 50 │ 5.0 │ 3.3 │ 1.4 │ 0.2 │ Iris-setosa │
Last Group (50 rows): Species = "Iris-virginica"
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │
├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┤
│ 1 │ 6.3 │ 3.3 │ 6.0 │ 2.5 │ Iris-virginica │
│ 2 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ Iris-virginica │
│ 3 │ 7.1 │ 3.0 │ 5.9 │ 2.1 │ Iris-virginica │
│ 4 │ 6.3 │ 2.9 │ 5.6 │ 1.8 │ Iris-virginica │
│ 5 │ 6.5 │ 3.0 │ 5.8 │ 2.2 │ Iris-virginica │
│ 6 │ 7.6 │ 3.0 │ 6.6 │ 2.1 │ Iris-virginica │
│ 7 │ 4.9 │ 2.5 │ 4.5 │ 1.7 │ Iris-virginica │
│ 43 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ Iris-virginica │
│ 44 │ 6.8 │ 3.2 │ 5.9 │ 2.3 │ Iris-virginica │
│ 45 │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ Iris-virginica │
│ 46 │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ Iris-virginica │
│ 47 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ Iris-virginica │
│ 48 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ Iris-virginica │
│ 49 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │
│ 50 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │

julia> combine(gdf, :PetalLength => mean)
3×2 DataFrame
│ Row │ Species │ PetalLength_mean │
│ │ String │ Float64 │
@@ -96,7 +155,7 @@ julia> by(iris, :Species, :PetalLength => mean)
│ 2 │ Iris-versicolor │ 4.26 │
│ 3 │ Iris-virginica │ 5.552 │

julia> by(iris, :Species, nrow)
julia> combine(gdf, nrow)
3×2 DataFrame
│ Row │ Species │ nrow │
│ │ String │ Int64 │
@@ -105,7 +164,7 @@ julia> by(iris, :Species, nrow)
│ 2 │ Iris-versicolor │ 50 │
│ 3 │ Iris-virginica │ 50 │

julia> by(iris, :Species, nrow, :PetalLength => mean => :mean)
julia> combine(gdf, nrow, :PetalLength => mean => :mean)
3×3 DataFrame
│ Row │ Species │ nrow │ mean │
│ │ String │ Int64 │ Float64 │
@@ -114,9 +173,8 @@ julia> by(iris, :Species, nrow, :PetalLength => mean => :mean)
│ 2 │ Iris-versicolor │ 50 │ 4.26 │
│ 3 │ Iris-virginica │ 50 │ 5.552 │

julia> by(iris, :Species,
[:PetalLength, :SepalLength] =>
(p, s) -> (a=mean(p)/mean(s), b=sum(p))) # multiple columns are passed as arguments
julia> combine([:PetalLength, :SepalLength] => (p, s) -> (a=mean(p)/mean(s), b=sum(p)),
gdf) # multiple columns are passed as arguments
3×3 DataFrame
│ Row │ Species │ a │ b │
│ │ String │ Float64 │ Float64 │
@@ -125,9 +183,9 @@ julia> by(iris, :Species,
│ 2 │ Iris-versicolor │ 0.717655 │ 213.0 │
│ 3 │ Iris-virginica │ 0.842744 │ 277.6 │

julia> by(iris, :Species,
AsTable([:PetalLength, :SepalLength]) =>
x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
julia> combine(gdf,
AsTable([:PetalLength, :SepalLength]) =>
x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
3×2 DataFrame
│ Row │ Species │ PetalLength_SepalLength_function │
│ │ String │ Float64 │
@@ -136,7 +194,7 @@ julia> by(iris, :Species,
│ 2 │ Iris-versicolor │ 0.910378 │
│ 3 │ Iris-virginica │ 0.867923 │

julia> by(iris, :Species, 1:2 => cor, nrow)
julia> combine(gdf, 1:2 => cor, nrow)
3×3 DataFrame
│ Row │ Species │ SepalLength_SepalWidth_cor │ nrow │
│ │ String │ Float64 │ Int64 │
@@ -147,11 +205,61 @@ julia> by(iris, :Species, 1:2 => cor, nrow)

```

The `by` function also supports the `do` block form. However, as noted above,
If we use `select` or `transform` instead of `combine`, the result always has the
same number and order of rows as the source. In the example below
the return values in columns `:SepalLength_SepalWidth_cor` and `:nrow` are
broadcasted to match the number of elements in each group:
```
julia> select(gdf, 1:2 => cor, nrow)
150×3 DataFrame
│ Row │ Species │ SepalLength_SepalWidth_cor │ nrow │
│ │ String │ Float64 │ Int64 │
├─────┼────────────────┼────────────────────────────┼───────┤
│ 1 │ Iris-setosa │ 0.74678 │ 50 │
│ 2 │ Iris-setosa │ 0.74678 │ 50 │
│ 3 │ Iris-setosa │ 0.74678 │ 50 │
│ 4 │ Iris-setosa │ 0.74678 │ 50 │
│ 5 │ Iris-setosa │ 0.74678 │ 50 │
│ 6 │ Iris-setosa │ 0.74678 │ 50 │
│ 7 │ Iris-setosa │ 0.74678 │ 50 │
│ 143 │ Iris-virginica │ 0.457228 │ 50 │
│ 144 │ Iris-virginica │ 0.457228 │ 50 │
│ 145 │ Iris-virginica │ 0.457228 │ 50 │
│ 146 │ Iris-virginica │ 0.457228 │ 50 │
│ 147 │ Iris-virginica │ 0.457228 │ 50 │
│ 148 │ Iris-virginica │ 0.457228 │ 50 │
│ 149 │ Iris-virginica │ 0.457228 │ 50 │
│ 150 │ Iris-virginica │ 0.457228 │ 50 │

julia> transform(gdf, nrow)
150×6 DataFrame
│ Row │ Species │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ nrow │
│ │ String │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │
├─────┼────────────────┼─────────────┼────────────┼─────────────┼────────────┼───────┤
│ 1 │ Iris-setosa │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ 50 │
│ 2 │ Iris-setosa │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ 50 │
│ 3 │ Iris-setosa │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ 50 │
│ 4 │ Iris-setosa │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ 50 │
│ 5 │ Iris-setosa │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ 50 │
│ 6 │ Iris-setosa │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ 50 │
│ 7 │ Iris-setosa │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ 50 │
│ 143 │ Iris-virginica │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ 50 │
│ 144 │ Iris-virginica │ 6.8 │ 3.2 │ 5.9 │ 2.3 │ 50 │
│ 145 │ Iris-virginica │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ 50 │
│ 146 │ Iris-virginica │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ 50 │
│ 147 │ Iris-virginica │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ 50 │
│ 148 │ Iris-virginica │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ 50 │
│ 149 │ Iris-virginica │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ 50 │
│ 150 │ Iris-virginica │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ 50 │
```

The `combine` function also supports the `do` block form. However, as noted above,
this form is slow and should therefore be avoided when performance matters.

```jldoctest sac
julia> by(iris, :Species) do df
julia> combine(gdf) do df
(m = mean(df.PetalLength), s² = var(df.PetalLength))
end
3×3 DataFrame