[BREAKING] new design of select, transform and combine #2214

Merged
merged 32 commits into from
May 5, 2020
Commits
55031d7
implement AbstractDataFrame functionality
bkamins Apr 27, 2020
55bde12
preparation in grouping, rename to _mutate in non-grouping
bkamins Apr 27, 2020
2f81c63
tentative rework of _combine that should be able to support select an…
bkamins Apr 27, 2020
fd951c5
continue grouping
bkamins Apr 28, 2020
eb9ace9
implement select, transform, select! and transform! for GroupedDataFr…
bkamins Apr 28, 2020
6908ee8
update DataFrame constructor
bkamins Apr 28, 2020
7b644dd
fix handling of aggregates
bkamins Apr 28, 2020
2753235
code cleanup
bkamins Apr 28, 2020
2a03190
improve canonical check + start rewriting tests
bkamins Apr 28, 2020
7b86eb8
allow changing sort order of groups in cannonical test
bkamins Apr 28, 2020
cb94903
make old tests pass
bkamins Apr 29, 2020
908d489
Merge branch 'master' into improve_selection
bkamins Apr 29, 2020
384c0b1
change error thrown on Julia 1.0
bkamins Apr 29, 2020
ea574c4
done tests of combine
bkamins Apr 29, 2020
8977017
finish tests and documentation
bkamins Apr 29, 2020
d51f3f8
updates after review comments
bkamins Apr 30, 2020
ef461e6
Apply suggestions from code review
bkamins May 1, 2020
245714d
fixes after code review
bkamins May 1, 2020
2bd31ff
add deprecated map tests
bkamins May 1, 2020
9d1b20d
fix error types in select
bkamins May 1, 2020
0f3d309
avoid computing idx, starts and ends in combine if regroup=true
bkamins May 1, 2020
1d69fa3
performance improvements
bkamins May 1, 2020
5713194
@simd did not improve the performance here
bkamins May 1, 2020
1f34d55
Update docs/src/man/split_apply_combine.md
bkamins May 1, 2020
2201789
add an example of passing function as a first argument to combine
bkamins May 1, 2020
2aa9170
change regroup to ungroup
bkamins May 2, 2020
cf4736c
Merge branch 'master' into improve_selection
bkamins May 5, 2020
333cca2
Apply suggestions from code review
bkamins May 5, 2020
334aba0
Merge remote-tracking branch 'origin/improve_selection' into improve_…
bkamins May 5, 2020
10b9474
update docs
bkamins May 5, 2020
792b57d
improve description of what gets returned in combine and select
bkamins May 5, 2020
f34873c
fix repeated code
bkamins May 5, 2020
13 changes: 8 additions & 5 deletions docs/src/man/getting_started.md
@@ -773,8 +773,8 @@ julia> describe(df)

```

If you are interested in describing only a subset of columns then the easiest way to do it is to
pass a subset of an original data frame to `describe` like this:
If you are interested in describing only a subset of columns then the easiest way
to do it is to pass a subset of an original data frame to `describe` like this:
```jldoctest dataframe
julia> describe(df[!, [:A]])
1×8 DataFrame
@@ -792,7 +792,7 @@ julia> mean(df.A)
2.5
```

We can also apply a function to each column of a `DataFrame` using `select`. For example:
We can also apply a function to each column of a `DataFrame` using `combine`. For example:
```jldoctest dataframe
julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
4×2 DataFrame
@@ -804,21 +804,24 @@ julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
│ 3 │ 3 │ 2.0 │
│ 4 │ 4 │ 1.0 │

julia> select(df, names(df) .=> sum)
julia> combine(df, names(df) .=> sum)
1×2 DataFrame
│ Row │ A_sum │ B_sum │
│ │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 10 │ 10.0 │

julia> select(df, names(df) .=> sum, names(df) .=> prod)
julia> combine(df, names(df) .=> sum, names(df) .=> prod)
1×4 DataFrame
│ Row │ A_sum │ B_sum │ A_prod │ B_prod │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼────────┼─────────┤
│ 1 │ 10 │ 10.0 │ 24 │ 24.0 │
```

If you would prefer the result to have the same number of rows as the source data
frame use `select` instead of `combine`.
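
The difference in row counts can be sketched as follows. This is a minimal illustration that reuses the `df` from the example above; the variable names `summary` and `expanded` are made up for this sketch and do not appear in the documentation.

```julia
using DataFrames

df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)

# combine: the scalar returned by sum yields a single summary row
summary = combine(df, :A => sum)

# select: the same scalar is repeated so the result has one row per source row
expanded = select(df, :A => sum)

println((nrow(summary), nrow(expanded)))
```

Both calls produce a column named `A_sum`; only the number of rows differs.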

### Handling of Columns Stored in a `DataFrame`

Functions that transform a `DataFrame` to produce a
161 changes: 141 additions & 20 deletions docs/src/man/split_apply_combine.md
@@ -6,10 +6,25 @@ framework for handling this sort of computation is described in the paper
"[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)",
written by Hadley Wickham.

The DataFrames package supports the split-apply-combine strategy through the `by`
function, which is a shorthand for `groupby` followed by `map` and/or `combine`.
`by` takes in three arguments: (1) a `DataFrame`, (2) one or more columns to split
the `DataFrame` on, and (3) a specification of one or more functions to apply to
The DataFrames package supports the split-apply-combine strategy through the
`groupby` function followed by `combine`, `select`/`select!` or `transform`/`transform!`.

In order to perform operations by groups you first need to create a `GroupedDataFrame`
object from your data frame using the `groupby` function that takes two arguments:
(1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:
* `combine`: does not put restrictions on number of rows returned, the order of rows
is specified by the order of groups in `GroupedDataFrame`; it is typically used
to compute summary statistics by group;
* `select`: return a data frame with the number and order of rows exactly the same
as the source data frame, including only new calculated columns;
`select!` is an in-place version of `select`;
* `transform`: return a data frame with the number and order of rows exactly the same
as the source data frame, including all columns from the source and new calculated columns;
`transform!` is an in-place version of `transform`.

All these functions take a specification of one or more functions to apply to
each subset of the `DataFrame`. This specification can be of the following forms:
1. standard column selectors (integers, symbols, vectors of integers, vectors of symbols,
`All`, `:`, `Between`, `Not` and regular expressions)
@@ -27,19 +42,22 @@ each subset of the `DataFrame`. This specification can be of the following forms
number of columns are processed (in which case `SubDataFrame` avoids excessive
compilation)

All forms except 1 and 6 can be also passed as the first argument to `map`.

As a special rule that applies to `cols => function` syntax, if `cols` is wrapped
in an `AsTable` object then a `NamedTuple` containing columns selected by `cols` is
passed to `function`.

In all of these cases, `function` can return either a single row or multiple rows.
`function` can always generate a single column by returning a single value or a vector.
Additionally, if `by` is passed exactly one `function` and `target_col` is not specified,
Additionally, if `combine` is passed exactly one `function`, `cols => function`,
or `cols => function => outcol` as a first argument
and `target_col` is not specified,
`function` can return multiple columns in the form of an `AbstractDataFrame`,
`AbstractMatrix`, `NamedTuple` or `DataFrameRow`.
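
A minimal sketch of the multiple-column case, assuming a small made-up data frame (the names `df`, `g`, `x`, `lo` and `hi` are illustrative only). A single function is passed as the first argument to `combine`, and each field of the returned `NamedTuple` becomes a column:

```julia
using DataFrames

# hypothetical data, not taken from the documentation
df = DataFrame(g = ["a", "b", "a", "b"], x = [1, 5, 3, 7])
gdf = groupby(df, :g)

# the function receives each group as a SubDataFrame;
# the NamedTuple fields lo and hi become columns of the result
res = combine(sdf -> (lo = minimum(sdf.x), hi = maximum(sdf.x)), gdf)

# sort by the grouping key so the row order does not depend on group order
res = sort(res, :g)
```

The result has one row per group, with the grouping column `g` followed by `lo` and `hi`.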

Here are the rules specifying the shape of the resulting `DataFrame`:
`select`/`select!` and `transform`/`transform!` always return a `DataFrame`
with the same number of rows as the source.
For `combine`, the shape of the resulting `DataFrame` is determined
according to the following rules:
- a single value produces a single row and column per group
- a named tuple or `DataFrameRow` produces a single row and one column per field
- a vector produces a single column with one row per entry
@@ -87,7 +105,51 @@ julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/
│ 149 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │
│ 150 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │

julia> by(iris, :Species, :PetalLength => mean)
julia> gdf = groupby(iris, :Species)
GroupedDataFrame with 3 groups based on key: Species
First Group (50 rows): Species = "Iris-setosa"
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ Iris-setosa │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ Iris-setosa │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ Iris-setosa │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ Iris-setosa │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ Iris-setosa │
│ 7 │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ Iris-setosa │
│ 43 │ 4.4 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │
│ 44 │ 5.0 │ 3.5 │ 1.6 │ 0.6 │ Iris-setosa │
│ 45 │ 5.1 │ 3.8 │ 1.9 │ 0.4 │ Iris-setosa │
│ 46 │ 4.8 │ 3.0 │ 1.4 │ 0.3 │ Iris-setosa │
│ 47 │ 5.1 │ 3.8 │ 1.6 │ 0.2 │ Iris-setosa │
│ 48 │ 4.6 │ 3.2 │ 1.4 │ 0.2 │ Iris-setosa │
│ 49 │ 5.3 │ 3.7 │ 1.5 │ 0.2 │ Iris-setosa │
│ 50 │ 5.0 │ 3.3 │ 1.4 │ 0.2 │ Iris-setosa │
Last Group (50 rows): Species = "Iris-virginica"
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │
├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┤
│ 1 │ 6.3 │ 3.3 │ 6.0 │ 2.5 │ Iris-virginica │
│ 2 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ Iris-virginica │
│ 3 │ 7.1 │ 3.0 │ 5.9 │ 2.1 │ Iris-virginica │
│ 4 │ 6.3 │ 2.9 │ 5.6 │ 1.8 │ Iris-virginica │
│ 5 │ 6.5 │ 3.0 │ 5.8 │ 2.2 │ Iris-virginica │
│ 6 │ 7.6 │ 3.0 │ 6.6 │ 2.1 │ Iris-virginica │
│ 7 │ 4.9 │ 2.5 │ 4.5 │ 1.7 │ Iris-virginica │
│ 43 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ Iris-virginica │
│ 44 │ 6.8 │ 3.2 │ 5.9 │ 2.3 │ Iris-virginica │
│ 45 │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ Iris-virginica │
│ 46 │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ Iris-virginica │
│ 47 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ Iris-virginica │
│ 48 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ Iris-virginica │
│ 49 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │
│ 50 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │

julia> combine(gdf, :PetalLength => mean)
3×2 DataFrame
│ Row │ Species │ PetalLength_mean │
│ │ String │ Float64 │
@@ -96,7 +158,7 @@ julia> by(iris, :Species, :PetalLength => mean)
│ 2 │ Iris-versicolor │ 4.26 │
│ 3 │ Iris-virginica │ 5.552 │

julia> by(iris, :Species, nrow)
julia> combine(gdf, nrow)
3×2 DataFrame
│ Row │ Species │ nrow │
│ │ String │ Int64 │
@@ -105,7 +167,7 @@ julia> by(iris, :Species, nrow)
│ 2 │ Iris-versicolor │ 50 │
│ 3 │ Iris-virginica │ 50 │

julia> by(iris, :Species, nrow, :PetalLength => mean => :mean)
julia> combine(gdf, nrow, :PetalLength => mean => :mean)
3×3 DataFrame
│ Row │ Species │ nrow │ mean │
│ │ String │ Int64 │ Float64 │
@@ -114,9 +176,8 @@ julia> by(iris, :Species, nrow, :PetalLength => mean => :mean)
│ 2 │ Iris-versicolor │ 50 │ 4.26 │
│ 3 │ Iris-virginica │ 50 │ 5.552 │

julia> by(iris, :Species,
[:PetalLength, :SepalLength] =>
(p, s) -> (a=mean(p)/mean(s), b=sum(p))) # multiple columns are passed as arguments
julia> combine([:PetalLength, :SepalLength] => (p, s) -> (a=mean(p)/mean(s), b=sum(p)),
gdf) # multiple columns are passed as arguments
3×3 DataFrame
│ Row │ Species │ a │ b │
│ │ String │ Float64 │ Float64 │
@@ -125,9 +186,18 @@ julia> by(iris, :Species,
│ 2 │ Iris-versicolor │ 0.717655 │ 213.0 │
│ 3 │ Iris-virginica │ 0.842744 │ 277.6 │

julia> by(iris, :Species,
AsTable([:PetalLength, :SepalLength]) =>
x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
julia> combine(gdf,
AsTable([:PetalLength, :SepalLength]) =>
x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
3×2 DataFrame
│ Row │ Species │ PetalLength_SepalLength_function │
│ │ String │ Float64 │
├─────┼─────────────────┼──────────────────────────────────┤
│ 1 │ Iris-setosa │ 0.492245 │
│ 2 │ Iris-versicolor │ 0.910378 │
│ 3 │ Iris-virginica │ 0.867923 │

julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), gdf) # passing a SubDataFrame
3×2 DataFrame
│ Row │ Species │ PetalLength_SepalLength_function │
│ │ String │ Float64 │
@@ -136,7 +206,7 @@ julia> by(iris, :Species,
│ 2 │ Iris-versicolor │ 0.910378 │
│ 3 │ Iris-virginica │ 0.867923 │

julia> by(iris, :Species, 1:2 => cor, nrow)
julia> combine(gdf, 1:2 => cor, nrow)
3×3 DataFrame
│ Row │ Species │ SepalLength_SepalWidth_cor │ nrow │
│ │ String │ Float64 │ Int64 │
@@ -147,11 +217,62 @@ julia> by(iris, :Species, 1:2 => cor, nrow)

```

The `by` function also supports the `do` block form. However, as noted above,
Contrary to `combine`, the `select` and `transform` functions always return
a data frame with the same number and order of rows as the source.
In the example below
the return values in columns `:SepalLength_SepalWidth_cor` and `:nrow` are
broadcasted to match the number of elements in each group:
```
julia> select(gdf, 1:2 => cor)
150×2 DataFrame
│ Row │ Species │ SepalLength_SepalWidth_cor │
│ │ String │ Float64 │
├─────┼────────────────┼────────────────────────────┤
│ 1 │ Iris-setosa │ 0.74678 │
│ 2 │ Iris-setosa │ 0.74678 │
│ 3 │ Iris-setosa │ 0.74678 │
│ 4 │ Iris-setosa │ 0.74678 │
│ 5 │ Iris-setosa │ 0.74678 │
│ 6 │ Iris-setosa │ 0.74678 │
│ 7 │ Iris-setosa │ 0.74678 │
│ 143 │ Iris-virginica │ 0.457228 │
│ 144 │ Iris-virginica │ 0.457228 │
│ 145 │ Iris-virginica │ 0.457228 │
│ 146 │ Iris-virginica │ 0.457228 │
│ 147 │ Iris-virginica │ 0.457228 │
│ 148 │ Iris-virginica │ 0.457228 │
│ 149 │ Iris-virginica │ 0.457228 │
│ 150 │ Iris-virginica │ 0.457228 │

julia> transform(gdf, :Species => x -> chop.(x, head=5, tail=0))
150×6 DataFrame
│ Row │ Species │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species_function │
│ │ String │ Float64 │ Float64 │ Float64 │ Float64 │ SubString… │
├─────┼────────────────┼─────────────┼────────────┼─────────────┼────────────┼──────────────────┤
│ 1 │ Iris-setosa │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
│ 2 │ Iris-setosa │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
│ 3 │ Iris-setosa │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
│ 4 │ Iris-setosa │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
│ 5 │ Iris-setosa │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │
│ 6 │ Iris-setosa │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │
│ 7 │ Iris-setosa │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │
│ 143 │ Iris-virginica │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ virginica │
│ 144 │ Iris-virginica │ 6.8 │ 3.2 │ 5.9 │ 2.3 │ virginica │
│ 145 │ Iris-virginica │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ virginica │
│ 146 │ Iris-virginica │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ virginica │
│ 147 │ Iris-virginica │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ virginica │
│ 148 │ Iris-virginica │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ virginica │
│ 149 │ Iris-virginica │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ virginica │
│ 150 │ Iris-virginica │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ virginica │
```
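
The same contrast can be seen on a much smaller table. The sketch below assumes a made-up five-row data frame (`df`, `g`, `x` and the result names are illustrative, not from the documentation) and applies one per-group `mean` through `combine`, `select` and `transform`:

```julia
using DataFrames, Statistics

# hypothetical data: group "a" has mean 3.0, group "b" has mean 4.0
df = DataFrame(g = ["a", "b", "a", "b", "a"], x = [1.0, 2.0, 3.0, 6.0, 5.0])
gdf = groupby(df, :g)

per_group = combine(gdf, :x => mean)    # one row per group
per_row   = select(gdf, :x => mean)     # one row per source row, in source order
with_all  = transform(gdf, :x => mean)  # source columns plus the new x_mean column

println((nrow(per_group), nrow(per_row), nrow(with_all)))
```

In `per_row` and `with_all` the group mean is repeated for every row of its group, so the result lines up with the source data frame row by row.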

The `combine` function also supports the `do` block form. However, as noted above,
this form is slow and should therefore be avoided when performance matters.

```jldoctest sac
julia> by(iris, :Species) do df
julia> combine(gdf) do df
(m = mean(df.PetalLength), s² = var(df.PetalLength))
end
3×3 DataFrame