JuliaData · bkamins · May 5, 2020 · Apr 27, 2020
diff --git a/docs/src/man/getting_started.md b/docs/src/man/getting_started.md
@@ -773,8 +773,8 @@ julia> describe(df)
 
 ```
 
-If you are interested in describing only a subset of columns then the easiest way to do it is to
-pass a subset of an original data frame to `describe` like this:
+If you are interested in describing only a subset of columns then the easiest way
+to do it is to pass a subset of an original data frame to `describe` like this:
 ```jldoctest dataframe
 julia> describe(df[!, [:A]])
 1×8 DataFrame
@@ -792,7 +792,7 @@ julia> mean(df.A)
 2.5
 ```
 
-We can also apply a function to each column of a `DataFrame` using `select`. For example:
+We can also apply a function to each column of a `DataFrame` using `combine`. For example:
 ```jldoctest dataframe
 julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
 4×2 DataFrame
@@ -804,21 +804,24 @@ julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
 │ 3   │ 3     │ 2.0     │
 │ 4   │ 4     │ 1.0     │
 
-julia> select(df, names(df) .=> sum)
+julia> combine(df, names(df) .=> sum)
 1×2 DataFrame
 │ Row │ A_sum │ B_sum   │
 │     │ Int64 │ Float64 │
 ├─────┼───────┼─────────┤
 │ 1   │ 10    │ 10.0    │
 
-julia> select(df, names(df) .=> sum, names(df) .=> prod)
+julia> combine(df, names(df) .=> sum, names(df) .=> prod)
 1×4 DataFrame
 │ Row │ A_sum │ B_sum   │ A_prod │ B_prod  │
 │     │ Int64 │ Float64 │ Int64  │ Float64 │
 ├─────┼───────┼─────────┼────────┼─────────┤
 │ 1   │ 10    │ 10.0    │ 24     │ 24.0    │
 ```
 
+If you would prefer the result to have the same number of rows as the source data
+frame, use `select` instead of `combine`.
+
 ### Handling of Columns Stored in a `DataFrame`
 
 Functions that transform a `DataFrame` to produce a

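To illustrate the `combine` versus `select` distinction drawn in the hunk above, here is a minimal sketch. It reuses the `df` from the doctest and assumes a DataFrames.jl release that includes this change. The variable names `collapsed` and `repeated` are illustrative only and do not appear in the patch.

```julia
using DataFrames

# Same data frame as in the getting-started doctest.
df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)

# `combine` keeps only the single value that `sum` returns for each column.
collapsed = combine(df, names(df) .=> sum)

# `select` keeps one row per source row, so the scalar result is repeated.
repeated = select(df, names(df) .=> sum)

println(size(collapsed))  # (1, 2)
println(size(repeated))   # (4, 2)
```

Both calls produce the columns `A_sum` and `B_sum`. Only the number of rows differs.
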
diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
@@ -6,10 +6,25 @@ framework for handling this sort of computation is described in the paper
 "[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)",
 written by Hadley Wickham.
 
-The DataFrames package supports the split-apply-combine strategy through the `by`
-function, which is a shorthand for `groupby` followed by `map` and/or `combine`.
-`by` takes in three arguments: (1) a `DataFrame`, (2) one or more columns to split
-the `DataFrame` on, and (3) a specification of one or more functions to apply to
+The DataFrames package supports the split-apply-combine strategy through the
+`groupby` function followed by `combine`, `select`/`select!` or `transform`/`transform!`.
+
+In order to perform operations by groups you first need to create a `GroupedDataFrame`
+object from your data frame using the `groupby` function that takes two arguments:
+(1) a data frame to be grouped, and (2) a set of columns to group by.
+
+Operations can then be applied on each group using one of the following functions:
+* `combine`: places no restriction on the number of rows returned; rows are ordered
+  by the order of groups in the `GroupedDataFrame`; it is typically used
+  to compute summary statistics by group;
+* `select`: returns a data frame with the number and order of rows exactly the same
+  as the source data frame, including only the newly calculated columns;
+  `select!` is an in-place version of `select`;
+* `transform`: returns a data frame with the number and order of rows exactly the same
+  as the source data frame, including all columns from the source and the newly calculated columns;
+  `transform!` is an in-place version of `transform`.
+
+All these functions take a specification of one or more functions to apply to
 each subset of the `DataFrame`. This specification can be of the following forms:
 1. standard column selectors (integers, symbols, vectors of integers, vectors of symbols,
    `All`, `:`, `Between`, `Not` and regular expressions)
@@ -27,19 +42,22 @@ each subset of the `DataFrame`. This specification can be of the following forms
    number of columns are processed (in which case `SubDataFrame` avoids excessive
    compilation)
 
-All forms except 1 and 6 can be also passed as the first argument to `map`.
-
 As a special rule that applies to `cols => function` syntax, if `cols` is wrapped
 in an `AsTable` object then a `NamedTuple` containing columns selected by `cols` is
 passed to `function`.
 
 In all of these cases, `function` can return either a single row or multiple rows.
 `function` can always generate a single column by returning a single value or a vector.
-Additionally, if `by` is passed exactly one `function` and `target_col` is not specified,
+Additionally, if `combine` is passed exactly one `function`, `cols => function`,
+or `cols => function => outcol` as the first argument
+and `target_col` is not specified,
 `function` can return multiple columns in the form of an `AbstractDataFrame`,
 `AbstractMatrix`, `NamedTuple` or `DataFrameRow`.
 
-Here are the rules specifying the shape of the resulting `DataFrame`:
+`select`/`select!` and `transform`/`transform!` always return a `DataFrame`
+with the same number of rows as the source.
+For `combine`, the shape of the resulting `DataFrame` is determined
+according to the following rules:
 - a single value produces a single row and column per group
 - a named tuple or `DataFrameRow` produces a single row and one column per field
 - a vector produces a single column with one row per entry
@@ -87,7 +105,51 @@ julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/
 │ 149 │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ Iris-virginica │
 │ 150 │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ Iris-virginica │
 
-julia> by(iris, :Species, :PetalLength => mean)
+julia> gdf = groupby(iris, :Species)
+GroupedDataFrame with 3 groups based on key: Species
+First Group (50 rows): Species = "Iris-setosa"
+│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species     │
+│     │ Float64     │ Float64    │ Float64     │ Float64    │ String      │
+├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┤
+│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ Iris-setosa │
+│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ Iris-setosa │
+│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ Iris-setosa │
+│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ Iris-setosa │
+│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ Iris-setosa │
+│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ Iris-setosa │
+│ 7   │ 4.6         │ 3.4        │ 1.4         │ 0.3        │ Iris-setosa │
+⋮
+│ 43  │ 4.4         │ 3.2        │ 1.3         │ 0.2        │ Iris-setosa │
+│ 44  │ 5.0         │ 3.5        │ 1.6         │ 0.6        │ Iris-setosa │
+│ 45  │ 5.1         │ 3.8        │ 1.9         │ 0.4        │ Iris-setosa │
+│ 46  │ 4.8         │ 3.0        │ 1.4         │ 0.3        │ Iris-setosa │
+│ 47  │ 5.1         │ 3.8        │ 1.6         │ 0.2        │ Iris-setosa │
+│ 48  │ 4.6         │ 3.2        │ 1.4         │ 0.2        │ Iris-setosa │
+│ 49  │ 5.3         │ 3.7        │ 1.5         │ 0.2        │ Iris-setosa │
+│ 50  │ 5.0         │ 3.3        │ 1.4         │ 0.2        │ Iris-setosa │
+⋮
+Last Group (50 rows): Species = "Iris-virginica"
+│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species        │
+│     │ Float64     │ Float64    │ Float64     │ Float64    │ String         │
+├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┤
+│ 1   │ 6.3         │ 3.3        │ 6.0         │ 2.5        │ Iris-virginica │
+│ 2   │ 5.8         │ 2.7        │ 5.1         │ 1.9        │ Iris-virginica │
+│ 3   │ 7.1         │ 3.0        │ 5.9         │ 2.1        │ Iris-virginica │
+│ 4   │ 6.3         │ 2.9        │ 5.6         │ 1.8        │ Iris-virginica │
+│ 5   │ 6.5         │ 3.0        │ 5.8         │ 2.2        │ Iris-virginica │
+│ 6   │ 7.6         │ 3.0        │ 6.6         │ 2.1        │ Iris-virginica │
+│ 7   │ 4.9         │ 2.5        │ 4.5         │ 1.7        │ Iris-virginica │
+⋮
+│ 43  │ 5.8         │ 2.7        │ 5.1         │ 1.9        │ Iris-virginica │
+│ 44  │ 6.8         │ 3.2        │ 5.9         │ 2.3        │ Iris-virginica │
+│ 45  │ 6.7         │ 3.3        │ 5.7         │ 2.5        │ Iris-virginica │
+│ 46  │ 6.7         │ 3.0        │ 5.2         │ 2.3        │ Iris-virginica │
+│ 47  │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ Iris-virginica │
+│ 48  │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ Iris-virginica │
+│ 49  │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ Iris-virginica │
+│ 50  │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ Iris-virginica │
+
+julia> combine(gdf, :PetalLength => mean)
 3×2 DataFrame
 │ Row │ Species         │ PetalLength_mean │
 │     │ String          │ Float64          │
@@ -96,7 +158,7 @@ julia> by(iris, :Species, :PetalLength => mean)
 │ 2   │ Iris-versicolor │ 4.26             │
 │ 3   │ Iris-virginica  │ 5.552            │
 
-julia> by(iris, :Species, nrow)
+julia> combine(gdf, nrow)
 3×2 DataFrame
 │ Row │ Species         │ nrow  │
 │     │ String          │ Int64 │
@@ -105,7 +167,7 @@ julia> by(iris, :Species, nrow)
 │ 2   │ Iris-versicolor │ 50    │
 │ 3   │ Iris-virginica  │ 50    │
 
-julia> by(iris, :Species, nrow, :PetalLength => mean => :mean)
+julia> combine(gdf, nrow, :PetalLength => mean => :mean)
 3×3 DataFrame
 │ Row │ Species         │ nrow  │ mean    │
 │     │ String          │ Int64 │ Float64 │
@@ -114,9 +176,8 @@ julia> by(iris, :Species, nrow, :PetalLength => mean => :mean)
 │ 2   │ Iris-versicolor │ 50    │ 4.26    │
 │ 3   │ Iris-virginica  │ 50    │ 5.552   │
 
-julia> by(iris, :Species,
-          [:PetalLength, :SepalLength] =>
-          (p, s) -> (a=mean(p)/mean(s), b=sum(p))) # multiple columns are passed as arguments
+julia> combine([:PetalLength, :SepalLength] => (p, s) -> (a=mean(p)/mean(s), b=sum(p)),
+               gdf) # multiple columns are passed as arguments
 3×3 DataFrame
 │ Row │ Species         │ a        │ b       │
 │     │ String          │ Float64  │ Float64 │
@@ -125,9 +186,18 @@ julia> by(iris, :Species,
 │ 2   │ Iris-versicolor │ 0.717655 │ 213.0   │
 │ 3   │ Iris-virginica  │ 0.842744 │ 277.6   │
 
-julia> by(iris, :Species,
-          AsTable([:PetalLength, :SepalLength]) =>
-          x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
+julia> combine(gdf,
+               AsTable([:PetalLength, :SepalLength]) =>
+               x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
+3×2 DataFrame
+│ Row │ Species         │ PetalLength_SepalLength_function │
+│     │ String          │ Float64                          │
+├─────┼─────────────────┼──────────────────────────────────┤
+│ 1   │ Iris-setosa     │ 0.492245                         │
+│ 2   │ Iris-versicolor │ 0.910378                         │
+│ 3   │ Iris-virginica  │ 0.867923                         │
+
+julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), gdf) # passing a SubDataFrame
 3×2 DataFrame
 │ Row │ Species         │ PetalLength_SepalLength_function │
 │     │ String          │ Float64                          │
@@ -136,7 +206,7 @@ julia> by(iris, :Species,
 │ 2   │ Iris-versicolor │ 0.910378                         │
 │ 3   │ Iris-virginica  │ 0.867923                         │
 
-julia> by(iris, :Species, 1:2 => cor, nrow)
+julia> combine(gdf, 1:2 => cor, nrow)
 3×3 DataFrame
 │ Row │ Species         │ SepalLength_SepalWidth_cor │ nrow  │
 │     │ String          │ Float64                    │ Int64 │
@@ -147,11 +217,62 @@ julia> by(iris, :Species, 1:2 => cor, nrow)
 
 ```
 
-The `by` function also supports the `do` block form. However, as noted above,
+Contrary to `combine`, the `select` and `transform` functions always return
+a data frame with the same number and order of rows as the source.
+In the first example below,
+the value returned for each group in column `:SepalLength_SepalWidth_cor` is
+broadcast to match the number of rows in that group:
+```
+julia> select(gdf, 1:2 => cor)
+150×2 DataFrame
+│ Row │ Species        │ SepalLength_SepalWidth_cor │
+│     │ String         │ Float64                    │
+├─────┼────────────────┼────────────────────────────┤
+│ 1   │ Iris-setosa    │ 0.74678                    │
+│ 2   │ Iris-setosa    │ 0.74678                    │
+│ 3   │ Iris-setosa    │ 0.74678                    │
+│ 4   │ Iris-setosa    │ 0.74678                    │
+│ 5   │ Iris-setosa    │ 0.74678                    │
+│ 6   │ Iris-setosa    │ 0.74678                    │
+│ 7   │ Iris-setosa    │ 0.74678                    │
+⋮
+│ 143 │ Iris-virginica │ 0.457228                   │
+│ 144 │ Iris-virginica │ 0.457228                   │
+│ 145 │ Iris-virginica │ 0.457228                   │
+│ 146 │ Iris-virginica │ 0.457228                   │
+│ 147 │ Iris-virginica │ 0.457228                   │
+│ 148 │ Iris-virginica │ 0.457228                   │
+│ 149 │ Iris-virginica │ 0.457228                   │
+│ 150 │ Iris-virginica │ 0.457228                   │
+
+julia> transform(gdf, :Species => x -> chop.(x, head=5, tail=0))
+150×6 DataFrame
+│ Row │ Species        │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species_function │
+│     │ String         │ Float64     │ Float64    │ Float64     │ Float64    │ SubString…       │
+├─────┼────────────────┼─────────────┼────────────┼─────────────┼────────────┼──────────────────┤
+│ 1   │ Iris-setosa    │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa           │
+│ 2   │ Iris-setosa    │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa           │
+│ 3   │ Iris-setosa    │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa           │
+│ 4   │ Iris-setosa    │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa           │
+│ 5   │ Iris-setosa    │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa           │
+│ 6   │ Iris-setosa    │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa           │
+│ 7   │ Iris-setosa    │ 4.6         │ 3.4        │ 1.4         │ 0.3        │ setosa           │
+⋮
+│ 143 │ Iris-virginica │ 5.8         │ 2.7        │ 5.1         │ 1.9        │ virginica        │
+│ 144 │ Iris-virginica │ 6.8         │ 3.2        │ 5.9         │ 2.3        │ virginica        │
+│ 145 │ Iris-virginica │ 6.7         │ 3.3        │ 5.7         │ 2.5        │ virginica        │
+│ 146 │ Iris-virginica │ 6.7         │ 3.0        │ 5.2         │ 2.3        │ virginica        │
+│ 147 │ Iris-virginica │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ virginica        │
+│ 148 │ Iris-virginica │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ virginica        │
+│ 149 │ Iris-virginica │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ virginica        │
+│ 150 │ Iris-virginica │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ virginica        │
+```
+
+The `combine` function also supports the `do` block form. However, as noted above,
 this form is slow and should therefore be avoided when performance matters.
 
 ```jldoctest sac
-julia> by(iris, :Species) do df
+julia> combine(gdf) do df
            (m = mean(df.PetalLength), s² = var(df.PetalLength))
        end
 3×3 DataFrame