[BREAKING] Add column indexing using strings
User visible changes:
* allow using an AbstractString everywhere where Symbol was accepted as a column indicator (i.e. this extends beyond just getindex and getproperty, every function was reviewed and made consistent)
* docstrings were updated to reflect this change + some unrelated docs fixes were added where spotted
* names function consistently returns vector of strings for all types that return column names
* propertynames function consistently returns vector of Symbols for all types that return column names
* keys function consistently returns vector of Symbols for all types that return column names as keys
* depend on DataAPI.jl v1.2
* rename! passes String to a function if API with function is used
* consistently define hasproperty for types that have custom getproperty
* add cols to names function for types where it makes sense consistently
* push! accepts dicts with string keys
* indexing with an arbitrary eltype vector now throws better error messages (an informative `ArgumentError` is thrown in more cases rather than a `MethodError` as in the past)
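
As a rough illustration of the user-visible changes listed above, the following sketch exercises a few of them. The data frame `df` and its contents are made up for the example, and the snippet assumes a DataFrames.jl version that includes this commit:

```julia
using DataFrames

# Hypothetical example data, not taken from the commit.
df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# A string and a Symbol select the same column (no copy is made with `!`).
same_column = df[!, "A"] === df[!, :A]

# Vectors of strings are accepted wherever vectors of Symbols were.
same_selection = df[:, ["B", "A"]] == df[:, [:B, :A]]

# push! accepts a dictionary keyed by column-name strings.
push!(df, Dict("A" => 4, "B" => "w"))

# rename! with a function now passes each column name as a String,
# so String functions such as `lowercase` can be used directly.
rename!(lowercase, df)
```

After running this, `names(df)` should list the lowercased names as strings.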

Internal changes:
* add definitions of SymbolOrString, ColumnIndex (updated), and MultiColumnIndex unions and use them consistently across whole codebase
* remove keys from AbstractIndex as it was never used
* correct formatting of code in many places (mainly docstrings, function signatures, and adding missing return statements in long functions as many of them were updated anyway, but also in unrelated places)
* fix a rule that _names returns vector of Symbols without copying
* in some places switch from names to _names internally to avoid unnecessary allocations
* in some places use groupcols where this function would be appropriate to call; similarly with names(df, cols) where it made sense
* remove without function that was not needed anymore

Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: pdeffebach <[email protected]>
3 people authored Apr 27, 2020
1 parent ccde40a commit b1f675d
Showing 42 changed files with 2,709 additions and 1,069 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -35,7 +35,7 @@ test = ["DataStructures", "DataValues", "Dates", "Logging", "Random", "Test"]
julia = "1"
CategoricalArrays = "0.8"
Compat = "2.2, 3"
DataAPI = "1.0.1"
DataAPI = "1.2"
InvertedIndices = "1"
IteratorInterfaceExtensions = "0.1.1, 1"
Missings = "0.4.2"
33 changes: 25 additions & 8 deletions docs/src/lib/indexing.md
@@ -17,9 +17,11 @@ and broadcasting are intended to work with `DataFrame`, `SubDataFrame` and `Data
The rules for a valid type of index into a column are the following:
* a value, later denoted as `col`:
* a `Symbol`;
* an `AbstractString`;
* an `Integer` that is not `Bool`;
* a vector, later denoted as `cols`:
* a vector of `Symbol` (does not have to be a subtype of `AbstractVector{Symbol}`);
* a vector of `AbstractString` (does not have to be a subtype of `AbstractVector{<:AbstractString}`);
* a vector of `Integer` other than `Bool` (does not have to be a subtype of `AbstractVector{<:Integer}`);
* a vector of `Bool` that has to be a subtype of `AbstractVector{Bool}`;
* a regular expression, which gets expanded to a vector of matching column names;
@@ -122,13 +124,14 @@ so it is unsafe to use it afterwards (the column length correctness will be pres
* `df[CartesianIndex(row, col)] = v` -> the same as `df[row, col] = v`;
* `df[row, cols] = v` -> set row `row` of columns `cols` in-place; the same as `dfr = df[row, cols]; dfr[:] = v`;
* `df[rows, col] = v` -> set rows `rows` of column `col` in-place; `v` must be an `AbstractVector`;
if `rows` is `:` and `col` is a `Symbol` that is not present in `df` then a new column
in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier;
if `rows` is `:` and `col` is a `Symbol` or `AbstractString`
that is not present in `df` then a new column in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier;
* `df[rows, cols] = v` -> set rows `rows` of columns `cols` in-place; `v` must be an `AbstractMatrix` or an `AbstractDataFrame`
(in this case column names must match);
* `df[!, col] = v` -> replaces `col` with `v` without copying
(with the exception that if `v` is an `AbstractRange` it gets converted to a `Vector`);
also if `col` is a `Symbol` that is not present in `df` then a new column in `df` is created and holds `v`;
also if `col` is a `Symbol` or `AbstractString` that is not present in `df` then
a new column in `df` is created and holds `v`;
equivalent to `df.col = v` if `col` is a valid identifier;
this is allowed if `ncol(df) == 0 || length(v) == nrow(df)`;
* `df[!, cols] = v` -> replaces existing columns `cols` in data frame `df` with copying;
@@ -183,10 +186,10 @@ Additional rules:
* in the `df[CartesianIndex(row, col)] .= v`, `df[row, col] .= v` syntaxes `v` is broadcasted into the contents of `df[row, col]` (this is consistent with Julia Base);
* in the `df[row, cols] .= v` syntaxes the assignment to `df` is performed in-place;
* in the `df[rows, col] .= v` and `df[rows, cols] .= v` syntaxes the assignment to `df` is performed in-place;
if `rows` is `:` and `col` is `Symbol` and it is missing from `df` then a new column is allocated and added;
if `rows` is `:` and `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added;
the length of the column is always the value of `nrow(df)` before the assignment takes place;
* in the `df[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector;
if `col` is `Symbol` and it is missing from `df` then a new column is allocated added;
if `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added;
the length of the column is always the value of `nrow(df)` before the assignment takes place;
* the `df[!, cols] .= v` syntax replaces existing columns `cols` in data frame `df` with freshly allocated vectors;
* `df.col .= v` syntax is allowed and performs in-place assignment to an existing vector `df.col`.
@@ -197,9 +200,8 @@ Additional rules:

Note that `sdf[!, col] .= v` and `sdf[!, cols] .= v` syntaxes are not allowed as `sdf` can be only modified in-place.

If column indexing using `Symbol` names in `cols` is performed, the order of columns in the operation is specified
by the order of names.

If column indexing using `Symbol` or `AbstractString` names in `cols` is performed, the order
of columns in the operation is specified by the order of names.
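
The `setindex!` and ordering rules above can be sketched with string column names. This is a minimal example under the assumption that the rules behave as documented in this section; `df` and `v` are hypothetical:

```julia
using DataFrames

# Hypothetical example data.
df = DataFrame(A = 1:3)
v = [10, 20, 30]

# `!` with a new string name stores `v` itself (no copy).
df[!, "B"] = v

# `:` with a new string name stores a copy of `v`.
df[:, "C"] = v

stored_without_copy = df.B === v
stored_as_copy = df.C !== v && df.C == v

# The order of the names in `cols` decides the order of the resulting columns.
reordered = names(df[:, ["C", "A"]])
```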

## Indexing `GroupedDataFrame`s

@@ -230,3 +232,18 @@ The elements of a `GroupedDataFrame` are [`SubDataFrame`](@ref)s of its parent.
* `gd[n::Not]` -> Any of the above types wrapped in `Not`. The result
will be a new `GroupedDataFrame` containing all groups in `gd` *not* selected
by the wrapped index.

# Common API for types defined in DataFrames.jl

This table presents return value types of calling `names`, `propertynames` and `keys`
on types exposed to the user by DataFrames.jl:

| Type | `names` | `propertynames` | `keys` |
|---------------------|------------------|------------------|------------------|
| `AbstractDataFrame` | `Vector{String}` | `Vector{Symbol}` | undefined |
| `DataFrameRow` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
| `DataFrameColumns` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `GroupedDataFrame` | `Vector{String}` | tuple of fields | `GroupKeys` |
| `GroupKeys` | undefined | tuple of fields | vector of `Int` |
| `GroupKey` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
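
A quick way to check the first two rows of the table is sketched below. The data frame is hypothetical, and the snippet only covers `AbstractDataFrame` and `DataFrameRow`:

```julia
using DataFrames

# Hypothetical example data.
df = DataFrame(A = 1:2, B = ["x", "y"])
row = df[1, :]                        # a DataFrameRow

df_names = names(df)                  # column names as strings
df_props = propertynames(df)          # column names as Symbols
row_names = names(row)
row_keys = keys(row)                  # Symbols, as for propertynames(row)

# `names` also accepts a column selector, here a regular expression.
a_columns = names(df, r"^A")
```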
1 change: 1 addition & 0 deletions docs/src/lib/types.md
@@ -109,6 +109,7 @@ without caution because:

```@docs
AbstractDataFrame
AsTable
ByRow
DataFrame
DataFrameRow
36 changes: 34 additions & 2 deletions docs/src/man/getting_started.md
@@ -45,7 +45,8 @@ julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
```

Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name. Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position.
Columns can be directly (i.e. without copying) accessed via `df.col`, `df."col"`, `df[!, :col]` or `df[!, "col"]`. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as `:col`, `:var"col"` or `Symbol("col")`) or strings (written as `"col"`).
Columns can also be accessed using an integer index specifying their position.

Since `df[!, :col]` does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original `df`. To get a copy of the column use `df[:, :col]`: changing the vector returned by this syntax does not change `df`.

@@ -58,6 +59,13 @@ julia> df.A
3
4
julia> df."A"
4-element Array{Int64,1}:
1
2
3
4
julia> df.A === df[!, :A]
true
@@ -67,6 +75,15 @@ false
julia> df.A == df[:, :A]
true
julia> df.A === df[!, "A"]
true
julia> df.A === df[:, "A"]
false
julia> df.A == df[:, "A"]
true
julia> df.A === df[!, 1]
true
@@ -89,15 +106,30 @@ julia> df[:, firstcolumn] == df.A
true
```

Column names can be obtained using the `names` function:
Column names can be obtained as strings using the `names` function:

```jldoctest dataframe
julia> names(df)
2-element Array{String,1}:
"A"
"B"
```

To get column names as `Symbol`s use the `propertynames` function:
```
julia> propertynames(df)
2-element Array{Symbol,1}:
:A
:B
```

!!! note

DataFrames.jl allows using `Symbol`s (like `:A`) and strings (like `"A"`)
for all column indexing operations for convenience.
However, using `Symbol`s is slightly faster and should generally be preferred.


### Constructing Column by Column

It is also possible to start with an empty `DataFrame` and add columns to it one by one: