[BREAKING] Add column indexing using strings #2199
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -17,9 +17,11 @@ and broadcasting are intended to work with `DataFrame`, `SubDataFrame` and `Data | |||||
The rules for a valid type of index into a column are the following: | ||||||
* a value, later denoted as `col`: | ||||||
* a `Symbol`; | ||||||
* an `AbstractString`; | ||||||
* an `Integer` that is not `Bool`; | ||||||
* a vector, later denoted as `cols`: | ||||||
* a vector of `Symbol` (does not have to be a subtype of `AbstractVector{Symbol}`); | ||||||
* a vector of `AbstractString` (does not have to be a subtype of `AbstractVector{<:AbstractString}`); | ||||||
* a vector of `Integer` other than `Bool` (does not have to be a subtype of `AbstractVector{<:Integer}`); | ||||||
* a vector of `Bool` that has to be a subtype of `AbstractVector{Bool}`; | ||||||
* a regular expression, which gets expanded to a vector of matching column names; | ||||||
|
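The index types listed in this hunk can be sketched as follows. This is a minimal illustration, assuming a DataFrames.jl release that includes the string indexing added in this PR; the data frame `df` and its column names are made up for the example.

```julia
using DataFrames

# Hypothetical example data; the column names are chosen so the regex matches two of them.
df = DataFrame(A = 1:3, B = ["x", "y", "z"], AB = [1.0, 2.0, 3.0])

# Single-column indices (`col`): a Symbol, a string, or a non-Bool integer.
by_symbol = df[!, :A]
by_string = df[!, "A"]
by_position = df[!, 1]

# Multi-column indices (`cols`).
sel_symbols = df[:, [:A, :B]]
sel_strings = df[:, ["A", "B"]]
sel_ints = df[:, [1, 2]]
sel_bools = df[:, [true, true, false]]   # one Bool per column
sel_regex = df[:, r"^A"]                 # expands to the matching column names
```

All three single-column forms refer to the same stored vector, and the first four multi-column forms select the same two columns.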
@@ -122,13 +124,14 @@ so it is unsafe to use it afterwards (the column length correctness will be pres | |||||
* `df[CartesianIndex(row, col)] = v` -> the same as `df[row, col] = v`; | ||||||
* `df[row, cols] = v` -> set row `row` of columns `cols` in-place; the same as `dfr = df[row, cols]; dfr[:] = v`; | ||||||
* `df[rows, col] = v` -> set rows `rows` of column `col` in-place; `v` must be an `AbstractVector`; | ||||||
if `rows` is `:` and `col` is a `Symbol` that is not present in `df` then a new column | ||||||
in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier; | ||||||
if `rows` is `:` and `col` is a `Symbol` or `AbstractString` | ||||||
that is not present in `df` then a new column in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier; | ||||||
* `df[rows, cols] = v` -> set rows `rows` of columns `cols` in-place; `v` must be an `AbstractMatrix` or an `AbstractDataFrame` | ||||||
(in this case column names must match); | ||||||
* `df[!, col] = v` -> replaces `col` with `v` without copying | ||||||
(with the exception that if `v` is an `AbstractRange` it gets converted to a `Vector`); | ||||||
also if `col` is a `Symbol` that is not present in `df` then a new column in `df` is created and holds `v`; | ||||||
also if `col` is a `Symbol` or `AbstractString` that is not present in `df` then | ||||||
a new column in `df` is created and holds `v`; | ||||||
equivalent to `df.col = v` if `col` is a valid identifier; | ||||||
this is allowed if `ncol(df) == 0 || length(v) == nrow(df)`; | ||||||
* `df[!, cols] = v` -> replaces existing columns `cols` in data frame `df` with copying; | ||||||
|
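The difference between `df[!, col] = v` and `df[:, col] = v` when `col` is a string naming a new column can be sketched like this. It is only an illustration under the rules quoted above, assuming a DataFrames.jl release that includes this PR; `df`, `v` and the column names are hypothetical.

```julia
using DataFrames

df = DataFrame(A = 1:3)
v = [10, 20, 30]

# `!` stores `v` itself as the new column "B" (no copy is made).
df[!, "B"] = v

# `:` with a column name that is not present creates column "C" holding `copy(v)`.
df[:, "C"] = v

same_object = df.B === v      # the column is the very same vector
copied = df.C !== v           # the column is a distinct vector
equal_values = df.C == v      # but it holds equal values
```

Because `df.B` aliases `v`, a later mutation of `v` would also be visible in `df.B`, which is why the copying form is the safer default when `v` is reused elsewhere.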
@@ -183,10 +186,10 @@ Additional rules: | |||||
* in the `df[CartesianIndex(row, col)] .= v`, `df[row, col] .= v` syntaxes `v` is broadcasted into the contents of `df[row, col]` (this is consistent with Julia Base); | ||||||
* in the `df[row, cols] .= v` syntaxes the assignment to `df` is performed in-place; | ||||||
* in the `df[rows, col] .= v` and `df[rows, cols] .= v` syntaxes the assignment to `df` is performed in-place; | ||||||
if `rows` is `:` and `col` is `Symbol` and it is missing from `df` then a new column is allocated and added; | ||||||
if `rows` is `:` and `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added; | ||||||
the length of the column is always the value of `nrow(df)` before the assignment takes place; | ||||||
* in the `df[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector; | ||||||
if `col` is `Symbol` and it is missing from `df` then a new column is allocated added; | ||||||
if `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added; | ||||||
the length of the column is always the value of `nrow(df)` before the assignment takes place; | ||||||
* the `df[!, cols] .= v` syntax replaces existing columns `cols` in data frame `df` with freshly allocated vectors; | ||||||
* `df.col .= v` syntax is allowed and performs in-place assignment to an existing vector `df.col`. | ||||||
|
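A short sketch of the broadcasting rules above with string column names, assuming a DataFrames.jl release that includes this PR. The names `df` and `old` are hypothetical and only serve the illustration.

```julia
using DataFrames

df = DataFrame(A = 1:3)
old = df.A                 # the vector currently stored as column "A"

# `df[:, col] .= v` writes into the existing vector in-place.
df[:, "A"] .= 0
mutated_in_place = (df.A === old) && (old == [0, 0, 0])

# `df[!, col] .= v` replaces the column with a freshly allocated vector,
# so the element type is free to change.
df[!, "A"] .= 1.5
replaced = (df.A !== old) && (eltype(df.A) == Float64)

# Both forms create a column of length `nrow(df)` when the name is missing.
df[!, "D"] .= "x"
df[:, "E"] .= 1
```

The in-place form keeps the element type of the existing column, so broadcasting a value that cannot be converted to that type would throw instead of widening the column.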
@@ -197,9 +200,8 @@ Additional rules: | |||||
|
||||||
Note that `sdf[!, col] .= v` and `sdf[!, cols] .= v` syntaxes are not allowed as `sdf` can only be modified in-place. | ||||||
|
||||||
If column indexing using `Symbol` names in `cols` is performed, the order of columns in the operation is specified | ||||||
by the order of names. | ||||||
|
||||||
If column indexing using `Symbol` or `AbstractString` names in `cols` is performed, the order | ||||||
of columns in the operation is specified by the order of names. | ||||||
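As a small illustration of the ordering rule, assuming a DataFrames.jl release that includes this PR (the data frame and its column names are hypothetical):

```julia
using DataFrames

df = DataFrame(A = 1:2, B = 3:4, C = 5:6)

# The result follows the order of the requested names, not the order in `df`.
by_strings = df[:, ["C", "A"]]
by_symbols = df[:, [:B, :A]]
```

Here `by_strings` has columns `C` then `A`, and `by_symbols` has columns `B` then `A`.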
|
||||||
## Indexing `GroupedDataFrame`s | ||||||
|
||||||
|
@@ -230,3 +232,18 @@ The elements of a `GroupedDataFrame` are [`SubDataFrame`](@ref)s of its parent. | |||||
* `gd[n::Not]` -> Any of the above types wrapped in `Not`. The result | ||||||
will be a new `GroupedDataFrame` containing all groups in `gd` *not* selected | ||||||
by the wrapped index. | ||||||
|
||||||
# Common API for types defined in DataFrames.jl | ||||||
|
||||||
This table presents return value types of calling `names`, `propertynames` and `keys` | ||||||
on types exposed to the user by DataFrames.jl: | ||||||
|
||||||
| Type | `names` | `propertynames` | `keys` | | ||||||
|---------------------|------------------|------------------|------------------| | ||||||
| `AbstractDataFrame` | `Vector{String}` | `Vector{Symbol}` | undefined | | ||||||
| `DataFrameRow` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` | | ||||||
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` | | ||||||
| `DataFrameColumns` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` | | ||||||
| `GroupedDataFrame` | `Vector{String}` | tuple of fields | `GroupKeys` | | ||||||
| `GroupKeys` | undefined | tuple of fields | vector of `Int` | | ||||||
Suggested change
same here - it is
||||||
| `GroupKey` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` | |
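The first two rows of the table can be checked with a sketch like the one below. It covers only `AbstractDataFrame` and `DataFrameRow` and assumes a DataFrames.jl release that includes this PR; `df` and `dfr` are hypothetical names.

```julia
using DataFrames

df = DataFrame(A = 1:2, B = 3:4)
dfr = df[1, :]                     # a DataFrameRow

df_names = names(df)               # column names as strings
df_props = propertynames(df)       # column names as Symbols

row_names = names(dfr)
row_props = propertynames(dfr)
row_keys = keys(dfr)               # for a DataFrameRow, the column names as Symbols
```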
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -109,6 +109,7 @@ without caution because: | |
|
||
```@docs | ||
AbstractDataFrame | ||
AsTable | ||
ByRow | ||
DataFrame | ||
DataFrameRow | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -45,7 +45,8 @@ julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"]) | |||||
|
||||||
``` | ||||||
|
||||||
Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name. Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position. | ||||||
Columns can be directly (i.e. without copying) accessed via `df.col`, `df."col"`, `df[!, :col]` or `df[!, "col"]`. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as `:col`, `:var"col"` or `Symbol("col")`) or strings (written as `"col"`). | ||||||
It seems this page would be worth documenting if
Either that it works, or it doesn't
It does not work. I will add a note in the PR that follows your comments.
||||||
Columns can also be accessed using an integer index specifying their position. | ||||||
|
||||||
Since `df[!, :col]` does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original `df`. To get a copy of the column use `df[:, :col]`: changing the vector returned by this syntax does not change `df`. | ||||||
|
||||||
|
@@ -58,6 +59,13 @@ julia> df.A | |||||
3 | ||||||
4 | ||||||
|
||||||
julia> df."A" | ||||||
|
||||||
4-element Array{Int64,1}: | ||||||
1 | ||||||
2 | ||||||
3 | ||||||
4 | ||||||
|
||||||
julia> df.A === df[!, :A] | ||||||
true | ||||||
|
||||||
|
@@ -67,6 +75,15 @@ false | |||||
julia> df.A == df[:, :A] | ||||||
true | ||||||
|
||||||
julia> df.A === df[!, "A"] | ||||||
true | ||||||
|
||||||
julia> df.A === df[:, "A"] | ||||||
false | ||||||
|
||||||
julia> df.A == df[:, "A"] | ||||||
true | ||||||
|
||||||
julia> df.A === df[!, 1] | ||||||
true | ||||||
|
||||||
|
@@ -89,15 +106,28 @@ julia> df[:, firstcolumn] == df.A | |||||
true | ||||||
``` | ||||||
|
||||||
Column names can be obtained using the `names` function: | ||||||
Column names can be obtained as strings using the `names` function: | ||||||
|
||||||
```jldoctest dataframe | ||||||
julia> names(df) | ||||||
2-element Array{Symbol,1}: | ||||||
:A | ||||||
:B | ||||||
2-element Array{String,1}: | ||||||
"A" | ||||||
"B" | ||||||
``` | ||||||
|
||||||
To get column names as `Symbol`s use the `propertynames` function: | ||||||
``` | ||||||
julia> propertynames(df) | ||||||
2-element Array{Symbol,1}: | ||||||
:A | ||||||
:B | ||||||
|
||||||
``` | ||||||
|
||||||
!!! note | ||||||
|
||||||
DataFrames.jl allows using `Symbol`s (like `:A`) and strings (like `"A"`) | ||||||
for all column indexing operations for convenience. | ||||||
However, using `Symbol`s is slightly faster and should generally be preferred. | ||||||
Suggested change
added
||||||
|
||||||
|
||||||
### Constructing Column by Column | ||||||
|
||||||
It is also possible to start with an empty `DataFrame` and add columns to it one by one: | ||||||
|
Actually it is
LinearIndices{1,Tuple{Base.OneTo{Int}}}
but I wanted to avoid writing this as it would be confusing I think.