[BREAKING] Add column indexing using strings #2199

Merged
merged 28 commits into from
Apr 27, 2020
60e72bf
change rename to pass string to columns
bkamins Apr 16, 2020
3a44c8f
allow string indexing (minimal test and docs updates)
bkamins Apr 16, 2020
ae2691a
Merge branch 'master' into add_string_col_indexing
bkamins Apr 17, 2020
cc9bbca
make keys consistent
bkamins Apr 17, 2020
73a5a1a
fixed old tests
bkamins Apr 17, 2020
64741b2
update manual
bkamins Apr 17, 2020
e1e04bb
correct location of deleterows! deprecated tests
bkamins Apr 17, 2020
b390558
fix for Julia 1.0
bkamins Apr 17, 2020
ddb84f2
only abstractdataframe.jl and dataframe.jl left to test
bkamins Apr 17, 2020
981e8bf
only dataframe.jl left
bkamins Apr 17, 2020
5de0354
remove internal keys test
bkamins Apr 17, 2020
4f8b2dc
ready for review
bkamins Apr 18, 2020
38ad85c
sync with DataAPI v1.2
bkamins Apr 20, 2020
e0cf2bd
Apply suggestions from code review
bkamins Apr 21, 2020
95aa693
updates after the code review
bkamins Apr 21, 2020
670664e
update documentation and tests
bkamins Apr 22, 2020
4b52cc8
Apply suggestions from code review
bkamins Apr 22, 2020
ab29c54
move one ! to a correct place
bkamins Apr 22, 2020
b93a807
Merge remote-tracking branch 'origin/add_string_col_indexing' into ad…
bkamins Apr 22, 2020
26b1980
fix cols documentation
bkamins Apr 22, 2020
3678130
update constructor tests
bkamins Apr 22, 2020
0bb2a99
Apply suggestions from code review
bkamins Apr 22, 2020
30a8f70
introduce MultiColumnIndex and fix overlong lines
bkamins Apr 22, 2020
9a3cee0
add comment why AbstractVector is excluded when handling cols in sort
bkamins Apr 22, 2020
cf879d3
Apply suggestions from code review
bkamins Apr 23, 2020
9cc3702
updates after code review
bkamins Apr 23, 2020
2a935a6
Update docs/src/man/getting_started.md
bkamins Apr 24, 2020
e40461e
define SymbolOrString
bkamins Apr 24, 2020
2 changes: 1 addition & 1 deletion Project.toml
@@ -35,7 +35,7 @@ test = ["DataStructures", "DataValues", "Dates", "Logging", "Random", "Test"]
julia = "1"
CategoricalArrays = "0.8"
Compat = "2.2, 3"
DataAPI = "1.0.1"
DataAPI = "1.2"
InvertedIndices = "1"
IteratorInterfaceExtensions = "0.1.1, 1"
Missings = "0.4.2"
33 changes: 25 additions & 8 deletions docs/src/lib/indexing.md
@@ -17,9 +17,11 @@ and broadcasting are intended to work with `DataFrame`, `SubDataFrame` and `Data
The rules for a valid type of index into a column are the following:
* a value, later denoted as `col`:
* a `Symbol`;
* an `AbstractString`;
* an `Integer` that is not `Bool`;
* a vector, later denoted as `cols`:
* a vector of `Symbol` (does not have to be a subtype of `AbstractVector{Symbol}`);
* a vector of `AbstractString` (does not have to be a subtype of `AbstractVector{<:AbstractString}`);
* a vector of `Integer` other than `Bool` (does not have to be a subtype of `AbstractVector{<:Integer}`);
* a vector of `Bool` that has to be a subtype of `AbstractVector{Bool}`;
* a regular expression, which gets expanded to a vector of matching column names;
@@ -122,13 +124,14 @@ so it is unsafe to use it afterwards (the column length correctness will be pres
* `df[CartesianIndex(row, col)] = v` -> the same as `df[row, col] = v`;
* `df[row, cols] = v` -> set row `row` of columns `cols` in-place; the same as `dfr = df[row, cols]; dfr[:] = v`;
* `df[rows, col] = v` -> set rows `rows` of column `col` in-place; `v` must be an `AbstractVector`;
if `rows` is `:` and `col` is a `Symbol` that is not present in `df` then a new column
in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier;
if `rows` is `:` and `col` is a `Symbol` or `AbstractString`
that is not present in `df` then a new column in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier;
* `df[rows, cols] = v` -> set rows `rows` of columns `cols` in-place; `v` must be an `AbstractMatrix` or an `AbstractDataFrame`
(in this case column names must match);
* `df[!, col] = v` -> replaces `col` with `v` without copying
(with the exception that if `v` is an `AbstractRange` it gets converted to a `Vector`);
also if `col` is a `Symbol` that is not present in `df` then a new column in `df` is created and holds `v`;
also if `col` is a `Symbol` or `AbstractString` that is not present in `df` then
a new column in `df` is created and holds `v`;
equivalent to `df.col = v` if `col` is a valid identifier;
this is allowed if `ncol(df) == 0 || length(v) == nrow(df)`;
* `df[!, cols] = v` -> replaces existing columns `cols` in data frame `df` with copying;
@@ -183,10 +186,10 @@ Additional rules:
* in the `df[CartesianIndex(row, col)] .= v`, `df[row, col] .= v` syntaxes `v` is broadcasted into the contents of `df[row, col]` (this is consistent with Julia Base);
* in the `df[row, cols] .= v` syntaxes the assignment to `df` is performed in-place;
* in the `df[rows, col] .= v` and `df[rows, cols] .= v` syntaxes the assignment to `df` is performed in-place;
if `rows` is `:` and `col` is `Symbol` and it is missing from `df` then a new column is allocated and added;
if `rows` is `:` and `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added;
the length of the column is always the value of `nrow(df)` before the assignment takes place;
* in the `df[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector;
if `col` is `Symbol` and it is missing from `df` then a new column is allocated and added;
if `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added;
the length of the column is always the value of `nrow(df)` before the assignment takes place;
* the `df[!, cols] .= v` syntax replaces existing columns `cols` in data frame `df` with freshly allocated vectors;
* `df.col .= v` syntax is allowed and performs in-place assignment to an existing vector `df.col`.
@@ -197,9 +200,8 @@ Additional rules:

Note that `sdf[!, col] .= v` and `sdf[!, cols] .= v` syntaxes are not allowed as `sdf` can be only modified in-place.

If column indexing using `Symbol` names in `cols` is performed, the order of columns in the operation is specified
by the order of names.

If column indexing using `Symbol` or `AbstractString` names in `cols` is performed, the order
of columns in the operation is specified by the order of names.
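
As a minimal sketch of the rules in this file (assuming a DataFrames.jl release that includes this change; the data frame and column names below are illustrative only), `Symbol` and string names can be used interchangeably, and the order of names in `cols` determines the column order of the result:

```julia
using DataFrames

# Hypothetical example data; not taken from the PR.
df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# A Symbol and a string name refer to the same column, accessed without copying.
same_vector = df[!, :A] === df[!, "A"]

# The order of names in `cols` fixes the order of columns in the result.
sub = df[:, ["B", "A"]]

# Assigning with `!` to a name that is not present creates a new column.
df[!, "C"] = [10, 20, 30]
```

Here `names(sub)` is expected to list `"B"` before `"A"`, and `df` ends up with three columns.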

## Indexing `GroupedDataFrame`s

@@ -230,3 +232,18 @@ The elements of a `GroupedDataFrame` are [`SubDataFrame`](@ref)s of its parent.
* `gd[n::Not]` -> Any of the above types wrapped in `Not`. The result
will be a new `GroupedDataFrame` containing all groups in `gd` *not* selected
by the wrapped index.

# Common API for types defined in DataFrames.jl

This table presents return value types of calling `names`, `propertynames` and `keys`
on types exposed to the user by DataFrames.jl:

| Type | `names` | `propertynames` | `keys` |
|---------------------|------------------|------------------|------------------|
| `AbstractDataFrame` | `Vector{String}` | `Vector{Symbol}` | undefined |
| `DataFrameRow` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
Contributor

Suggested change
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | `Vector{Int}` |

Member Author

Actually it is LinearIndices{1,Tuple{Base.OneTo{Int}}} but I wanted to avoid writing this as it would be confusing I think.

| `DataFrameColumns` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `GroupedDataFrame` | `Vector{String}` | tuple of fields | `GroupKeys` |
| `GroupKeys` | undefined | tuple of fields | vector of `Int` |
Contributor

Suggested change
| `GroupKeys` | undefined | tuple of fields | vector of `Int` |
| `GroupKeys` | undefined | tuple of fields | `AbstractVector{Int}` |

Member Author

same here - it is LinearIndices{1,Tuple{Base.OneTo{Int}}}

| `GroupKey` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
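
A short sketch of the first two rows of the table, assuming a DataFrames.jl release that includes this change (the data frame below is illustrative only). `keys` is deliberately not called on the data frame itself, since the table lists it as undefined there:

```julia
using DataFrames

# Hypothetical example data; not taken from the PR.
df = DataFrame(A = 1:2, B = ["M", "F"])
row = df[1, :]                    # a DataFrameRow

df_names = names(df)              # column names as strings
df_props = propertynames(df)      # column names as Symbols

row_names = names(row)            # strings, as for the parent data frame
row_keys = keys(row)              # Symbols, per the `DataFrameRow` row of the table
```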
1 change: 1 addition & 0 deletions docs/src/lib/types.md
@@ -109,6 +109,7 @@ without caution because:

```@docs
AbstractDataFrame
AsTable
ByRow
DataFrame
DataFrameRow
36 changes: 34 additions & 2 deletions docs/src/man/getting_started.md
@@ -45,7 +45,8 @@ julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

```

Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name. Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position.
Columns can be directly (i.e. without copying) accessed via `df.col`, `df."col"`, `df[!, :col]` or `df[!, "col"]`. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as `:col`, `:var"col"` or `Symbol("col")`) or strings (written as `"col"`).
Contributor

It seems worth documenting on this page whether string interpolation works in this syntax:

    x = "ol"
    df."c$x"

Either that it works, or that it doesn't.

Member Author

It does not work. I will add a note in the PR that follows your comments.
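
For context, a minimal sketch of an alternative that relies only on ordinary indexing (assuming a DataFrames.jl release that includes this change; the data frame and variable names are illustrative only): the name is built as a regular string first and then passed to `df[!, ...]`.

```julia
using DataFrames

# Hypothetical example data; not taken from the PR.
df = DataFrame(col = 1:3)

x = "ol"
name = "c$x"          # ordinary string interpolation, evaluated before indexing
column = df[!, name]  # same column vector as `df.col`, accessed without copying
```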

Columns can also be accessed using an integer index specifying their position.

Since `df[!, :col]` does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original `df`. To get a copy of the column use `df[:, :col]`: changing the vector returned by this syntax does not change `df`.

@@ -58,6 +59,13 @@ julia> df.A
3
4

julia> df."A"
4-element Array{Int64,1}:
1
2
3
4

julia> df.A === df[!, :A]
true

@@ -67,6 +75,15 @@ false
julia> df.A == df[:, :A]
true

julia> df.A === df[!, "A"]
true

julia> df.A === df[:, "A"]
false

julia> df.A == df[:, "A"]
true

julia> df.A === df[!, 1]
true

@@ -89,15 +106,30 @@ julia> df[:, firstcolumn] == df.A
true
```

Column names can be obtained using the `names` function:
Column names can be obtained as strings using the `names` function:

```jldoctest dataframe
julia> names(df)
2-element Array{String,1}:
"A"
"B"
```

To get column names as `Symbol`s use the `propertynames` function:
```
julia> propertynames(df)
2-element Array{Symbol,1}:
:A
:B
```

!!! note

DataFrames.jl allows using `Symbol`s (like `:A`) and strings (like `"A"`)
for all column indexing operations for convenience.
However, using `Symbol`s is slightly faster and should generally be preferred.
Contributor

Suggested change
However, using `Symbol`s is slightly faster and should generally be preferred.
However, using `Symbol`s is slightly faster and should generally be preferred, if not generating them via string manipulation.

Member Author

added



### Constructing Column by Column

It is also possible to start with an empty `DataFrame` and add columns to it one by one: