[BREAKING] Add column indexing using strings

User visible changes:
* allow using an `AbstractString` everywhere where `Symbol` was accepted as a column indicator (i.e. this extends beyond just `getindex` and `getproperty`; every function was reviewed and made consistent)
* docstrings were updated to reflect this change + some unrelated docs fixes were added where spotted
* `names` function consistently returns vector of strings for all types that return column names
* `propertynames` function consistently returns vector of `Symbol`s for all types that return column names
* `keys` function consistently returns vector of `Symbol`s for all types that return column names as keys
* depend on DataAPI.jl v1.2
* `rename!` passes `String` to a function if API with function is used
* consistently define `hasproperty` for types that have custom `getproperty`
* add `cols` to `names` function for types where it makes sense consistently
* `push!` accepts dicts with string keys
* indexing with an arbitrary eltype vector now throws better error messages (informative `ArgumentError` is thrown in more cases rather than `MethodError` in the past)

Internal changes:
* add definitions of `SymbolOrString`, `ColumnIndex` (updated), and `MultiColumnIndex` unions and use them consistently across whole codebase
* remove `keys` from `AbstractIndex` as it was never used
* correct formatting of code in many places (mainly docstrings, function signatures, and adding missing `return` statements in long functions as many of them were updated anyway, but also in unrelated places)
* fix a rule that `_names` returns vector of `Symbol`s without copying
* in some places switch from `names` to `_names` internally to avoid unnecessary allocations
* in some places use `groupcols` where this function would be appropriate to call; similarly with `names(df, cols)` where it made sense
* remove `without` function that was not needed anymore

Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: pdeffebach <[email protected]>
JuliaData · Apr 27, 2020 · b1f675d
1 parent ccde40a
commit b1f675d
Showing 42 changed files with 2,709 additions and 1,069 deletions.
diff --git a/Project.toml b/Project.toml
@@ -35,7 +35,7 @@ test = ["DataStructures", "DataValues", "Dates", "Logging", "Random", "Test"]
 julia = "1"
 CategoricalArrays = "0.8"
 Compat = "2.2, 3"
-DataAPI = "1.0.1"
+DataAPI = "1.2"
 InvertedIndices = "1"
 IteratorInterfaceExtensions = "0.1.1, 1"
 Missings = "0.4.2"

diff --git a/docs/src/lib/indexing.md b/docs/src/lib/indexing.md
@@ -17,9 +17,11 @@ and broadcasting are intended to work with `DataFrame`, `SubDataFrame` and `Data
 The rules for a valid type of index into a column are the following:
 * a value, later denoted as `col`:
     * a `Symbol`;
+    * an `AbstractString`;
     * an `Integer` that is not `Bool`;
 * a vector, later denoted as `cols`:
     * a vector of `Symbol` (does not have to be a subtype of `AbstractVector{Symbol}`);
+    * a vector of `AbstractString` (does not have to be a subtype of `AbstractVector{<:AbstractString}`);
     * a vector of `Integer` other than `Bool` (does not have to be a subtype of `AbstractVector{<:Integer}`);
     * a vector of `Bool` that has to be a subtype of `AbstractVector{Bool}`;
     * a regular expression, which gets expanded to a vector of matching column names;
@@ -122,13 +124,14 @@ so it is unsafe to use it afterwards (the column length correctness will be pres
 * `df[CartesianIndex(row, col)] = v` -> the same as `df[row, col] = v`;
 * `df[row, cols] = v` -> set row `row` of columns `cols` in-place; the same as `dfr = df[row, cols]; dfr[:] = v`;
 * `df[rows, col] = v` -> set rows `rows` of column `col` in-place; `v` must be an `AbstractVector`;
-                         if `rows` is `:` and `col` is a `Symbol` that is not present in `df` then a new column
-                         in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier;
+                         if `rows` is `:` and `col` is a `Symbol` or `AbstractString`
+                         that is not present in `df` then a new column in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier;
 * `df[rows, cols] = v` -> set rows `rows` of columns `cols` in-place; `v` must be an `AbstractMatrix` or an `AbstractDataFrame`
                       (in this case column names must match);
 * `df[!, col] = v` -> replaces `col` with `v` without copying
                       (with the exception that if `v` is an `AbstractRange` it gets converted to a `Vector`);
-                      also if `col` is a `Symbol` that is not present in `df` then a new column in `df` is created and holds `v`;
+                      also if `col` is a `Symbol` or `AbstractString` that is not present in `df` then
+                      a new column in `df` is created and holds `v`;
                       equivalent to `df.col = v` if `col` is a valid identifier;
                       this is allowed if `ncol(df) == 0 || length(v) == nrow(df)`;
 * `df[!, cols] = v` -> replaces existing columns `cols` in data frame `df` with copying;
@@ -183,10 +186,10 @@ Additional rules:
 * in the `df[CartesianIndex(row, col)] .= v`, `df[row, col] .= v` syntaxes `v` is broadcasted into the contents of `df[row, col]` (this is consistent with Julia Base);
 * in the `df[row, cols] .= v` syntaxes the assignment to `df` is performed in-place;
 * in the `df[rows, col] .= v` and `df[rows, cols] .= v` syntaxes the assignment to `df` is performed in-place;
-  if `rows` is `:` and `col` is `Symbol` and it is missing from `df` then a new column is allocated and added;
+  if `rows` is `:` and `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added;
   the length of the column is always the value of `nrow(df)` before the assignment takes place;
 * in the `df[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector;
-  if `col` is `Symbol` and it is missing from `df` then a new column is allocated added;
+  if `col` is `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated added;
   the length of the column is always the value of `nrow(df)` before the assignment takes place;
 * the `df[!, cols] .= v` syntax replaces existing columns `cols` in data frame `df` with freshly allocated vectors;
 * `df.col .= v` syntax is allowed and performs in-place assignment to an existing vector `df.col`.
@@ -197,9 +200,8 @@ Additional rules:
 
 Note that `sdf[!, col] .= v` and `sdf[!, cols] .= v` syntaxes are not allowed as `sdf` can be only modified in-place.
 
-If column indexing using `Symbol` names in `cols` is performed, the order of columns in the operation is specified
-by the order of names.
-
+If column indexing using `Symbol` or `AbstractString` names in `cols` is performed, the order
+of columns in the operation is specified by the order of names.
 
 ## Indexing `GroupedDataFrame`s
 
@@ -230,3 +232,18 @@ The elements of a `GroupedDataFrame` are [`SubDataFrame`](@ref)s of its parent.
 * `gd[n::Not]` -> Any of the above types wrapped in `Not`. The result
    will be a new `GroupedDataFrame` containing all groups in `gd` *not* selected
    by the wrapped index.
+
+# Common API for types defined in DataFrames.jl
+
+This table presents return value types of calling `names`, `propertynames` and `keys`
+on types exposed to the user by DataFrames.jl:
+
+| Type                | `names`          | `propertynames`  | `keys`           |
+|---------------------|------------------|------------------|------------------|
+| `AbstractDataFrame` | `Vector{String}` | `Vector{Symbol}` | undefined        |
+| `DataFrameRow`      | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
+| `DataFrameRows`     | `Vector{String}` | `Vector{Symbol}` | vector of `Int`  |
+| `DataFrameColumns`  | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
+| `GroupedDataFrame`  | `Vector{String}` | tuple of fields  | `GroupKeys`      |
+| `GroupKeys`         | undefined        | tuple of fields  | vector of `Int`  |
+| `GroupKey`          | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
diff --git a/docs/src/lib/types.md b/docs/src/lib/types.md
@@ -109,6 +109,7 @@ without caution because:
 
 ```@docs
 AbstractDataFrame
+AsTable
 ByRow
 DataFrame
 DataFrameRow

diff --git a/docs/src/man/getting_started.md b/docs/src/man/getting_started.md
@@ -45,7 +45,8 @@ julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
 
 ```
 
-Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name. Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position.
+Columns can be directly (i.e. without copying) accessed via `df.col`, `df."col"`, `df[!, :col]` or `df[!, "col"]`. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as `:col`, `:var"col"` or `Symbol("col")`) or strings (written as `"col"`).
+Columns can also be accessed using an integer index specifying their position.
 
 Since `df[!, :col]` does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original `df`. To get a copy of the column use `df[:, :col]`: changing the vector returned by this syntax does not change `df`.
 
@@ -58,6 +59,13 @@ julia> df.A
  3
  4
 
+julia> df."A"
+4-element Array{Int64,1}:
+ 1
+ 2
+ 3
+ 4
+
 julia> df.A === df[!, :A]
 true
 
@@ -67,6 +75,15 @@ false
 julia> df.A == df[:, :A]
 true
 
+julia> df.A === df[!, "A"]
+true
+
+julia> df.A === df[:, "A"]
+false
+
+julia> df.A == df[:, "A"]
+true
+
 julia> df.A === df[!, 1]
 true
 
@@ -89,15 +106,30 @@ julia> df[:, firstcolumn] == df.A
 true
 ```
 
-Column names can be obtained using the `names` function:
+Column names can be obtained as strings using the `names` function:
 
 ```jldoctest dataframe
 julia> names(df)
+2-element Array{String,1}:
+ "A"
+ "B"
+ ```
+
+To get column names as `Symbol`s use the `propertynames` function:
+```
+julia> propertynames(df)
 2-element Array{Symbol,1}:
  :A
  :B
 ```
 
+!!! note
+
+    DataFrames.jl allows to use `Symbol`s (like `:A`) and strings (like `"A"`)
+    for all column indexing operations for convenience.
+    However, using `Symbol`s is slightly faster and should generally be preferred.
+
+
 ### Constructing Column by Column
 
 It is also possible to start with an empty `DataFrame` and add columns to it one by one: