[BREAKING] Add column indexing using strings #2199

bkamins · 2020-04-17T07:29:19Z

Fixes #1926.

UPDATED

The list of changes in this PR is:

User visible:

allow using an AbstractString everywhere where Symbol was accepted as a column indicator (i.e. this extends beyond just getindex and getproperty, every function was reviewed and made consistent)
docstrings were updated to reflect this change + some unrelated docs fixes were added where spotted
names function consistently returns vector of strings for all types that return column names
propertynames function consistently returns vector of Symbols for all types that return column names
keys function consistently returns vector of Symbols for all types that return column names as keys
depend on DataAPI.jl v1.2
rename! passes String to a function if API with function is used
consistently define hasproperty for types that have custom getproperty
add cols to names function for types where it makes sense consistently
push! accepts dicts with string keys
indexing with an arbitrary eltype vector now throws better error messages (an informative `ArgumentError` is thrown in more cases rather than a `MethodError` as in the past)

Internal:

add definitions of SymbolOrString, ColumnIndex (updated), and MultiColumnIndex unions and use them consistently across whole codebase
remove keys from AbstractIndex as it was never used
correct formatting of code in many places (mainly docstrings, function signatures, and adding missing return statements in long functions as many of them were updated anyway, but also in unrelated places)
fix a rule that _names returns vector of Symbols without copying
in some places switch from names to _names internally to avoid unnecessary allocations
in some places use groupcols where this function is the appropriate one to call; similarly with names(df, cols) where it made sense
remove without function that was not needed anymore
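
As a rough illustration of the user-visible items above, the following sketch assumes a DataFrames.jl release that includes this PR. The data frame and its column names `A` and `B` are made up for the example.

```julia
using DataFrames

# Toy data frame; the column names are arbitrary.
df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# Strings and Symbols select the same column, without copying.
@assert df[!, "A"] === df[!, :A]
@assert df."A" === df.A

# names returns strings, propertynames returns Symbols.
@assert names(df) == ["A", "B"]
@assert propertynames(df) == [:A, :B]

# names accepts a column selector as the second argument.
@assert names(df, r"^B") == ["B"]

# push! accepts a dictionary with string keys.
push!(df, Dict("A" => 4, "B" => "w"))
@assert nrow(df) == 4

# Other functions accept strings as column indicators too.
@assert names(select(df, "B")) == ["B"]
@assert names(rename(df, "A" => "a")) == ["a", "B"]
```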

bkamins · 2020-04-17T10:51:27Z

OK - if anyone is interested - currently I have updated the basic tests and documentation. The PR should be good to have a first look at and comment.

Now I will be working on testing corner cases of the new functionality.

bkamins · 2020-04-17T11:52:05Z

Lessons (re)learned so far:

I already do not remember what works on 1.0 and we want to keep backward compatibility (the language is moving forward fast)
method ambiguities with functions from Base that have very loose signatures are a pain
I expect that some bugs introduced will be caught only after the release and testing in practice (the PR really touches everything we have in DataFrames.jl)

bkamins · 2020-04-17T15:46:49Z

Between from DataAPI.jl does not allow string indexing. I am extending it in DataFrames.jl which is a slight type piracy.

quinnj · 2020-04-17T23:30:28Z

> Between from DataAPI.jl does not allow string indexing. I am extending it in DataFrames.jl which is a slight type piracy.

Happy to add it to DataAPI.jl quickly if we want.

bkamins · 2020-04-18T05:21:32Z

> Happy to add it to DataAPI.jl quickly if we want.

Thank you.

@nalimilan - what do you think? The definition is:

DataAPI.Between(x::AbstractString, y::AbstractString) = Between(Symbol(x), Symbol(y))
DataAPI.Between(x::Union{Int, Symbol}, y::AbstractString) = Between(x, Symbol(y))
DataAPI.Between(x::AbstractString, y::Union{Int, Symbol}) = Between(Symbol(x), y)

(in DataAPI.jl it would be simpler)

So the idea is to allow strings as input but internally store them as Symbols, so other packages that require Symbol for column indexing will still work.
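
A minimal, self-contained sketch of this normalization idea, using a hypothetical `MyBetween` struct as a stand-in so that no method is added to a type owned by another package. The struct name and its layout are assumptions made for illustration, not the DataAPI.jl definition.

```julia
# Hypothetical stand-in for a Between-like selector; not the DataAPI.jl type.
struct MyBetween{T1 <: Union{Int, Symbol}, T2 <: Union{Int, Symbol}}
    first::T1
    last::T2
end

# Outer constructors accept strings but store Symbols.
MyBetween(x::AbstractString, y::AbstractString) = MyBetween(Symbol(x), Symbol(y))
MyBetween(x::Union{Int, Symbol}, y::AbstractString) = MyBetween(x, Symbol(y))
MyBetween(x::AbstractString, y::Union{Int, Symbol}) = MyBetween(Symbol(x), y)

b1 = MyBetween("a", "c")   # both endpoints given as strings
b2 = MyBetween(1, "c")     # integer position and string name
b3 = MyBetween("a", :c)    # string and Symbol mixed

@assert b1.first === :a && b1.last === :c
@assert b2.first === 1 && b2.last === :c
@assert b3.first === :a && b3.last === :c
```

Because the stored fields are always `Int` or `Symbol`, code that only understands Symbol-based selectors never sees a string.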

bkamins · 2020-04-18T08:08:08Z

The PR is ready for reviews / beta testing.

bkamins · 2020-04-18T10:14:34Z

If JuliaData/DataAPI.jl#17 is merged then this PR should be updated to remove Between cases for AbstractString.

nalimilan

Thanks, looks mostly good! Apart from minor things, I have noted a few design decisions regarding what type name accessors should return.

How confident are you that tests cover (almost) everything? Does coverage confirm that all new methods are tested? It's too bad so many tests have to be duplicated, but that's the only solution (we could almost imagine using some Cassette.jl tricks to automatically call each function that appears in a list with its symbol arguments replaced with strings and check it gives the same result...).

docs/src/man/getting_started.md

src/abstractdataframe/abstractdataframe.jl

test/constructors.jl

test/data.jl

test/deprecated.jl

test/indexing.jl

test/select.jl

bkamins · 2020-04-22T10:44:03Z

OK - I am done with round 2 of documentation fixes :)

test/constructors.jl

src/abstractdataframe/abstractdataframe.jl

src/abstractdataframe/reshape.jl

src/dataframe/dataframe.jl

src/groupeddataframe/splitapplycombine.jl

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

bkamins · 2020-04-22T17:44:42Z

Following your fixes I have reformatted some code to also respect the customary line width.

This prompted me to make one change (only internal - the API is unchanged). I defined MultiColumnIndex union and use it consistently. It is much cleaner this way I think (as otherwise it is a pain to check if all signatures are correct). Sorry for making this change so late, but I think we should have such an abstraction that complements ColumnIndex.

Finally - it will make implementation of JuliaData/DataAPI.jl#16 a minor PR (we will have to change a few lines of code to incorporate it).
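
To illustrate why a named union keeps signatures manageable, here is a sketch with simplified aliases. The names ending in `Sketch` and the function `selector_kind` are hypothetical; the unions actually defined in DataFrames.jl cover more selector types than shown here.

```julia
# Illustrative aliases only; not the definitions used in DataFrames.jl.
const SymbolOrStringSketch = Union{Symbol, AbstractString}
const ColumnIndexSketch = Union{Integer, SymbolOrStringSketch}
const MultiColumnIndexSketch = Union{AbstractVector, Regex, Colon}

# Two short methods cover every selector type listed in the unions.
selector_kind(::ColumnIndexSketch) = "single column"
selector_kind(::MultiColumnIndexSketch) = "multiple columns"

@assert selector_kind(:a) == "single column"
@assert selector_kind("a") == "single column"
@assert selector_kind(2) == "single column"
@assert selector_kind([:a, :b]) == "multiple columns"
@assert selector_kind(r"^x") == "multiple columns"
@assert selector_kind(:) == "multiple columns"
```

Adding a new selector type then means changing one alias rather than every method signature that accepts selectors.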

src/abstractdataframe/join.jl

src/other/index.jl

src/abstractdataframe/sort.jl

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

pdeffebach

Thank you for this monumental effort!

After playing around with this, I must say I prefer using :x etc. when working interactively. Remembering to close quotation marks and press the right arrow is kind of annoying. Even with OhMyREPL.

But it's going to be great for new users to not have to worry about Symbols and it will be easier to work programmatically with things.

I just hope we don't overwhelm users with too many new methods.

docs/src/man/getting_started.md

src/abstractdataframe/abstractdataframe.jl

src/dataframerow/dataframerow.jl

docs/src/man/getting_started.md

oxinabox · 2020-04-24T20:32:24Z

Can a list of all the other things this PR changes be added to the top post?
Seems like it also reflows a bunch of docstrings?
And I see some uses of size have been changed to nrow

Co-Authored-By: pdeffebach <[email protected]>

bkamins · 2020-04-24T22:20:55Z

@oxinabox - the problem was that there was never a good time to clean up the code, because such clean-up changes later force rebasing of all open PRs. This PR touched almost all files in the code base anyway, and we have stalled merging of other PRs until this one is finished to make finishing it easier. I also had to read every line of the code, so when I spotted something I just corrected it.

E.g. I have changed size to nrow (I think it was only one place) - where size was mixed with ncol in the same expression. The change was from:

"<p>$(digitsep(size(df, 1))) rows × $(digitsep(ncol(df))) columns$omitmsg</p>"

to

"<p>$(digitsep(nrow(df))) rows × $(digitsep(ncol(df))) columns$omitmsg</p>"

So the list of changes in this PR is:

User visible:

allow using an AbstractString everywhere where Symbol was accepted as a column indicator (i.e. this extends beyond just getindex and getproperty, every function was reviewed and made consistent)
docstrings were updated to reflect this change + some unrelated docs fixes were added where spotted
names function consistently returns vector of strings for all types that return column names
propertynames function consistently returns vector of Symbols for all types that return column names
keys function consistently returns vector of Symbols for all types that return column names as keys
depend on DataAPI.jl v1.2
rename! passes String to a function if API with function is used
consistently define hasproperty for types that have custom getproperty
add cols to names function for types where it makes sense consistently
push! accepts dicts with string keys
indexing with an arbitrary eltype vector now throws better error messages (an informative `ArgumentError` is thrown in more cases rather than a `MethodError` as in the past)

Internal:

add definitions of SymbolOrString, ColumnIndex (updated), and MultiColumnIndex unions and use them consistently across whole codebase
remove keys from AbstractIndex as it was never used
correct formatting of code in many places (mainly docstrings, function signatures, and adding missing return statements in long functions as many of them were updated anyway, but also in unrelated places)
fix a rule that _names returns vector of Symbols without copying
in some places switch from names to _names internally to avoid unnecessary allocations
in some places use groupcols where this function is the appropriate one to call; similarly with names(df, cols) where it made sense
remove without function that was not needed anymore

(I will make this list in the top post also)

nalimilan

I'm not really convinced SymbolOrString is better than Union{Symbol, AbstractString}, but otherwise looks good.

bkamins · 2020-04-27T09:57:36Z

> I'm not really convinced

I am largely indifferent here, but since in Base we have VecOrMat I think it is OK, and @pdeffebach found it more readable (and for sure it is shorter, so we do not have to break lines so often in the code).

bkamins · 2020-04-27T10:30:13Z

Thank you all for working on this huge PR!

oxinabox

Stopping reviewing now since already merged.
Still it looks all good anyway for the 2/3rd I did finish before it was merged

oxinabox · 2020-04-27T10:09:35Z

docs/src/lib/indexing.md

+| `DataFrameRows`     | `Vector{String}` | `Vector{Symbol}` | vector of `Int`  |
+| `DataFrameColumns`  | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
+| `GroupedDataFrame`  | `Vector{String}` | tuple of fields  | `GroupKeys`      |
+| `GroupKeys`         | undefined        | tuple of fields  | vector of `Int`  |


Suggested change

| `GroupKeys` | undefined | tuple of fields | vector of `Int` |

| `GroupKeys` | undefined | tuple of fields | `AbstractVector{Int}` |

same here - it is LinearIndices{1,Tuple{Base.OneTo{Int}}}

oxinabox · 2020-04-27T10:10:09Z

docs/src/lib/indexing.md

+|---------------------|------------------|------------------|------------------|
+| `AbstractDataFrame` | `Vector{String}` | `Vector{Symbol}` | undefined        |
+| `DataFrameRow`      | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
+| `DataFrameRows`     | `Vector{String}` | `Vector{Symbol}` | vector of `Int`  |


Suggested change

| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |

| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | `Vector{Int}` |

Actually it is LinearIndices{1,Tuple{Base.OneTo{Int}}} but I wanted to avoid writing this as it would be confusing I think.

oxinabox · 2020-04-27T10:12:00Z

docs/src/man/getting_started.md

@@ -45,7 +45,8 @@ julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

 ```

-Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name. Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position.
+Columns can be directly (i.e. without copying) accessed via `df.col`, `df."col"`, `df[!, :col]` or `df[!, "col"]`. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as `:col`, `:var"col"` or `Symbol("col")`) or strings (written as `"col"`).


It seems this page would be worth documenting if

`x = "ol"; df."c$x"`

Either that it works, or it doesn't

It does not work. I will add a note in the PR that follows your comments.
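
For a column name built at run time, the indexing and `getproperty` forms accept any string value, including an interpolated one. A small sketch, assuming a DataFrames.jl release that includes this PR and a made-up column named `col`:

```julia
using DataFrames

df = DataFrame(col = 1:3)
x = "ol"

# The name is computed first, then passed as an ordinary string.
name = "c$x"
@assert name == "col"

# Both forms return the stored column without copying.
@assert df[!, name] === df.col
@assert getproperty(df, name) === df.col
```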

oxinabox · 2020-04-27T10:14:08Z

docs/src/man/getting_started.md

+
+    DataFrames.jl allows to use `Symbol`s (like `:A`) and strings (like `"A"`)
+    for all column indexing operations for convenience.
+    However, using `Symbol`s is slightly faster and should generally be preferred.


Suggested change

However, using `Symbol`s is slightly faster and should generally be preferred.

However, using `Symbol`s is slightly faster and should generally be preferred, if not generating them via string manipulation.

oxinabox · 2020-04-27T10:34:20Z

src/abstractdataframe/abstractdataframe.jl

        throw(ArgumentError("Duplicate names not allowed. Duplicated value(s) are: " *
                            ":$(join(duplicate_names, ", "))"))
    end

    # Put the summary stats into the return data frame
    data = DataFrame()
-    data.variable = names(df)
+    data.variable = copy(_names(df))


this is the same code as propertynames(df)

good point - fixed.

oxinabox · 2020-04-27T10:56:10Z

src/abstractdataframe/abstractdataframe.jl

+          cols::Union{Symbol, AbstractVector{Symbol},
+                      AbstractVector{<:AbstractString}}=:setequal) =


Union{Symbol, AbstractVector{Symbol}, AbstractVector{<:AbstractString}}
matches the same patterns as:
Union{Symbol, AbstractVector{<:SymbolOrString},}
and I think the latter is shorter and easier to read.
I think it would be good to replace all instances of the former with the latter.

This will allow mixing strings and Symbols. I leave it out for now as I want to hear what @nalimilan thinks about it (I will make a note in the PR to consider this).
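
The difference between the two signatures can be checked directly with `isa`. This sketch uses only Base; the alias names `StrictCols` and `LooseCols` are invented for the comparison.

```julia
const SymbolOrString = Union{Symbol, AbstractString}

# The signature currently used in the PR.
const StrictCols = Union{Symbol, AbstractVector{Symbol},
                         AbstractVector{<:AbstractString}}
# The shorter signature proposed in the review.
const LooseCols = Union{Symbol, AbstractVector{<:SymbolOrString}}

# Homogeneous vectors match both signatures.
@assert [:a, :b] isa StrictCols && [:a, :b] isa LooseCols
@assert ["a", "b"] isa StrictCols && ["a", "b"] isa LooseCols

# A vector typed to hold both Symbols and Strings matches only the shorter one.
# (An untyped literal [:a, "b"] would be a Vector{Any} and match neither.)
mixed = Union{Symbol, String}[:a, "b"]
@assert mixed isa LooseCols
@assert !(mixed isa StrictCols)
```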

oxinabox · 2020-04-27T11:32:53Z

src/dataframerow/dataframerow.jl

+            if all(x -> x isa AbstractString, keys(v))
+                v = (;(Symbol.(keys(v)) .=> values(v))...)
+            end
            for n in view(_names(df), idxs)
                if !haskey(v, n)
                    throw(ArgumentError("Column :$n not found in source dictionary"))
                end
            end
        elseif !all(((a, b),) -> a == b, zip(view(_names(df), idxs), keys(v)))
            mismatched = findall(view(_names(df), idxs) .!= collect(keys(v)))
-            throw(ArgumentError("Selected column names do not match the names in assigned value in" *
-                                " positions $(join(mismatched, ", ", " and "))"))
+            throw(ArgumentError("Selected column names do not match the names in assigned " *
+                                "value in positions $(join(mismatched, ", ", " and "))"))
        end

        for (col, val) in pairs(v)


An alternative way to do this would be to declare a local variable
v_keys which promises to be a collection of Symbol.
e.g.

v_keys = keytype(v) === Symbol ? keys(v) : Symbol.(keys(v))

then each time keys(v) is used in this function it can be replaced with v_keys
The values are never used until the end

This would save having to convert the values as well,
and line 128 works already with symbols or strings

this is related to the comment above - if we allow mixing strings with Symbols.
Currently we do not allow mixing. When we decide on this I will rewrite it anyway.
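
A Base-only sketch of the `v_keys` idea for dictionary inputs. The helpers `symbol_keys` and `missing_columns` are hypothetical names introduced for illustration and are not part of DataFrames.jl. Only the keys are converted; the values are left untouched.

```julia
# Hypothetical helper: return the keys of a dictionary as a Vector{Symbol}.
function symbol_keys(v::AbstractDict)
    ks = collect(keys(v))
    return keytype(v) === Symbol ? ks : Symbol.(ks)
end

# Hypothetical helper: report the column names that the dictionary does not provide.
function missing_columns(colnames::Vector{Symbol}, v::AbstractDict)
    v_keys = Set(symbol_keys(v))
    return [n for n in colnames if !(n in v_keys)]
end

@assert Set(symbol_keys(Dict("a" => 1, "b" => 2))) == Set([:a, :b])
@assert Set(symbol_keys(Dict(:a => 1, :b => 2))) == Set([:a, :b])
@assert missing_columns([:a, :c], Dict("a" => 1, "b" => 2)) == [:c]
@assert isempty(missing_columns([:a, :b], Dict(:a => 1, :b => 2)))
```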

bkamins · 2020-04-27T11:52:41Z

Thank you for the comments. I will make a separate PR taking them into account.

bkamins added 7 commits April 16, 2020 17:27

change rename to pass string to columns

60e72bf

allow string indexing (minimal test and docs updates)

3a44c8f

Merge branch 'master' into add_string_col_indexing

ae2691a

make keys consistent

cc9bbca

fixed old tests

73a5a1a

update manual

64741b2

correct location of deleterows! deprecated tests

e1e04bb

fix for Julia 1.0

b390558

bkamins added the breaking label Apr 17, 2020

bkamins added this to the 1.0 milestone Apr 17, 2020

only abstractdataframe.jl and dataframe.jl left to test

ddb84f2

bkamins mentioned this pull request Apr 17, 2020

dataframerow related docstrings #2196

Merged

bkamins added 2 commits April 18, 2020 00:16

only dataframe.jl left

981e8bf

remove internal keys test

5de0354

ready for review

4f8b2dc

bkamins marked this pull request as ready for review April 18, 2020 08:07

bkamins mentioned this pull request Apr 18, 2020

Handling of strings for column indexing #1926

Closed

This was referenced Apr 18, 2020

allow passing AbstractString to Between JuliaData/DataAPI.jl#17

Merged

add Not to delete! #2191

Merged

[BREAKING] Circular ref bug show #2200

Merged

sync with DataAPI v1.2

38ad85c

nalimilan reviewed Apr 21, 2020


nalimilan changed the title ~~[BREAKING] Add column intexing using strings~~ [BREAKING] Add column indexing using strings Apr 21, 2020

fix cols documentation

26b1980

update constructor tests

3678130

nalimilan reviewed Apr 22, 2020


bkamins and others added 2 commits April 22, 2020 18:03

Apply suggestions from code review

0bb2a99

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

introduce MultiColumnIndex and fix overlong lines

30a8f70

add comment why AbstractVector is excluded when handling cols in sort

9a3cee0

bkamins mentioned this pull request Apr 23, 2020

Tag a new release to reflect [compat] CategoricalArrays = "0.8" update #2204

Closed

nalimilan reviewed Apr 23, 2020


bkamins and others added 2 commits April 23, 2020 17:59

Apply suggestions from code review

cf879d3

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

updates after code review

9cc3702

pdeffebach approved these changes Apr 24, 2020


bkamins and others added 2 commits April 24, 2020 23:24

Update docs/src/man/getting_started.md

2a935a6

Co-Authored-By: pdeffebach <[email protected]>

define SymbolOrString

e40461e

bkamins mentioned this pull request Apr 25, 2020

Cleaner syntax #2206

Closed

nalimilan approved these changes Apr 27, 2020


bkamins merged commit b1f675d into JuliaData:master Apr 27, 2020

bkamins deleted the add_string_col_indexing branch April 27, 2020 10:29

oxinabox mentioned this pull request Apr 27, 2020

Make ByRow subtype Function #2212

Merged

oxinabox reviewed Apr 27, 2020


bkamins mentioned this pull request Apr 27, 2020

cleanup after string PR #2213

Merged

MarcMush mentioned this pull request May 15, 2020

patch for gz pack under windows using 7zip adrien-le-franc/EMSx.jl#7

Merged

tomyun added a commit to cropbox/Cropbox.jl that referenced this pull request Jun 11, 2020

Adapt to DataFrames index change (http://JuliaData/DataFrames.jl#2199)

1bf0a2a

tomyun added a commit to cropbox/Cropbox.jl that referenced this pull request Jun 11, 2020

Adapt to DataFrames index change (JuliaData/DataFrames.jl#2199)

4b62921
