Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BREAKING] Add column indexing using strings #2199

Merged
merged 28 commits into from
Apr 27, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Apr 17, 2020

Fixes #1926.

UPDATED

The list of changes in this PR is:

User visible:

  • allow using an AbstractString everywhere where Symbol was accepted as a column indicator (i.e. this extends beyond just getindex and getproperty, every function was reviewed and made consistent)
  • docstrings were updated to reflect this change + some unrelated docs fixes were added where spotted
  • names function consistently returns vector of strings for all types that return column names
  • propertynames function consistently returns vector of Symbols for all types that return column names
  • keys function consistently returns vector of Symbols for all types that return column names as keys
  • depend on DataAPI.jl v1.2
  • rename! passes String to a function if API with function is used
  • consistently define hasproperty for types that have custom getproperty
  • add cols to names function for types where it makes sense consistently
  • push! accepts dicts with string keys
  • indexing with an arbitrary eltype vector now throws better error messages (informtive ArgumentErroris thrown in more cases rather thanMethodError` in the past)

Internal:

  • add definitions of SymbolOrString, ColumnIndex (updated), and MultiColumnIndex unions and use them consistently across whole codebase
  • remove keys from AbstractIndex as it was never used
  • correct formatting of code in many places (mainly docstrings, function signatures, and adding missing return statements in long functions as many of them were updated anyway, but also in unrelated places)
  • fix a rule that _names returns vector of Symbols without copying
  • in some places switch from names to _names internally to avoid unnecessary allocations
  • in some places use groupcols where this function would be an appropriate to call; similarly with names(df, cols) where it made sense
  • remove without function that was not needed anymore

@bkamins
Copy link
Member Author

bkamins commented Apr 17, 2020

OK - if anyone is interested - currently I have updated the basic tests and documentation. The PR should be good to have a first look at and comment.

Now I will be working on testing corner cases of the new functionality.

@bkamins
Copy link
Member Author

bkamins commented Apr 17, 2020

Lessons (re)learned so far:

  • I already do not remember what works on 1.0 and we want to keep backward compatibility (the language is moving forward fast)
  • method ambiguities with functions from Base that have very loose signatures is a pain
  • I expect that some bugs introduced will be caught only after the release and testing in practice (the PR really touches everything we have in DataFrames.jl)

@bkamins bkamins added the breaking The proposed change is breaking. label Apr 17, 2020
@bkamins bkamins added this to the 1.0 milestone Apr 17, 2020
@bkamins
Copy link
Member Author

bkamins commented Apr 17, 2020

Between from DataAPI.jl does not allow string indexing. I am extending it in DataFrames.jl which is a slight type piracy.

@quinnj
Copy link
Member

quinnj commented Apr 17, 2020

Between from DataAPI.jl does not allow string indexing. I am extending it in DataFrames.jl which is a slight type piracy.

Happy to add it to DataAPI.jl quickly if we want.

@bkamins
Copy link
Member Author

bkamins commented Apr 18, 2020

Happy to add it to DataAPI.jl quickly if we want.

Thank you.

@nalimilan - what do you think? The definition is:

DataAPI.Between(x::AbstractString, y::AbstractString) = Between(Symbol(x), Symbol(y))
DataAPI.Between(x::Union{Int, Symbol}, y::AbstractString) = Between(x, Symbol(y))
DataAPI.Between(x::AbstractString, y::Union{Int, Symbol}) = Between(Symbol(x), y)

(in DataAPI.jl it would be simpler)

so the idea is is to allow string as input but internally we store it as Symbol, so other packages that require Symbol for column indexing will work.

@bkamins bkamins marked this pull request as ready for review April 18, 2020 08:07
@bkamins
Copy link
Member Author

bkamins commented Apr 18, 2020

The PR is ready for reviews / beta testing.

@bkamins
Copy link
Member Author

bkamins commented Apr 18, 2020

If JuliaData/DataAPI.jl#17 is merged then this PR should be updated to remove Between cases for AbstractString.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks mostly good! Apart from minor things, I have noted a few design decisions regarding what type name accessors should return.

How confident are you that tests cover (almost) everything? Does coverage confirm that all new methods are tested? It's too bad so many tests have to be duplicated, but that's the only solution (we could almost imagine using some Cassette.jl tricks to automatically call each function that appears in a list with its symbol arguments replaced with strings and check it gives the same result...).

@nalimilan nalimilan changed the title [BREAKING] Add column intexing using strings [BREAKING] Add column indexing using strings Apr 21, 2020
@bkamins
Copy link
Member Author

bkamins commented Apr 22, 2020

OK - I am done with round 2 of documentation fixes :)

@bkamins
Copy link
Member Author

bkamins commented Apr 22, 2020

Following your fixes I have reformatted some code to also fix the customary line witdth.

This prompted me to make one change (only internal - the API is unchanged). I defined MultiColumnIndex union and use it consistently. It is much cleaner this way I think (as otherwise it is a pain to check if all signatures are correct). Sorry for making this change so late, but I think we should have such an abstraction that complements ColumnIndex.

Finally - it will make implementation of JuliaData/DataAPI.jl#16 a minor PR (we will have to change a few lines of code to incorporate it).

Copy link
Contributor

@pdeffebach pdeffebach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this monumental effort!

After playing around with this, I must say i prefer using :x etc. when working interactively. Remembering to close quotation marks and press the right arrow is kind of annoying. Even with OhMyREPL.

But it's going to be great for new users to not have to worry about Symbols and it will be easier to work programatically with things.

I just hope we don't overwhelm users with too many new methods.

@oxinabox
Copy link
Contributor

Can a list of all the other things this PR changes be added to the top post?
Seems like it also reflows a bunch of docstrongs?
And I see some uses of size have been changed to nrows

@bkamins
Copy link
Member Author

bkamins commented Apr 24, 2020

@oxinabox - the problem we had was that there was never a good time to clean up the code and this PR anyway touched almost all files in the code-base (otherwise there was always a problem that such clean-up changes later force rebasing of all PRs, and with this PR we have stalled merging of other PRs anyway till this one is finished to make finishing it easier + I had to read every line in the code so when I spotted something I just corrected it).

E.g. I have changed size to nrow (I think it was only one place) - where size was mixed with ncol in the same expression. The change was from:

"<p>$(digitsep(size(df, 1))) rows × $(digitsep(ncol(df))) columns$omitmsg</p>"

to

"<p>$(digitsep(nrow(df))) rows × $(digitsep(ncol(df))) columns$omitmsg</p>"

So the list of changes in this PR is:

User visible:

  • allow using an AbstractString everywhere where Symbol was accepted as a column indicator (i.e. this extends beyond just getindex and getproperty, every function was reviewed and made consistent)
  • docstrings were updated to reflect this change + some unrelated docs fixes were added where spotted
  • names function consistently returns vector of strings for all types that return column names
  • propertynames function consistently returns vector of Symbols for all types that return column names
  • keys function consistently returns vector of Symbols for all types that return column names as keys
  • depend on DataAPI.jl v1.2
  • rename! passes String to a function if API with function is used
  • consistently define hasproperty for types that have custom getproperty
  • add cols to names function for types where it makes sense consistently
  • push! accepts dicts with string keys
  • indexing with an arbitrary eltype vector now throws better error messages (informtive ArgumentErroris thrown in more cases rather thanMethodError` in the past)

Internal:

  • add definitions of SymbolOrString, ColumnIndex (updated), and MultiColumnIndex unions and use them consistently across whole codebase
  • remove keys from AbstractIndex as it was never used
  • correct formatting of code in many places (mainly docstrings, function signatures, and adding missing return statements in long functions as many of them were updated anyway, but also in unrelated places)
  • fix a rule that _names returns vector of Symbols without copying
  • in some places switch from names to _names internally to avoid unnecessary allocations
  • in some places use groupcols where this function would be an appropriate to call; similarly with names(df, cols) where it made sense
  • remove without function that was not needed anymore

(I will make this list in the top post also)

@bkamins bkamins mentioned this pull request Apr 25, 2020
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really convinced SymbolOrString is better than Union{Symbol, AbstractString}, but otherwise looks good.

@bkamins
Copy link
Member Author

bkamins commented Apr 27, 2020

I'm not really convinced

I am largely indifferent here but since in base we have VecOrMat I think it is OK, and @pdeffebach found it more readable (and for sure it is shorter so we do not have to break lines so oftern in the code).

@bkamins bkamins merged commit b1f675d into JuliaData:master Apr 27, 2020
@bkamins bkamins deleted the add_string_col_indexing branch April 27, 2020 10:29
@bkamins
Copy link
Member Author

bkamins commented Apr 27, 2020

Thank you all for working on this huge PR!

Copy link
Contributor

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopping reviewing now since already merged.
Still it looks all good anyway for the 2/3rd I did finish before it was merged

| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
| `DataFrameColumns` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `GroupedDataFrame` | `Vector{String}` | tuple of fields | `GroupKeys` |
| `GroupKeys` | undefined | tuple of fields | vector of `Int` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `GroupKeys` | undefined | tuple of fields | vector of `Int` |
| `GroupKeys` | undefined | tuple of fields | `AbsractVector{Int}` |

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - it is LinearIndices{1,Tuple{Base.OneTo{Int}}}

|---------------------|------------------|------------------|------------------|
| `AbstractDataFrame` | `Vector{String}` | `Vector{Symbol}` | undefined |
| `DataFrameRow` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | `Vector{Int}` |

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it is LinearIndices{1,Tuple{Base.OneTo{Int}}} but I wanted to avoid writing this as it would be confusing I think.

@@ -45,7 +45,8 @@ julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

```

Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name. Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position.
Columns can be directly (i.e. without copying) accessed via `df.col`, `df."col"`, `df[!, :col]` or `df[!, "col"]`. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as `:col`, `:var"col"` or `Symbol("col")`) or strings (written as `"col"`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this page would be worth documenting if

x = "ol"
df."c$x"`

Either that it works, or it doesn't

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not work. I will add a note in the PR that follows your comments.


DataFrames.jl allows to use `Symbol`s (like `:A`) and strings (like `"A"`)
for all column indexing operations for convenience.
However, using `Symbol`s is slightly faster and should generally be preferred.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
However, using `Symbol`s is slightly faster and should generally be preferred.
However, using `Symbol`s is slightly faster and should generally be preferred, if not generating them via string manipulation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

throw(ArgumentError("Duplicate names not allowed. Duplicated value(s) are: " *
":$(join(duplicate_names, ", "))"))
end

# Put the summary stats into the return data frame
data = DataFrame()
data.variable = names(df)
data.variable = copy(_names(df))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the same code as propertynames(df)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point - fixed.

Comment on lines +1371 to +1372
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union{Symbol, AbstractVector{Symbol}, AbstractVector{<:AbstractString}}
matches the same patterns as:
Union{Symbol, AbstractVector{<:SymbolOrString},}
and I think the latter is shorter and easier to read.
I think it would be good to replace all instances of the former with the latter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will allow mixing strings and Symbols. I leave it out for now as I want to hear what @nalimilan thinks about it (I will make a note in the PR to consider this).

Comment on lines +114 to 128
if all(x -> x isa AbstractString, keys(v))
v = (;(Symbol.(keys(v)) .=> values(v))...)
end
for n in view(_names(df), idxs)
if !haskey(v, n)
throw(ArgumentError("Column :$n not found in source dictionary"))
end
end
elseif !all(((a, b),) -> a == b, zip(view(_names(df), idxs), keys(v)))
mismatched = findall(view(_names(df), idxs) .!= collect(keys(v)))
throw(ArgumentError("Selected column names do not match the names in assigned value in" *
" positions $(join(mismatched, ", ", " and "))"))
throw(ArgumentError("Selected column names do not match the names in assigned " *
"value in positions $(join(mismatched, ", ", " and "))"))
end

for (col, val) in pairs(v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alterantive way to do this would be to declare a local variable
v_keys which promises to be a collection of Symbol.
e.g.

v_keys = keys(v)
v_keys = keytype(v) === Symbol ? keys(v) : Symbol.(v)

then each time keys(v) is used in this function it can be replaced with v_keys
The values are never used until the end

This would save having to convert the values as well,
and line 128 works already with symbols or strings

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is related to the comment above - if we allow mixing strings with Symbols.
Currently we do not allow mixing. When we decide on this I will rewrite it anyway.

@bkamins
Copy link
Member Author

bkamins commented Apr 27, 2020

Thank you for the comments. I will make a separate PR taking them into account.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handling of strings for column indexing
5 participants