[BREAKING] Add column indexing using strings #2199
OK - if anyone is interested - currently I have updated the basic tests and documentation. The PR should be good to have a first look at and comment. Now I will be working on testing corner cases of the new functionality.
Lessons (re)learned so far:
|
|
Happy to add it to DataAPI.jl quickly if we want.
Thank you. @nalimilan - what do you think? The definition is:
(in DataAPI.jl it would be simpler) so the idea is to allow string as input but internally we store it as
The PR is ready for reviews / beta testing.
If JuliaData/DataAPI.jl#17 is merged then this PR should be updated to remove
Thanks, looks mostly good! Apart from minor things, I have noted a few design decisions regarding what type name accessors should return.
How confident are you that tests cover (almost) everything? Does coverage confirm that all new methods are tested? It's too bad so many tests have to be duplicated, but that's the only solution (we could almost imagine using some Cassette.jl tricks to automatically call each function that appears in a list with its symbol arguments replaced with strings and check it gives the same result...).
OK - I am done with round 2 of documentation fixes :)
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
Following your fixes I have reformatted some code to also fix the customary line width. This prompted me to make one change (only internal - the API is unchanged). I defined Finally - it will make implementation of JuliaData/DataAPI.jl#16 a minor PR (we will have to change a few lines of code to incorporate it).
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
Thank you for this monumental effort!
After playing around with this, I must say I prefer using `:x` etc. when working interactively. Remembering to close quotation marks and press the right arrow is kind of annoying. Even with OhMyREPL.
But it's going to be great for new users to not have to worry about `Symbol`s and it will be easier to work programmatically with things.
I just hope we don't overwhelm users with too many new methods.
Can a list of all the other things this PR changes be added to the top post?
Co-Authored-By: pdeffebach <[email protected]>
@oxinabox - the problem we had was that there was never a good time to clean up the code and this PR anyway touched almost all files in the code-base (otherwise there was always a problem that such clean-up changes later force rebasing of all PRs, and with this PR we have stalled merging of other PRs anyway till this one is finished to make finishing it easier + I had to read every line in the code so when I spotted something I just corrected it). E.g. I have changed
to
So the list of changes in this PR is: User visible:
Internal:
(I will make this list in the top post also)
I'm not really convinced `SymbolOrString` is better than `Union{Symbol, AbstractString}`, but otherwise looks good.
I am largely indifferent here but since in base we have
Thank you all for working on this huge PR!
Stopping reviewing now since this is already merged.
Still, the two thirds I finished before it was merged all looked good.
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
| `DataFrameColumns` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `GroupedDataFrame` | `Vector{String}` | tuple of fields | `GroupKeys` |
| `GroupKeys` | undefined | tuple of fields | vector of `Int` |
| `GroupKeys` | undefined | tuple of fields | vector of `Int` |
| `GroupKeys` | undefined | tuple of fields | `AbstractVector{Int}` |
same here - it is LinearIndices{1,Tuple{Base.OneTo{Int}}}
|---------------------|------------------|------------------|------------------|
| `AbstractDataFrame` | `Vector{String}` | `Vector{Symbol}` | undefined |
| `DataFrameRow` | `Vector{String}` | `Vector{Symbol}` | `Vector{Symbol}` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | vector of `Int` |
| `DataFrameRows` | `Vector{String}` | `Vector{Symbol}` | `Vector{Int}` |
Actually it is `LinearIndices{1,Tuple{Base.OneTo{Int}}}` but I wanted to avoid writing this as it would be confusing I think.
@@ -45,7 +45,8 @@ julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name. Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position.
Columns can be directly (i.e. without copying) accessed via `df.col`, `df."col"`, `df[!, :col]` or `df[!, "col"]`. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as `:col`, `:var"col"` or `Symbol("col")`) or strings (written as `"col"`).
It seems worth documenting on this page what happens with string interpolation, e.g.
x = "ol"
df."c$x"
Either that it works, or that it doesn't.
It does not work. I will add a note in the PR that follows your comments.
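For readers who need a computed column name, a minimal sketch of the alternatives discussed in this thread follows. It assumes a DataFrames.jl release that includes this PR; the data frame and the variable names (`name`, `by_string`, and so on) are made up for illustration.

```julia
using DataFrames

df = DataFrame(col = 1:3)
x = "ol"

# Build the column name first, then pass it to an indexing call.
name = "c$x"

by_string   = df[!, name]            # string index, no copy
by_symbol   = df[!, Symbol(name)]    # Symbol index, no copy
by_property = getproperty(df, name)  # functional form of property access
```

All three calls should return the stored column itself rather than a copy, so they are interchangeable when the name is only known at run time.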
DataFrames.jl allows to use `Symbol`s (like `:A`) and strings (like `"A"`)
for all column indexing operations for convenience.
However, using `Symbol`s is slightly faster and should generally be preferred.
However, using `Symbol`s is slightly faster and should generally be preferred.
However, using `Symbol`s is slightly faster and should generally be preferred, if not generating them via string manipulation.
added
throw(ArgumentError("Duplicate names not allowed. Duplicated value(s) are: " *
":$(join(duplicate_names, ", "))"))
end

# Put the summary stats into the return data frame
data = DataFrame()
data.variable = names(df)
data.variable = copy(_names(df))
this is the same code as propertynames(df)
good point - fixed.
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal) =
`Union{Symbol, AbstractVector{Symbol}, AbstractVector{<:AbstractString}}` matches the same patterns as `Union{Symbol, AbstractVector{<:SymbolOrString}}`, and I think the latter is shorter and easier to read.
I think it would be good to replace all instances of the former with the latter.
This will allow mixing strings and `Symbol`s. I leave it out for now as I want to hear what @nalimilan thinks about it (I will make a note in the PR to consider this).
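The difference between the two signatures can be checked in plain Julia. The sketch below defines `SymbolOrString` locally as it is described in this thread; the alias names `Narrow` and `Wide` are hypothetical and exist only for this illustration.

```julia
# Local stand-in for the alias discussed in this PR.
const SymbolOrString = Union{Symbol, AbstractString}

# Hypothetical names for the two candidate signatures.
const Narrow = Union{Symbol, AbstractVector{Symbol}, AbstractVector{<:AbstractString}}
const Wide   = Union{Symbol, AbstractVector{<:SymbolOrString}}

# A vector whose element type admits both Symbols and Strings.
mixed = Union{Symbol, String}[:a, "b"]

homogeneous_ok = ([:a, :b] isa Narrow) && (["a", "b"] isa Narrow) &&
                 ([:a, :b] isa Wide)   && (["a", "b"] isa Wide)
mixed_in_narrow = mixed isa Narrow
mixed_in_wide   = mixed isa Wide
```

Both unions accept homogeneous vectors, but only the shorter one accepts a vector typed to hold a mix of `Symbol`s and strings, which is the behavioural change the reply points out.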
if all(x -> x isa AbstractString, keys(v))
v = (;(Symbol.(keys(v)) .=> values(v))...)
end
for n in view(_names(df), idxs)
if !haskey(v, n)
throw(ArgumentError("Column :$n not found in source dictionary"))
end
end
elseif !all(((a, b),) -> a == b, zip(view(_names(df), idxs), keys(v)))
mismatched = findall(view(_names(df), idxs) .!= collect(keys(v)))
throw(ArgumentError("Selected column names do not match the names in assigned value in" *
" positions $(join(mismatched, ", ", " and "))"))
throw(ArgumentError("Selected column names do not match the names in assigned " *
"value in positions $(join(mismatched, ", ", " and "))"))
end

for (col, val) in pairs(v)
An alternative way to do this would be to declare a local variable `v_keys` which promises to be a collection of `Symbol`s, e.g.
v_keys = keys(v)
v_keys = keytype(v) === Symbol ? keys(v) : Symbol.(keys(v))
Then each time `keys(v)` is used in this function it can be replaced with `v_keys`.
The values are never used until the end, so this would save having to convert the values as well, and line 128 works already with symbols or strings.
This is related to the comment above - whether we allow mixing strings with `Symbol`s.
Currently we do not allow mixing. When we decide on this I will rewrite it anyway.
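The key-normalisation idea from this thread can be sketched in plain Julia as follows. The helper name `symbol_keys` is hypothetical and is not part of DataFrames.jl; unlike the PR at this stage, the sketch would also accept mixed keys.

```julia
# Hypothetical helper: return the keys of `v` as a Vector{Symbol},
# leaving the values untouched. Symbol(k) is a no-op for Symbol keys.
symbol_keys(v) = Symbol[Symbol(k) for k in keys(v)]

from_namedtuple = symbol_keys((a = 1, b = 2))
from_dict       = symbol_keys(Dict("a" => 1, "b" => 2))
```

Because a `Dict` has no guaranteed iteration order, the result for a dictionary should be compared as a set rather than position by position.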
Thank you for the comments. I will make a separate PR taking them into account.
Fixes #1926.
UPDATED
The list of changes in this PR is:
User visible:
- `AbstractString` everywhere where `Symbol` was accepted as a column indicator (i.e. this extends beyond just `getindex` and `getproperty`, every function was reviewed and made consistent)
- `names` function consistently returns vector of strings for all types that return column names
- `propertynames` function consistently returns vector of `Symbol`s for all types that return column names
- `keys` function consistently returns vector of `Symbol`s for all types that return column names as keys
- `rename!` passes `String` to a function if API with function is used
- `hasproperty` for types that have custom `getproperty`
- `cols` to `names` function for types where it makes sense consistently
- `push!` accepts dicts with string keys
- indexing with an arbitrary eltype vector now throws better error messages (informative `ArgumentError` is thrown in more cases rather than `MethodError` in the past)
Internal:
- `SymbolOrString`, `ColumnIndex` (updated), and `MultiColumnIndex` unions and use them consistently across whole codebase
- `keys` from `AbstractIndex` as it was never used
- `return` statements in long functions as many of them were updated anyway, but also in unrelated places)
- `_names` returns vector of `Symbol`s without copying
- `names` to `_names` internally to avoid unnecessary allocations
- `groupcols` where this function would be appropriate to call; similarly with `names(df, cols)` where it made sense
- `without` function that was not needed anymore
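A few of the user-visible changes listed above can be exercised with a short sketch. It assumes a DataFrames.jl release that includes this PR; the data frame reuses the example from the documentation hunk quoted earlier, and the variable names are illustrative only.

```julia
using DataFrames

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# Column names as strings versus as Symbols.
string_names = names(df)
symbol_names = propertynames(df)

# Strings and Symbols select the same stored column.
same_column = df[!, "A"] === df[!, :A]

# push! accepting a dictionary with string keys.
push!(df, Dict("A" => 5, "B" => "F"))
row_count = nrow(df)
```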