Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missing append!(::CategoricalArray, ::SubArray{<:CategoricalArray}) #170

Open
bkamins opened this issue Nov 18, 2018 · 29 comments
Open

missing append!(::CategoricalArray, ::SubArray{<:CategoricalArray}) #170

bkamins opened this issue Nov 18, 2018 · 29 comments

Comments

@bkamins
Copy link
Member

bkamins commented Nov 18, 2018

We do not have a way to append subarray to an array.

@nalimilan I do not want to mess up with a hasty PR, because a more thought of redesign of SubArray handling should be done, and you know the package best.

This request is pretty urgent to decide what to do and tag a new release, as without it we will not be able to change getindex in DataFrames.jl to return views in case sdf[colind] (now it returns a copy so all worked). This, in consequence, has a major impact on groupby-related functions.

@bkamins
Copy link
Member Author

bkamins commented Nov 18, 2018

I see that we have support for SubArray, but it seems it is not complete.

Also there is a question if we want to support RepeatedVector (it is a very rare case probably, but I think it gives us guidance if we can have some "general" solution).

@nalimilan
Copy link
Member

That's probably just an oversight. PR welcome!

Regarding RepeatedVector, the generic AbstractArray fallback should work for it, right? Though of course a specialized method could be much faster. At any rate that method should probably live in DataFrames.

@bkamins
Copy link
Member Author

bkamins commented Nov 19, 2018

Filled #171 to fix it.

Agreed that RepeatedVector should be fixed in DataFrames.jl, but my question is what AbstractArray fallback are you referring to (I do not see such a fallback in the code).

@nalimilan
Copy link
Member

my question is what AbstractArray fallback are you referring to (I do not see such a fallback in the code).

Sorry, I always think Base provides a generic fallback based on resize! and setindex!, but actually that's not the case. We should provide our own fallback doing precisely that.

@bkamins
Copy link
Member Author

bkamins commented Nov 19, 2018

That was my problem in the first place :).
#171 should be merged as is I think this is a safe operation and I know what it should do (i.e. the crucial part is that it preserves the levels of the parent of the view).

A generic AbstractArray functionality should be another PR. And here the question is: what do we want to accept? If you want to append non-categorical array do you want to extend levels of the original array if there are missing levels in the array you append to (and if yes in what order - sequence of appearance in the appended array)?

@nalimilan
Copy link
Member

I think the fallback should literally be defined as calling resize! and setindex! repeatedly (or maybe copyto!, which should be equivalent). Then levels are automatically expanded in the order of appearance, and isordered is set to false if new levels are added.

@bkamins
Copy link
Member Author

bkamins commented Nov 19, 2018

OK. I will investigate this as a separate PR (and you will make me to better know the internals of CategoricalArrays.jl 😄).

@bkamins
Copy link
Member Author

bkamins commented Nov 19, 2018

While working on #171 I noted that actually appending unordered CategoricalArray to ordered one keeps the array that we append to ordered. This is conflicting what you have written above (maybe not strictly, but it is a similar situation).

Also if we append an ordered categorical array to other categorical array we keep the order from the LHS. Here is an example:

julia> x = categorical([1,2], ordered=true)
2-element CategoricalArray{Int64,1,UInt32}:
 1
 2

julia> y = categorical([1,2], ordered=true)
2-element CategoricalArray{Int64,1,UInt32}:
 1
 2

julia> levels!(y, [2,1])
2-element CategoricalArray{Int64,1,UInt32}:
 1
 2

julia> append!(x, y)
4-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 1
 2

julia> levels(x)
2-element Array{Int64,1}:
 1
 2

julia> levels(y)
2-element Array{Int64,1}:
 2
 1

Proposal: we should keep the result ordered only if what we append is ordered AND that order does not conflict the order of what we append to.

Also in #171 I would keep the as-is behavior (so only change the tests), but in the next PR implement the target functionality (if we decide something needs to be changed).

@nalimilan
Copy link
Member

Good catch. Yes, we should mark the array as unordered in that case. There's logic for that in copyto!. But I agree that's a separate PR.

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2018

OK. So I have the following design question before proceeding.

This throws an error:

x = categorical([1,2,3], ordered=true)
push!(x, 10)

Should then the following statements throw an error

append!(x, [10]) # I would say yes

append!(x, categorical([10])) # I would say yes

append!(x, categorical([10], ordered=true)) # I would say yes

y = categorical([1,2], ordered=true)
levels!(y, [2,1])
append!(x, y) # I would say yes

y = categorical([1,2,10], ordered=true)
append!(x, y[1:2]) # I would say no

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2018

As for the case:

y = categorical([1,2], ordered=true)
levels!(y, [2,1])
append!(x, y) # I would say yes

I see that copyto!(x, y) is allowed here, so maybe append!(x, y) should be allowed.

However, then the design issue I have is that:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> x[1] = 4
ERROR: cannot add new level 4 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> copyto!(x, [4])
ERROR: cannot add new level 4 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> copyto!(x, categorical([4]))
3-element CategoricalArray{Int64,1,UInt32}:
 4
 2
 3

and I do not understand why the last operation succeeds while first two failed (which would be expected).

@nalimilan
Copy link
Member

Funny, I had forgotten about that strict behavior. Then we should throw an error in all cases where a level would have to be added.

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2018

OK, so the rule (in general in all functions should be):

  1. it is not allowed to implicitly add a level to ordered catarray nor implicitly change it to unordered.
  2. it is allowed to implicitly add a level to unordered catarray. it will be added at the end of its levels in the order of appearance.
  3. as a consequence - it actually does not matter what is the type of source of the operation (we inherit the properties of of destination of the operation following rules 1 and 2)

right?

This means in particular that both:

x = categorical([1,2], ordered=true)
y = categorical([1,2], ordered=true)
levels!(y, [2,1])
append!(x, y)

and

x = categorical([1,2], ordered=true)
y = categorical([1,2,10], ordered=true)
append!(x, y[1:2])

should be accepted and use the order and levels of x before the operation (same applies to copyto!, push! and setindex! in similar cases).

If we agree on this I can go through all the relevant functions and sanitize them to follow those rules.

@nalimilan
Copy link
Member

Sounds right. We could imagine checking that the order of levels of the source matches that of the destination, but that could be inconvenient and I'm not sure it matters a lot in practice.

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2018

I thought about it but this is not how setindex! works now so I thought it was intended (i.e. RHS in setindex! is treated as a value).

And as you say - this is a corner case that actually could be inconvenient in some cases.
And I think that the above rule that in any assignment type operations only levels of destination are important and and source is treated as plain value is easy to understand and remember.

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2018

copyto! currently corrupts categorical arrays. I will fix all those bugs, but in the first run I will propose a slow implementation that I am sure is correct and write tests against it (and append! will be a similar implementation pattern). Next we can discuss how we can optimize its performance. OK?

An example of corruption behavior:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = [3, 100]
2-element Array{Int64,1}:
   3
 100

julia> copyto!(x, 1, y, 1, 2)
ERROR: cannot add new level 100 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> x
3-element CategoricalArray{Int64,1,UInt32}:
 3
 2
 3

@nalimilan
Copy link
Member

You mean it leaves the destination half-modified in case of errors? That's also the case for Array when conversion fails. That's annoying, but I'd say it's fine unless we can avoid that without hurting performance (having an efficient copyto! is important).

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2018

But the same applies to append!. In Base it will extend the vector and leave it half-modified on error. So we should do it either "the safe way" or "the fast way" in both cases I think.

Here you have a branch with "the safe way" (but slow at this time) approach: https://github.com/bkamins/CategoricalArrays.jl/tree/better_copyto.

If we want to go the fast way actually what we currently have can be speeded up I think as we can avoid some checks that are done now (also currently there is a bug in current implementation with checking for Missing in src as it scans the whole src and not only a view of the relevant range - I have it fixed in my "safe" branch).

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2018

Also note that in the old copyto! (before my proposal) - different method was invoked when dest and src had the same dimension and different method was invoked when they had different dimensions (which lead to inconsistency in the results):

julia> x = categorical([1 2; 3 4], ordered = true)
2×2 CategoricalArray{Int64,2,UInt32}:
 1  2
 3  4

julia> y = categorical([4, 10])
2-element CategoricalArray{Int64,1,UInt32}:
 4
 10

julia> copyto!(x, 1, y, 1, 2)
ERROR: cannot add new level 10 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

but

julia> x = categorical([1, 2, 3, 4], ordered = true)
4-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 4

julia> y = categorical([4, 10])
2-element CategoricalArray{Int64,1,UInt32}:
 4
 10

julia> copyto!(x, 1, y, 1, 2)
4-element CategoricalArray{Int64,1,UInt32}:
 4
 10
 3
 4

@nalimilan
Copy link
Member

Yes, copyto! needs to check the order of levels. When setindex! was made stricter (in #125), the code was changed in get! for CategoricalPool, but copyto! still applies the old behavior (i.e. set isordered to false).

But simplifying copyto! like you do in the branch would hurt performance too much I think. The code exists, better keep it and just update the checks to match the new behavior.

@bkamins
Copy link
Member Author

bkamins commented Nov 24, 2018

Let us concentrate on copyto! first.

I am not 100% sure what target contract you wan to this function to follow (in the parenthesis I write how I understand your intention):

  1. do we allow copyto! to fail and throw an error leave partially-modified dest? (yes, if it would improve the performance)
  2. do we want copyto! to have different behavior conditional on the fact if src has the same dimension as dest or not (no - we want them to behave the same way)
  3. do we want copyto! to strictly keep levels and ordering of dest if it is ordered? (yes - copyto! should only be sensitive to the fact that dest is ordered categorical and should not alter levels; specialization on src type should be performed only for performance reasons)
  4. do we want the behavior described in point 3. to apply only to the part of src that is being copied (yes; currently we look at the whole src if it is categorical, but only at the copied part if src is not categorical)
  5. do we want copyto! to add levels as needed if dest is not ordered? (yes)

If this is your intention I will try to implement it.

@nalimilan
Copy link
Member

Sounds right!

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

Slowly moving forward (the corner cases are a headache).Regarding point 5 above I want the following:

julia> x = categorical([1:255;1:255], true);

julia> y = categorical([1,1000], true);

julia> copyto!(x,1,y,1,1)
ERROR: cannot store level(s) 1000 since reference type UInt8 can only hold 255 levels. Use the decompress function to make room for more levels.

to work instead of throwing error.

This will require to dynamically determine which levels in y are actually used. This might lead to a slight performance regression (I will try to avoid this performance drop, but I have not finished the design yet).

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

Also I find this "fast path" problematic:

julia> x = categorical([1,2,3])
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = categorical([5,6,7])
3-element CategoricalArray{Int64,1,UInt32}:
 5
 6
 7

julia> levels(x)
3-element Array{Int64,1}:
 1
 2
 3

julia> copyto!(x,y)
3-element CategoricalArray{Int64,1,UInt32}:
 5
 6
 7

julia> levels(y)
3-element Array{Int64,1}:
 5
 6
 7

as user might expect that old levels of x should be retained (this happens in all other cases).

(we can discuss all this when I push a PR with the proposed changes)

@nalimilan
Copy link
Member

Yes, that's not ideal. IIRC that's because I couldn't use copy! when I wrote that. copyto! should probably always preserve existing levels, but copy! shouldn't since it's defined as discarding all previous contents.

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

Actually I have accidentally written levels(y) above due to my error. levels(x) seems OK. Sorry for the confusion.

As for copy! we cannot do it I think because dest in copy! can be subarray so we cannot simply drop levels then.

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

Also copy! does not work now if src is not categorical. I will add a fix to this.

@nalimilan
Copy link
Member

nalimilan commented Nov 26, 2018

I hadn't noticed your first comment. I don't think we can do anything about pool overflow during copyto!, as the type parameter is fixed. People should use compress=true carefully anyway, in general the UInt32 default is reasonable.

Indeed we can't drop levels for views, but does that imply we shouldn't drop levels for plain categorical arrays? Not sure. Anyway copyto! is a bit weird for categorical arrays, in general we use it only when filling empty arrays in DataFrames. BTW, that's a good reference to check what behaviors are useful when changing copyto!.

@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2018

I have made a PR so we can continue a discussion there (when it passes tests :)). It introduces around 10% performance decrease, due to:

  • if dest is ordered: we check not to add levels
  • if dest is not ordered: we check to add only levels that are needed to be added

The rule for unordered means that we solve problem:

I don't think we can do anything about pool overflow

if we actually do not need to add levels even if src has more levels.

Also I think that we should make append!, copyto!andcopy!` fully correct - as we do not know how users might use them.

Thank you for discussing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants