[REVIEW] Add implicit typecasting of join columns when dtypes do not match #3451
Codecov Report
@@ Coverage Diff @@
## branch-0.12 #3451 +/- ##
===============================================
- Coverage 86.91% 86.91% -0.01%
===============================================
Files 50 50
Lines 9464 9469 +5
===============================================
+ Hits 8226 8230 +4
- Misses 1238 1239 +1
Continue to review full report at Codecov.
@kkraus14 I'm working on a way of benchmarking the approach where we punt to The other thing I noticed is that the join itself seems to dominate the runtime regardless, so for some data sizes these tweaks make a fairly negligible difference.
@kkraus14 It looks like punting to
I had to do a little gymnastics to get the categorical cases right here. In the case of a categorical/non-categorical implicit merge, we want to perform the join by first casting both sides to the datatype of the underlying categories. This leaves us with the task of reconstructing the correct categorical column after the merge takes place. We need to keep track of the original cc @shwina
Great work @brandon-b-miller
- ctgry_err = "can't implicitly cast column {} to categories \
- from right during left join"
+ ctgry_err = "can't implicitly cast column {0} to categories \
+ from {1} during {1} join"
No need to change this; just a comment that it might be preferable to do something like:
ctgry_err = ("can't implicitly cast column {column_id} to categories"
" from {how} during {how} join")
ctgry_err.format(column_id=rcol, how="right")
Closes #2230
Several notes about this. Following the discussion in the issue, pandas' handling of this is somewhat inconsistent in some cases, specifically with respect to what dtype users can expect the joined output columns to have. As such, this PR implements our own set of casting rules, which don't necessarily match pandas in all cases. A summary is below:
Inner joins
For inner joins we'll attempt to promote to a datatype that can contain all of the data in both the right and left hand columns, with the exception of categorical data. For int-int and datetime-datetime, that would be the larger/higher resolution of the two. If either side is float, we will get float32 up through int16 on the int side and float64 subsequently.
Left and right joins
We prioritize the 'base' table if possible in these scenarios, except in the categorical case. If the base datatype can contain the other, the base datatype will be the output. If not, we explicitly (and probably expensively) check that all the values in the other column are representable by the base datatype, and if so, select the base datatype. If they are not, we raise a warning and cast based on the rules we'd use for an inner join. This should guard against overflow.
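The inner-join promotion and the base-table rule described above can be sketched in plain Python as follows. This is a toy model under stated assumptions, not the code in this PR: the names DType, inner_join_dtype and base_join_dtype are hypothetical, dtypes are modelled as a (kind, bits) pair, and only int and 32/64-bit float columns are covered (datetime and categorical columns are left out).

```python
import struct
import warnings
from typing import NamedTuple, Sequence, Union

Number = Union[int, float]


class DType(NamedTuple):
    """Hypothetical stand-in for a column dtype: kind is "int" or "float"."""
    kind: str
    bits: int


def inner_join_dtype(left: DType, right: DType) -> DType:
    """Pick a dtype that can hold the data of both sides (inner-join rule)."""
    if left.kind == right.kind:
        # int-int or float-float: the wider of the two wins.
        return max(left, right, key=lambda d: d.bits)
    int_side, float_side = (left, right) if left.kind == "int" else (right, left)
    # float32 is enough up through int16; wider ints need float64.
    needed = 32 if int_side.bits <= 16 else 64
    return DType("float", max(needed, float_side.bits))


def _representable(value: Number, dtype: DType) -> bool:
    """True if `value` survives a cast to `dtype` unchanged."""
    if dtype.kind == "int":
        limit = 2 ** (dtype.bits - 1)
        is_whole = isinstance(value, int) or value.is_integer()
        return is_whole and -limit <= value < limit
    fmt = "f" if dtype.bits == 32 else "d"
    try:
        # Round-trip through the binary float format and compare exactly.
        return struct.unpack(fmt, struct.pack(fmt, value))[0] == value
    except OverflowError:
        return False


def base_join_dtype(base: DType, other: DType,
                    other_values: Sequence[Number]) -> DType:
    """Left/right-join rule: prefer the dtype of the 'base' table."""
    promoted = inner_join_dtype(base, other)
    if promoted == base:
        return base  # the base dtype already contains the other dtype
    # The explicit (and potentially expensive) value-by-value check.
    if all(_representable(v, base) for v in other_values):
        return base
    warnings.warn(f"cannot keep {base}; falling back to {promoted}")
    return promoted
```

For instance, base_join_dtype(DType("int", 8), DType("int", 64), [1, 2, 3]) keeps the 8-bit base dtype because every value fits, while a value such as 300 on the other side triggers the warning and the inner-join promotion to 64 bits.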
Categorical data
If either column is categorical, we'll attempt to output a column that is also categorical and contains the same categories as the original data. This makes sense in the cases where either an inner join is performed, or the categorical column belongs to the 'base' table for left and right joins. In those cases, we can have at most the original categories, so our output CategoricalDtype can retain the same categories/ordering that the original categorical data had. However, if the left hand side is not categorical and the right hand side is categorical (or the mirror case with a right join), we have to error; otherwise we'd be attempting to construct a categorical variable that potentially has more categories than we started out with, and we'd have no way of understanding any ordering.
As a side effect of allowing a categorical join to proceed implicitly, we need to cast both columns to the underlying datatype of the categorical variable's categories to actually perform the join in libcudf. This requires that we keep track of which columns were originally categorical so that we can reconstruct the correct datatype afterwards.
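The categorical handling described above can be illustrated with a small pure-Python sketch. It is only a model of the idea, not the libcudf-backed implementation: Categorical, check_categorical_join and inner_join_keys are hypothetical names, and a categorical column is represented as a tuple of categories plus integer codes.

```python
from typing import Hashable, NamedTuple, Sequence, Tuple


class Categorical(NamedTuple):
    """Hypothetical categorical column: categories plus integer codes."""
    categories: Tuple[Hashable, ...]
    codes: Tuple[int, ...]
    ordered: bool = False

    def values(self) -> list:
        # "Cast" to the underlying datatype of the categories.
        return [self.categories[code] for code in self.codes]


def check_categorical_join(left_is_cat: bool, right_is_cat: bool, how: str) -> None:
    """Reject joins whose output could need categories the base side lacks."""
    if how == "left" and right_is_cat and not left_is_cat:
        raise ValueError("can't implicitly cast column to categories "
                         "from right during left join")
    if how == "right" and left_is_cat and not right_is_cat:
        raise ValueError("can't implicitly cast column to categories "
                         "from left during right join")


def inner_join_keys(cat: Categorical, plain: Sequence[Hashable]) -> Categorical:
    """Inner-join a categorical key column against a plain key column."""
    # Join on the underlying values (nested loop keeps duplicate matches).
    matched = [v for v in cat.values() for p in plain if p == v]
    # Rebuild the categorical with the original categories and ordering,
    # which is why we must remember that the column was categorical.
    code_of = {category: i for i, category in enumerate(cat.categories)}
    return Categorical(cat.categories,
                       tuple(code_of[v] for v in matched),
                       cat.ordered)
```

Because an inner join can only drop values, the rebuilt column never needs a category that was not in the original, so the original categories and ordering can be reused as-is.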