Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add implicit typecasting of join columns when dtypes do not match #3451

Merged
merged 34 commits into from
Jan 12, 2020

Conversation

brandon-b-miller
Copy link
Contributor

Closes #2230

Several notes about this. Following the discussion in the issue, pandas handling of this is a little inconsistent in some cases, specifically w.r.t. what users can expect the output's joined columns type to be. As such this PR implements our own set of casting rules that don't necessarily match pandas in all cases. A summary is below:

Inner joins

For inner joins we'll attempt to promote to a datatype that can contain all of the data in both the right and left hand columns, with the exception of categorical data. For int-int and datetime-datetime, that would be the larger/higher resolution of the two. If either side is float, we will get float32 up through int16 on the int side and float64 subsequently.

Left and right joins

We prioritize the 'base' table if possible in these scenarios except in the categorical case. In the case that the base datatype can contain the other, the base datatype will be the output. If not, we explicitly (and probably expensively) check to make sure all the values in the other are representable by the base datatype, and if so, select the base datatype. If not, raise a warning and cast based on the rules we'd use for an inner join. This should guard against overflowing.

Categorical data

If either column is categorical, we'll attempt to output a column that is also categorical, and contains the same categories as the original data. This makes sense in the cases where either an inner join is performed, or the categorical column belongs to the 'base' table for left and right joins. In those cases, we can have at most the original categories, so our output CategoricalDtype can retain the same categories/ordering that the original categorical data had. However if the left hand side is not category, and the right hand side is category (or the mirror case with a right join) we have to error otherwise we'd be attempting to construct a categorical variable that potentially has more categories then we started out with, and we'd have no way of understanding any ordering.

As a side effect of allowing a categorical join to proceed implicitly we need to cast both columns to the underlying datatype of the categorical variables categories to actually perform the join in libcudf. This requires that we keep track of what columns were originally categorical so that we can reconstruct the correct datatype afterwards.

@brandon-b-miller brandon-b-miller requested a review from a team as a code owner November 25, 2019 20:26
@brandon-b-miller brandon-b-miller self-assigned this Nov 25, 2019
@brandon-b-miller brandon-b-miller added Python Affects Python cuDF API. 2 - In Progress Currently a work in progress labels Nov 25, 2019
@codecov
Copy link

codecov bot commented Nov 26, 2019

Codecov Report

Merging #3451 into branch-0.12 will decrease coverage by <.01%.
The diff coverage is 75.6%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.12    #3451      +/-   ##
===============================================
- Coverage        86.91%   86.91%   -0.01%     
===============================================
  Files               50       50              
  Lines             9464     9469       +5     
===============================================
+ Hits              8226     8230       +4     
- Misses            1238     1239       +1
Impacted Files Coverage Δ
python/cudf/cudf/core/column/numerical.py 93.95% <100%> (ø) ⬆️
python/cudf/cudf/core/index.py 89.28% <50%> (ø) ⬆️
python/cudf/cudf/core/buffer.py 90.74% <50%> (ø) ⬆️
python/cudf/cudf/utils/ioutils.py 85.52% <75%> (+0.59%) ⬆️
python/cudf/cudf/core/dataframe.py 92.56% <84.61%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6947ac...5fd8bf1. Read the comment docs.

@brandon-b-miller brandon-b-miller added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 26, 2019
@brandon-b-miller brandon-b-miller changed the title [WIP] Add implicit typecasting of join columns when dtypes do not match [REVIEW] Add implicit typecasting of join columns when dtypes do not match Nov 26, 2019
@brandon-b-miller brandon-b-miller changed the base branch from branch-0.11 to branch-0.12 December 24, 2019 15:10
@brandon-b-miller
Copy link
Contributor Author

brandon-b-miller commented Dec 30, 2019

@kkraus14 I'm working on a way of benchmarking the approach where we punt to int64 and float64 in int <--> float cases. In doing so I noticed there's potentially some noise in the plots that comes (I think) from the way that variables go in and out of scope and thus free GPU memory, so I'm falling back to a different profiling approach. Details to follow.

The other thing I noticed is that the join itself seems to dominate the runtime regardless, as such for some data sizes these tweaks can make a fairly negligible difference.

@brandon-b-miller
Copy link
Contributor Author

@kkraus14 It looks like punting to float64 is the fastest of the three approaches. As such I suppose we could remove _overflow_safe_to since in that case we wouldn't need to do any checking, do we think that code is worth keeping or should it just be removed?

@brandon-b-miller brandon-b-miller added 3 - Ready for Review Ready for review by team and removed 0 - Waiting on Author Waiting for author to respond to review labels Jan 10, 2020
@brandon-b-miller
Copy link
Contributor Author

I had to do a little gymnastics to get the categorical cases right here. In the case of a categorical/non-categorical implicit merge, we want to perform the join by casting both sides to the datatype of the underlying categories first. This leaves us with the task of reconstructing the correct categorical column after the merge takes place. We need to keep track of the original codes that map each integer in the underlying categorical data to the actual category labels, and the cleanest way of doing so seemed to be to carry them along for the actual join. That way we'll always have the category index that originally corresponded to each row in the dataframe handy to use as codes when reconstructing the categorical data. Let me know if this seems reasonable.

cc @shwina

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @brandon-b-miller

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
ctgry_err = "can't implicitly cast column {} to categories \
from right during left join"
ctgry_err = "can't implicitly cast column {0} to categories \
from {1} during {1} join"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change this; just a comment that it might be preferable to do something like:

ctgry_err = ("can't implicitly cast column {column_id} to categories"
                   " from {how} during {how} join")

ctgry_err.format(column_id=rcol, how="right")

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jan 11, 2020
@kkraus14 kkraus14 merged commit f639bab into rapidsai:branch-0.12 Jan 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cudf.DataFrame.merge should support implicit type conversions on join columns
3 participants