[REVIEW] Add implicit typecasting of join columns when dtypes do not match #3451

brandon-b-miller · 2019-11-25T20:26:59Z

Several notes about this. Following the discussion in the issue, pandas' handling of mismatched join dtypes is somewhat inconsistent, specifically with respect to what dtype users can expect the joined output columns to have. This PR therefore implements our own set of casting rules, which do not necessarily match pandas in all cases. A summary is below:

Inner joins

For inner joins we'll attempt to promote to a datatype that can contain all of the data in both the left and right hand columns, with the exception of categorical data. For int-int and datetime-datetime joins, that is the larger or higher-resolution type of the two. If either side is float, the result is float32 when the int side is int16 or smaller, and float64 for anything larger.
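The promotion rule above can be illustrated with NumPy's own type promotion, which happens to give the same results for the cases listed. This is only a sketch: `inner_join_key_dtype` is a hypothetical helper name, and using `np.promote_types` as a stand-in is an assumption, not necessarily what the PR's code does.

```python
import numpy as np


def inner_join_key_dtype(left, right):
    """Hypothetical helper: dtype able to hold both join key columns.

    Assumption: np.promote_types is used as a stand-in for the PR's rules.
    """
    return np.promote_types(np.dtype(left), np.dtype(right))


# int-int: the larger of the two
print(inner_join_key_dtype("int8", "int32"))
# datetime-datetime: the higher resolution of the two
print(inner_join_key_dtype("datetime64[s]", "datetime64[ms]"))
# float32 with a small int stays float32 ...
print(inner_join_key_dtype("int16", "float32"))
# ... but a wider int forces float64
print(inner_join_key_dtype("int32", "float32"))
```

Categorical keys are deliberately left out of this sketch, since they follow separate rules.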

Left and right joins

We prioritize the 'base' table if possible in these scenarios, except in the categorical case. If the base datatype can contain the other, the base datatype will be the output. If not, we explicitly (and probably expensively) check whether all the values in the other column are representable by the base datatype, and if so, select the base datatype. Otherwise we raise a warning and cast based on the rules we'd use for an inner join. This should guard against overflow.
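A minimal sketch of that three-step decision, written against NumPy arrays rather than cuDF columns. The names `values_fit` and `base_join_key_dtype` are hypothetical, the value check only handles integer-to-integer cases, and the fallback uses `np.promote_types` as an assumed stand-in for the inner-join rules.

```python
import warnings

import numpy as np


def values_fit(values, target):
    """Hypothetical check: do all integer values fit in integer dtype `target`?"""
    target = np.dtype(target)
    if values.dtype.kind not in "iu" or target.kind not in "iu":
        return False  # this sketch only inspects int -> int casts
    if values.size == 0:
        return True
    bounds = np.iinfo(target)
    # Scans the data, which is the "probably expensive" step.
    return bounds.min <= int(values.min()) and int(values.max()) <= bounds.max


def base_join_key_dtype(base, other):
    """Pick the key dtype for a left/right join, preferring the base column."""
    # 1. The base dtype can hold anything the other dtype can express.
    if np.can_cast(other.dtype, base.dtype, casting="safe"):
        return base.dtype
    # 2. It cannot in general, but the actual values happen to fit.
    if values_fit(other, base.dtype):
        return base.dtype
    # 3. Otherwise warn and fall back to inner-join style promotion.
    warnings.warn("join key does not fit the base dtype; upcasting", UserWarning)
    return np.promote_types(base.dtype, other.dtype)


base = np.array([1, 2, 3], dtype=np.int8)
print(base_join_key_dtype(base, np.array([1, 2], dtype=np.int64)))
```

Here the int64 values 1 and 2 fit in int8, so the base dtype is kept; a value such as 300 would trigger the warning and the promoted dtype instead.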

Categorical data

If either column is categorical, we'll attempt to output a column that is also categorical and contains the same categories as the original data. This makes sense when either an inner join is performed or the categorical column belongs to the 'base' table for left and right joins. In those cases, we can have at most the original categories, so our output CategoricalDtype can retain the same categories/ordering that the original categorical data had. However, if the left hand side is not categorical and the right hand side is categorical in a left join (or the mirror case with a right join), we have to error; otherwise we'd be attempting to construct a categorical variable that potentially has more categories than we started out with, and we'd have no way of understanding any ordering.

As a side effect of allowing a categorical join to proceed implicitly, we need to cast both columns to the underlying datatype of the categorical variable's categories to actually perform the join in libcudf. This requires that we keep track of which columns were originally categorical so that we can reconstruct the correct datatype afterwards.
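The allow-or-error rule for mixed categorical/non-categorical keys can be sketched as a small decision function. `categorical_output_side` is a hypothetical name, the error text is illustrative, and the sketch only covers the case where exactly one side is categorical.

```python
def categorical_output_side(how, left_is_categorical, right_is_categorical):
    """Return which side's CategoricalDtype the joined key keeps.

    Hypothetical sketch covering only the mixed case, where exactly one
    of the two key columns is categorical.
    """
    if left_is_categorical == right_is_categorical:
        raise ValueError("sketch only covers exactly one categorical side")
    cat_side = "left" if left_is_categorical else "right"
    # Inner joins, or joins whose base table holds the categorical column,
    # can produce at most the original categories.
    if how == "inner" or how == cat_side:
        return cat_side
    if how in ("left", "right"):
        # The base table is not categorical, so new categories could appear.
        raise TypeError(
            f"can't implicitly cast column to categories from {cat_side} "
            f"during {how} join"
        )
    raise ValueError(f"unsupported join type: {how}")


print(categorical_output_side("inner", False, True))
print(categorical_output_side("left", True, False))
```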

codecov · 2019-11-26T16:46:10Z

Codecov Report

Merging #3451 into branch-0.12 will decrease coverage by <.01%.
The diff coverage is 75.6%.

@@               Coverage Diff               @@
##           branch-0.12    #3451      +/-   ##
===============================================
- Coverage        86.91%   86.91%   -0.01%     
===============================================
  Files               50       50              
  Lines             9464     9469       +5     
===============================================
+ Hits              8226     8230       +4     
- Misses            1238     1239       +1

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/numerical.py	`93.95% <100%> (ø)`	⬆️
python/cudf/cudf/core/index.py	`89.28% <50%> (ø)`	⬆️
python/cudf/cudf/core/buffer.py	`90.74% <50%> (ø)`	⬆️
python/cudf/cudf/utils/ioutils.py	`85.52% <75%> (+0.59%)`	⬆️
python/cudf/cudf/core/dataframe.py	`92.56% <84.61%> (-0.05%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d6947ac...5fd8bf1.

python/cudf/cudf/core/dataframe.py

brandon-b-miller · 2019-12-30T13:55:00Z

@kkraus14 I'm working on a way of benchmarking the approach where we punt to int64 and float64 in int <--> float cases. In doing so I noticed there's potentially some noise in the plots that comes (I think) from the way that variables go in and out of scope and thus free GPU memory, so I'm falling back to a different profiling approach. Details to follow.

The other thing I noticed is that the join itself seems to dominate the runtime regardless, so for some data sizes these tweaks make a fairly negligible difference.

brandon-b-miller · 2020-01-06T20:54:58Z

@kkraus14 It looks like punting to float64 is the fastest of the three approaches. As such, I suppose we could remove _overflow_safe_to, since in that case we wouldn't need to do any checking. Do we think that code is worth keeping, or should it just be removed?

brandon-b-miller · 2020-01-10T16:33:22Z

I had to do a little gymnastics to get the categorical cases right here. In the case of a categorical/non-categorical implicit merge, we want to perform the join by casting both sides to the datatype of the underlying categories first. This leaves us with the task of reconstructing the correct categorical column after the merge takes place. We need to keep track of the original codes that map each integer in the underlying categorical data to the actual category labels, and the cleanest way of doing so seemed to be to carry them along for the actual join. That way we'll always have the category index that originally corresponded to each row in the dataframe handy to use as codes when reconstructing the categorical data. Let me know if this seems reasonable.
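To make the bookkeeping concrete, here is a minimal sketch of the same idea using pandas on the CPU rather than cuDF and libcudf. The column name `key_codes` and the sample data are made up for illustration, and this is not the PR's actual code.

```python
import pandas as pd

# Left key is categorical, with a deliberately non-sorted category order.
left = pd.DataFrame({
    "key": pd.Categorical([10, 20, 30], categories=[30, 20, 10], ordered=True),
    "x": [1, 2, 3],
})
right = pd.DataFrame({"key": [20, 30, 40], "y": [0.2, 0.3, 0.4]})

cat_dtype = left["key"].dtype

# Cast the key to the dtype of its categories, carrying the codes along.
left_plain = pd.DataFrame({
    "key": left["key"].astype(cat_dtype.categories.dtype),
    "key_codes": left["key"].cat.codes,  # hypothetical helper column
    "x": left["x"],
})

merged = left_plain.merge(right, on="key", how="inner")

# Rebuild the categorical key from the codes that survived the join.
rebuilt = pd.Categorical.from_codes(merged["key_codes"].to_numpy(), dtype=cat_dtype)
merged = merged.drop(columns="key_codes").assign(key=rebuilt)

print(merged.dtypes)
```

Because the codes travel with each row, the rebuilt column keeps the original categories and ordering without needing to look the labels up again after the merge.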

cc @shwina

kkraus14

Great work @brandon-b-miller

python/cudf/cudf/core/dataframe.py

shwina · 2020-01-10T21:29:57Z

python/cudf/cudf/core/dataframe.py

-            ctgry_err = "can't implicitly cast column {} to categories \
-                         from right during left join"
+            ctgry_err = "can't implicitly cast column {0} to categories \
+                         from {1} during {1} join"


No need to change this; just a comment that it might be preferable to do something like:

    ctgry_err = (
        "can't implicitly cast column {column_id} to categories"
        " from {how} during {how} join"
    )
    ctgry_err.format(column_id=rcol, how="right")

python/cudf/cudf/core/column/numerical.py

brandon-b-miller added 17 commits November 13, 2019 08:27

baseline numeric test

184b5ba

baseline numeric implementation

e70ef39

test for everything pandas supports, else skip

58aa9c9

BROKEN: handle left-categorical cases

57a4bd7

Merge branch 'branch-0.11' into enh-typecast-on-join

3fb5d95

add datetime only test

817410d

implement upcasting for datetime

84dd42a

abandon pandas logic and invent our own

425afc4

mixed int/float test

5dee693

refactor logic, add tests

19c5641

handle categorical-non categorical merge cases

eccaebc

maybe last categorical bug fixed, all cudf tests pass

ca10de1

remove unused code

618a9cf

handle overflow, refactor

a364ff6

merge 0.11

4abd247

fix tests

40cbf9f

style

cc74a2a

brandon-b-miller requested a review from a team as a code owner November 25, 2019 20:26

changelog

b322d10

brandon-b-miller self-assigned this Nov 25, 2019

brandon-b-miller added Python Affects Python cuDF API. 2 - In Progress Currently a work in progress labels Nov 25, 2019

brandon-b-miller added 2 commits November 26, 2019 07:12

pass colname mismatches to libcudf to error

8865df2

relocate tests to test_joining and rename

94dd0f7

brandon-b-miller added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 26, 2019

brandon-b-miller changed the title ~~[WIP] Add implicit typecasting of join columns when dtypes do not match~~ [REVIEW] Add implicit typecasting of join columns when dtypes do not match Nov 26, 2019

kkraus14 reviewed Dec 3, 2019

python/cudf/cudf/core/dataframe.py

brandon-b-miller changed the base branch from branch-0.11 to branch-0.12 December 24, 2019 15:10

Merge branch 'branch-0.12' into enh-typecast-on-join

e2da964

brandon-b-miller added 5 commits January 7, 2020 14:48

style

09447f4

style

7ecf789

Merge branch 'branch-0.12' into enh-typecast-on-join

38b6d58

fix accidental test inversion from style correction

e47b96f

merge refactor, solve categorical test failures

d6947ac

brandon-b-miller added 3 - Ready for Review Ready for review by team and removed 0 - Waiting on Author Waiting for author to respond to review labels Jan 10, 2020

Merge branch 'branch-0.12' into enh-typecast-on-join

d49d211

kkraus14 approved these changes Jan 10, 2020

kkraus14 requested changes Jan 10, 2020

python/cudf/cudf/core/dataframe.py

brandon-b-miller added 2 commits January 10, 2020 13:13

update error formatting

8ea0e80

raise when categories do not match for a column

8203487

shwina reviewed Jan 10, 2020

python/cudf/cudf/core/column/numerical.py

overflow_safe_to -> can_cast_safely

80de301

kkraus14 approved these changes Jan 11, 2020

kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jan 11, 2020

fix tests

5fd8bf1

kkraus14 merged commit f639bab into rapidsai:branch-0.12 Jan 12, 2020

shwina mentioned this pull request Jan 21, 2020

[BUG] Possible memory leak when merging DataFrames #3848

Closed
