Add tests to check compatibility with pyarrow [databricks] #9289
Conversation
Signed-off-by: Chong Gao <[email protected]>
Please review the test approach first.
The general plan looks good
Premerge failed.
Maybe the above errors are related to the LRU cache, so I pushed a new commit to verify this guess.
Blocked by #9404
@pxLi Please review the CI part files.
LGTM for CI part.
If you want to cover the case in Databricks, please also include updates to https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/databricks/test.sh#L92
elif isinstance(data_gen, DateGen):
    return pa.date32()
elif isinstance(data_gen, TimestampGen):
    return pa.timestamp('us')
Here we use `us`, because Spark does not support `ns` (nanosecond) timestamp precision.
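For illustration, a minimal sketch of how this gen-to-pyarrow type mapping could look; the helper name `_gen_to_arrow_type` is hypothetical, while `DateGen`/`TimestampGen` and the returned arrow types come from the snippet above:

```python
import pyarrow as pa

def _gen_to_arrow_type(data_gen):
    # Map an integration-test data gen to the matching pyarrow type.
    if isinstance(data_gen, DateGen):
        return pa.date32()
    elif isinstance(data_gen, TimestampGen):
        # Spark timestamps are microsecond precision; pyarrow defaults to
        # nanoseconds, which Spark does not support, so pin the unit to 'us'.
        return pa.timestamp('us')
    else:
        raise Exception('unsupported data gen type: ' + str(data_gen))
```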
Please add some code comments here.
Done
LGTM for CI part
Just a few minor nits
# types for test_parquet_read_round_trip_for_pyarrow
sub_gens = all_basic_gens_no_null + [decimal_gen_64bit, decimal_gen_128bit]
nit: can you document why no_nulls is being used here? I assume it has something to do with NaNs instead of nulls.
This PR is mainly comparing the diffs between pyarrow and the GPU.
Spark NullType is not supported by the GPU Parquet reader, so that type is skipped in the comparison.
If you think we should also compare pyarrow and Spark CPU for NullType, then I'll file a follow-up issue.

> I assume it has something to do with NaNs instead of nulls.

No, it has nothing to do with NaNs. The cases already test NaNs in the float types.
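To make that comparison concrete, here is a rough sketch of the round trip being tested, assuming the existing integration-test helpers `with_gpu_session` and `assert_equal`; the two adapter functions and the test name are hypothetical:

```python
import pyarrow.parquet as pa_pq

def test_parquet_read_round_trip_for_pyarrow_sketch(spark_tmp_path, data_gen):
    path = spark_tmp_path + '/PYARROW_DATA'
    # Adapt generated data to a pyarrow table and write it out with pyarrow
    # (gen_to_pyarrow_table is a hypothetical adapter).
    pa_pq.write_table(gen_to_pyarrow_table(data_gen), path)
    # Read the file back with pyarrow and adapt the result to the row format
    # used by the asserts (pyarrow_table_to_cpu_rows is also hypothetical).
    pyarrow_rows = pyarrow_table_to_cpu_rows(pa_pq.read_table(path))
    # Read the same file on the GPU and compare the two results.
    gpu_rows = with_gpu_session(
        lambda spark: spark.read.parquet(path).collect())
    assert_equal(pyarrow_rows, gpu_rows)
```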
@pytest.mark.xfail(reason="Pyarrow reports error: Data size too small for number of values (corrupted file?). Pyarrow can not read the file generated by itself") |
Is there an issue filed for this?
Do we need to port these changes to jenkins/databricks/test.sh as well?
This PR is testing the python package pyarrow; I think pyarrow on Databricks will have the same behavior.
Do we need to file a follow-up issue?
We are going through some of the Spark platform code for both the GPU read and GPU write paths, and that code could theoretically be different on Databricks vs. Apache Spark. That's why we test reads and writes explicitly on Databricks rather than test on Apache Spark and assume it will work on Databricks too.
IMO we should at least update the script to be able to run the pyarrow tests when configured to do so, and we can debate separately whether CI pipelines should or should not run them. With the "we're just testing pyarrow" argument, we only need to test this on a single Spark version and avoid running it on all other Spark versions, Databricks or otherwise. Is that actually the case?
Filed a follow-up #9533
Contributes to #9288

Add tests for write/read by pyarrow and write/read by GPU.

Test scenarios:
Use the existing `data gen` and `assert equals` methods: adapt the data generated by `data gen` to a `pyarrow` table, and adapt the pyarrow reading result to a CPU result to assert against.
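A minimal sketch of that table-adaptation step, reusing the `_gen_to_arrow_type` mapping sketched earlier; the `to_pyarrow_table` helper and its column naming are illustrative assumptions, not the PR's actual code:

```python
import pyarrow as pa

def to_pyarrow_table(column_gens, rows):
    # Build one pyarrow array per column, with the arrow type derived from
    # the corresponding data gen, then assemble the arrays into a table.
    names = ['_c' + str(i) for i in range(len(column_gens))]
    arrays = [
        pa.array([row[i] for row in rows], type=_gen_to_arrow_type(gen))
        for i, gen in enumerate(column_gens)]
    return pa.Table.from_arrays(arrays, names=names)
```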