
Add tests to check compatibility with pyarrow [databricks] #9289

Merged: 18 commits, Oct 24, 2023

Conversation

@res-life (Collaborator) commented Sep 22, 2023:

Contributes to #9288.

Adds tests for write/read with pyarrow and write/read with the GPU.

Test scenarios:

  • Write with pyarrow, then test reads with both pyarrow and the GPU
  • Write with the GPU, then test reads with both pyarrow and the GPU

The tests reuse the existing data-gen and assert-equals utilities, as sketched below: the generated data is adapted into a pyarrow table, and the pyarrow read result is adapted into a CPU result for the assertion.
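A minimal, self-contained sketch of one direction of the round trip (illustrative only: the test name and data are not from this PR, and the real tests read back through Spark on the GPU and compare with the existing assert-equals utilities; plain pyarrow stands in for the GPU side here):

import pyarrow as pa
import pyarrow.parquet as pq

def test_pyarrow_write_read_round_trip(tmp_path):
    # Build a small pyarrow table (the real tests adapt data-gen output instead).
    table = pa.table({
        'id': pa.array([1, 2, 3], type=pa.int64()),
        'name': pa.array(['a', 'b', None], type=pa.string()),
    })
    path = str(tmp_path / 'data.parquet')
    pq.write_table(table, path)      # write by pyarrow
    read_back = pq.read_table(path)  # read back (the GPU read in the real tests)
    assert read_back.equals(table)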

@res-life (Collaborator Author) commented: build

@res-life (Collaborator Author) commented: build

@sameerz added the "test (Only impacts tests)" label on Sep 25, 2023.
@res-life (Collaborator Author) commented:
@jlowe @revans2 Is the test approach in this PR OK?

@res-life (Collaborator Author) commented:
Please review the test approach first.

@res-life (Collaborator Author) commented: build

@revans2 (Collaborator) left a comment:
The general plan looks good.

@res-life changed the title from "[WIP] Add tests for write/read by pyarrow and write/read by GPU" to "[WIP] Add tests for write/read by pyarrow and write/read by GPU [databricks]" on Oct 7, 2023.
@res-life (Collaborator Author) commented Oct 7, 2023: build

@res-life (Collaborator Author) commented Oct 7, 2023: build

1 similar comment

@res-life (Collaborator Author) commented Oct 7, 2023: build

@res-life (Collaborator Author) commented Oct 7, 2023: build

@res-life (Collaborator Author) commented Oct 8, 2023: build

1 similar comment

@res-life (Collaborator Author) commented Oct 8, 2023: build

@res-life (Collaborator Author) commented Oct 8, 2023: build

@res-life (Collaborator Author) commented Oct 9, 2023:
Premerge failed:

FAILED ../../src/main/python/arithmetic_ops_test.py::test_greatest[Decimal(36,-5)][INJECT_OOM] - pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38
FAILED ../../src/main/python/arithmetic_ops_test.py::test_greatest[Decimal(38,-10)] - pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38
= 2 failed, 18909 passed, 2051 skipped, 633 xfailed, 276 xpassed, 70 warnings in 5943.83s (1:39:03) =

________________________ test_greatest[Decimal(36,-5)] _________________________
[gw2] linux -- Python 3.8.10 /usr/bin/python

data_gen = Decimal(36,-5)

    @pytest.mark.parametrize('data_gen', all_basic_gens + _arith_decimal_gens, ids=idfn)
    def test_greatest(data_gen):
        num_cols = 20
>       s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))

../../src/main/python/arithmetic_ops_test.py:959:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/data_gen.py:842: in gen_scalar
    v = list(gen_scalars(data_gen, 1, seed=seed, force_no_nulls=force_no_nulls))
../../src/main/python/data_gen.py:838: in <genexpr>
    return (_mark_as_lit(src.gen(force_no_nulls=force_no_nulls), data_type) for i in range(0, count))
../../src/main/python/data_gen.py:816: in _mark_as_lit
    return f.lit(data).cast(data_type)
../../../.download/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/functions.py:98: in lit
    return col if isinstance(col, Column) else _invoke_function("lit", col)
../../../.download/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/functions.py:58: in _invoke_function
    return Column(jf(*args))
/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-8048-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

a = ('xro356', <py4j.java_gateway.GatewayClient object at 0x7f92b9367dc0>, 'z:org.apache.spark.sql.functions', 'lit')
kw = {}
converted = AnalysisException('decimal can only support precision up to 38', 'org.apache.spark.sql.AnalysisException: decimal can ...:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\n', None)

    def deco(*a, **kw):
        try:
            return f(*a, **kw)
        except py4j.protocol.Py4JJavaError as e:
            converted = convert_exception(e.java_exception)
            if not isinstance(converted, UnknownException):
                # Hide where the exception came from that shows a non-Pythonic
                # JVM exception message.
>               raise converted from None
E               pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38

../../../.download/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/utils.py:117: AnalysisException

@res-life (Collaborator Author) commented Oct 9, 2023:
The errors above may be related to the LRU cache, so I pushed a new commit to verify that guess:

from functools import lru_cache
import random

@lru_cache(maxsize=128, typed=True)    # suspected culprit: remove this line to test
def gen_df_help(data_gen, length, seed):
    # Generate 'length' rows deterministically from the given seed.
    rand = random.Random(seed)
    data_gen.start(rand)
    data = [data_gen.gen() for index in range(0, length)]
    return data
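For context, a toy standalone illustration (not from the PR or from data_gen.py) of why caching keyed on the generator object can be risky: lru_cache keys on argument hash/equality, so two distinct generator objects that compare equal share one cache entry, and mutating a cached generator does not invalidate it.

from functools import lru_cache

class ToyGen:
    # Hypothetical generator whose equality ignores its configuration,
    # mimicking a cache key that is coarser than the generator's actual state.
    def __init__(self, scale):
        self.scale = scale
    def __eq__(self, other):
        return isinstance(other, ToyGen)  # ignores self.scale
    def __hash__(self):
        return hash('ToyGen')

@lru_cache(maxsize=128)
def gen_data(gen, length, seed):
    return [i * gen.scale for i in range(length)]

print(gen_data(ToyGen(1), 3, 0))   # [0, 1, 2]  computed and cached
print(gen_data(ToyGen(10), 3, 0))  # [0, 1, 2]  stale cache hit, not [0, 10, 20]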

@res-life (Collaborator Author) commented Oct 9, 2023: build

@res-life (Collaborator Author) commented Oct 9, 2023:
Blocked by #9404

@res-life changed the base branch from branch-23.10 to branch-23.12 on October 10, 2023.
@res-life marked this pull request as ready for review on October 10, 2023.
@res-life changed the title from "[WIP] Add tests for write/read by pyarrow and write/read by GPU [databricks]" to "Add tests to check compatibility with pyarrow [databricks]" on Oct 10, 2023.
@res-life (Collaborator Author) commented: build

@res-life (Collaborator Author) commented:
@pxLi Please review the CI-related files in the jenkins directory.

@pxLi (Member) left a comment:
LGTM for the CI part.

If you want to cover the case on Databricks, please also include updates to https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/databricks/test.sh#L92

elif isinstance(data_gen, DateGen):
    return pa.date32()
elif isinstance(data_gen, TimestampGen):
    return pa.timestamp('us')
@res-life (Collaborator Author) commented Oct 13, 2023:
Here we use 'us' (microseconds) because Spark does not support 'ns' (nanosecond) timestamps.

A collaborator commented:
Please add some code comments here.

@res-life (Collaborator Author) replied:
Done
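For illustration, a hedged sketch of what such a data-gen-to-pyarrow type mapping can look like (the DateGen/TimestampGen branches mirror the snippet above; the function name and the other branches are assumptions, not quoted from the PR):

import pyarrow as pa
from data_gen import *  # spark-rapids integration-test generators (IntegerGen, DateGen, ...)

def _gen_to_pa_type(data_gen):
    # Map an integration-test data generator to the pyarrow type used
    # when building the equivalent pyarrow table.
    if isinstance(data_gen, IntegerGen):
        return pa.int32()
    elif isinstance(data_gen, LongGen):
        return pa.int64()
    elif isinstance(data_gen, StringGen):
        return pa.string()
    elif isinstance(data_gen, DateGen):
        return pa.date32()
    elif isinstance(data_gen, TimestampGen):
        # Spark timestamps have microsecond precision, so use 'us' rather
        # than pyarrow's default nanosecond unit.
        return pa.timestamp('us')
    else:
        raise Exception('unsupported data gen type: ' + str(data_gen))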

@sameerz requested a review from @mythrocks on October 16, 2023.
@NvTimLiu (Collaborator) left a comment:
LGTM for the CI part.

@res-life (Collaborator Author) commented: build

@res-life (Collaborator Author) commented:
@revans2 @jlowe Would you like to take another look?

@revans2 (Collaborator) left a comment:
Just a few minor nits.



# types for test_parquet_read_round_trip_for_pyarrow
sub_gens = all_basic_gens_no_null + [decimal_gen_64bit, decimal_gen_128bit]
A collaborator commented:
nit: Can you document why no_nulls is being used here? I assume it has something to do with NaNs instead of nulls.

@res-life (Collaborator Author) commented Oct 24, 2023:
This PR mainly compares the diffs between pyarrow and the GPU. Spark's NullType is not supported by the GPU Parquet support, so that type is skipped in the comparison. If you think we should also compare pyarrow against Spark CPU for NullType, I'll file a follow-up issue.

Re "I assume it has something to do with NaNs instead of nulls": no, it has nothing to do with NaNs; the cases already test NaNs in the float types.




@pytest.mark.xfail(reason="Pyarrow reports error: Data size too small for number of values (corrupted file?). Pyarrow cannot read the file generated by itself")
A collaborator commented:
Is there an issue filed for this?

A contributor commented:
Do we need to port these changes to jenkins/databricks/test.sh as well?

@res-life (Collaborator Author) replied:
This PR is testing the python package pyarrow, and I think pyarrow on Databricks will behave the same way. Do we need to file a follow-up issue?

The contributor replied:
We are going through some of the Spark platform code for both the GPU read and GPU write paths, and that code could theoretically differ on Databricks vs. Apache Spark. That's why we test reads and writes explicitly on Databricks rather than testing on Apache Spark and assuming it will work on Databricks too.

IMO we should at least update the script to be able to run the pyarrow tests when configured to do so, and we can debate separately whether CI pipelines should or should not run them. With the "we're just testing pyarrow" argument, we would only need to test this on a single Spark version and avoid running it on all other Spark versions, Databricks or otherwise. Is that actually the case?
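One possible shape for such gating, sketched as an assumption rather than the PR's actual mechanism (the environment variable name is hypothetical): mark the pyarrow test module so it is skipped unless a flag that the CI scripts control is set.

import os
import pytest

# Hypothetical opt-in flag; the real wiring would live in the jenkins scripts.
_pyarrow_tests_enabled = os.environ.get('ENABLE_PYARROW_TESTS', 'false').lower() == 'true'

# Module-level mark: every test in this file is skipped unless the flag is set.
pytestmark = pytest.mark.skipif(
    not _pyarrow_tests_enabled,
    reason='pyarrow compatibility tests disabled; set ENABLE_PYARROW_TESTS=true to run')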

@res-life (Collaborator Author) commented Oct 25, 2023:
Filed a follow-up: #9533

Labels: test (Only impacts tests)
8 participants