
Add tests to check compatibility with pyarrow [databricks] #9289

Merged: 18 commits, Oct 24, 2023

Conversation

@res-life (Collaborator) commented Sep 22, 2023:

Contributes to #9288.

Adds tests for write/read with pyarrow and write/read with the GPU.

Test scenarios:

  • Write with pyarrow, then test reads with both pyarrow and the GPU
  • Write with the GPU, then test reads with both pyarrow and the GPU

The tests reuse the existing data-gen and assert-equals utilities, as sketched below: the generated data is adapted into a pyarrow table, and the pyarrow read result is adapted into a CPU result for the assertion.
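A minimal, self-contained sketch of one direction of the round trip (illustrative only: the test name and data are not from this PR, and the real tests read back through Spark on the GPU and compare with the existing assert-equals utilities; plain pyarrow stands in for the GPU side here):

import pyarrow as pa
import pyarrow.parquet as pq

def test_pyarrow_write_read_round_trip(tmp_path):
    # Build a small pyarrow table (the real tests adapt data-gen output instead).
    table = pa.table({
        'id': pa.array([1, 2, 3], type=pa.int64()),
        'name': pa.array(['a', 'b', None], type=pa.string()),
    })
    path = str(tmp_path / 'data.parquet')
    pq.write_table(table, path)      # write by pyarrow
    read_back = pq.read_table(path)  # read back (the GPU read in the real tests)
    assert read_back.equals(table)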

@res-life (Collaborator Author) commented: build

@res-life (Collaborator Author) commented: build

@sameerz added the "test (Only impacts tests)" label on Sep 25, 2023.
@res-life (Collaborator Author) commented:
@jlowe @revans2 Is the test approach in this PR OK?

@res-life (Collaborator Author) commented:
Please review the test approach first.

@res-life (Collaborator Author) commented: build

@revans2 (Collaborator) left a comment:
The general plan looks good.

@res-life changed the title from "[WIP] Add tests for write/read by pyarrow and write/read by GPU" to "[WIP] Add tests for write/read by pyarrow and write/read by GPU [databricks]" on Oct 7, 2023.
@res-life (Collaborator Author) commented Oct 7, 2023: build

@res-life (Collaborator Author) commented Oct 7, 2023: build

1 similar comment

@res-life (Collaborator Author) commented Oct 7, 2023: build

@res-life (Collaborator Author) commented Oct 7, 2023: build

@res-life (Collaborator Author) commented Oct 8, 2023: build

1 similar comment

@res-life (Collaborator Author) commented Oct 8, 2023: build

@res-life (Collaborator Author) commented Oct 8, 2023: build

@res-life (Collaborator Author) commented Oct 9, 2023:
Premerge failed:

FAILED ../../src/main/python/arithmetic_ops_test.py::test_greatest[Decimal(36,-5)][INJECT_OOM] - pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38
FAILED ../../src/main/python/arithmetic_ops_test.py::test_greatest[Decimal(38,-10)] - pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38
= 2 failed, 18909 passed, 2051 skipped, 633 xfailed, 276 xpassed, 70 warnings in 5943.83s (1:39:03) =

________________________ test_greatest[Decimal(36,-5)] _________________________
[gw2] linux -- Python 3.8.10 /usr/bin/python

data_gen = Decimal(36,-5)

    @pytest.mark.parametrize('data_gen', all_basic_gens + _arith_decimal_gens, ids=idfn)
    def test_greatest(data_gen):
        num_cols = 20
>       s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))

../../src/main/python/arithmetic_ops_test.py:959:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/data_gen.py:842: in gen_scalar
    v = list(gen_scalars(data_gen, 1, seed=seed, force_no_nulls=force_no_nulls))
../../src/main/python/data_gen.py:838: in <genexpr>
    return (_mark_as_lit(src.gen(force_no_nulls=force_no_nulls), data_type) for i in range(0, count))
../../src/main/python/data_gen.py:816: in _mark_as_lit
    return f.lit(data).cast(data_type)
../../../.download/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/functions.py:98: in lit
    return col if isinstance(col, Column) else _invoke_function("lit", col)
../../../.download/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/functions.py:58: in _invoke_function
    return Column(jf(*args))
/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-8048-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

a = ('xro356', <py4j.java_gateway.GatewayClient object at 0x7f92b9367dc0>, 'z:org.apache.spark.sql.functions', 'lit')
kw = {}
converted = AnalysisException('decimal can only support precision up to 38', 'org.apache.spark.sql.AnalysisException: decimal can ...:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\n', None)

    def deco(*a, **kw):
        try:
            return f(*a, **kw)
        except py4j.protocol.Py4JJavaError as e:
            converted = convert_exception(e.java_exception)
            if not isinstance(converted, UnknownException):
                # Hide where the exception came from that shows a non-Pythonic
                # JVM exception message.
>               raise converted from None
E               pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38

../../../.download/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/utils.py:117: AnalysisException

@res-life (Collaborator Author) commented Oct 9, 2023:
The errors above may be related to the LRU cache, so I pushed a new commit to verify that guess:

from functools import lru_cache
import random

@lru_cache(maxsize=128, typed=True)    # suspected culprit: remove this line to test
def gen_df_help(data_gen, length, seed):
    # Generate 'length' rows deterministically from the given seed.
    rand = random.Random(seed)
    data_gen.start(rand)
    data = [data_gen.gen() for index in range(0, length)]
    return data
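For context, a toy standalone illustration (not from the PR or from data_gen.py) of why caching keyed on the generator object can be risky: lru_cache keys on argument hash/equality, so two distinct generator objects that compare equal share one cache entry, and mutating a cached generator does not invalidate it.

from functools import lru_cache

class ToyGen:
    # Hypothetical generator whose equality ignores its configuration,
    # mimicking a cache key that is coarser than the generator's actual state.
    def __init__(self, scale):
        self.scale = scale
    def __eq__(self, other):
        return isinstance(other, ToyGen)  # ignores self.scale
    def __hash__(self):
        return hash('ToyGen')

@lru_cache(maxsize=128)
def gen_data(gen, length, seed):
    return [i * gen.scale for i in range(length)]

print(gen_data(ToyGen(1), 3, 0))   # [0, 1, 2]  computed and cached
print(gen_data(ToyGen(10), 3, 0))  # [0, 1, 2]  stale cache hit, not [0, 10, 20]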

@res-life (Collaborator Author) commented Oct 9, 2023: build

@res-life (Collaborator Author) commented Oct 9, 2023:
Blocked by #9404

@res-life changed the base branch from branch-23.10 to branch-23.12 on October 10, 2023.
@res-life marked this pull request as ready for review on October 10, 2023.
@res-life changed the title from "[WIP] Add tests for write/read by pyarrow and write/read by GPU [databricks]" to "Add tests to check compatibility with pyarrow [databricks]" on Oct 10, 2023.
@res-life (Collaborator Author) commented: build

@res-life (Collaborator Author) commented:
@pxLi Please review the CI-related files in the jenkins directory.

@pxLi (Member) left a comment:
LGTM for the CI part.

If you want to cover the case on Databricks, please also include updates to https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/jenkins/databricks/test.sh#L92

elif isinstance(data_gen, DateGen):
    return pa.date32()
elif isinstance(data_gen, TimestampGen):
    return pa.timestamp('us')
@res-life (Collaborator Author) commented Oct 13, 2023:
Here we use 'us' (microseconds) because Spark does not support 'ns' (nanosecond) timestamps.

A collaborator commented:
Please add some code comments here.

@res-life (Collaborator Author) replied:
Done
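For illustration, a hedged sketch of what such a data-gen-to-pyarrow type mapping can look like (the DateGen/TimestampGen branches mirror the snippet above; the function name and the other branches are assumptions, not quoted from the PR):

import pyarrow as pa
from data_gen import *  # spark-rapids integration-test generators (IntegerGen, DateGen, ...)

def _gen_to_pa_type(data_gen):
    # Map an integration-test data generator to the pyarrow type used
    # when building the equivalent pyarrow table.
    if isinstance(data_gen, IntegerGen):
        return pa.int32()
    elif isinstance(data_gen, LongGen):
        return pa.int64()
    elif isinstance(data_gen, StringGen):
        return pa.string()
    elif isinstance(data_gen, DateGen):
        return pa.date32()
    elif isinstance(data_gen, TimestampGen):
        # Spark timestamps have microsecond precision, so use 'us' rather
        # than pyarrow's default nanosecond unit.
        return pa.timestamp('us')
    else:
        raise Exception('unsupported data gen type: ' + str(data_gen))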

@sameerz requested a review from @mythrocks on October 16, 2023.
@NvTimLiu (Collaborator) left a comment:
LGTM for the CI part.

@res-life (Collaborator Author) commented: build

@res-life (Collaborator Author) commented:
@revans2 @jlowe Would you like to take another look?

@revans2 (Collaborator) left a comment:
Just a few minor nits.



# types for test_parquet_read_round_trip_for_pyarrow
sub_gens = all_basic_gens_no_null + [decimal_gen_64bit, decimal_gen_128bit]
A collaborator commented:
nit: Can you document why no_nulls is being used here? I assume it has something to do with NaNs instead of nulls.

@res-life (Collaborator Author) commented Oct 24, 2023:
This PR mainly compares the diffs between pyarrow and the GPU. Spark's NullType is not supported by the GPU Parquet support, so that type is skipped in the comparison. If you think we should also compare pyarrow against Spark CPU for NullType, I'll file a follow-up issue.

Re "I assume it has something to do with NaNs instead of nulls": no, it has nothing to do with NaNs; the cases already test NaNs in the float types.




@pytest.mark.xfail(reason="Pyarrow reports error: Data size too small for number of values (corrupted file?). Pyarrow cannot read the file generated by itself")
A collaborator commented:
Is there an issue filed for this?

A contributor commented:
Do we need to port these changes to jenkins/databricks/test.sh as well?

@res-life (Collaborator Author) replied:
This PR is testing the python package pyarrow, and I think pyarrow on Databricks will behave the same way. Do we need to file a follow-up issue?

The contributor replied:
We are going through some of the Spark platform code for both the GPU read and GPU write paths, and that code could theoretically differ on Databricks vs. Apache Spark. That's why we test reads and writes explicitly on Databricks rather than testing on Apache Spark and assuming it will work on Databricks too.

IMO we should at least update the script to be able to run the pyarrow tests when configured to do so, and we can debate separately whether CI pipelines should or should not run them. With the "we're just testing pyarrow" argument, we would only need to test this on a single Spark version and avoid running it on all other Spark versions, Databricks or otherwise. Is that actually the case?
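One possible shape for such gating, sketched as an assumption rather than the PR's actual mechanism (the environment variable name is hypothetical): mark the pyarrow test module so it is skipped unless a flag that the CI scripts control is set.

import os
import pytest

# Hypothetical opt-in flag; the real wiring would live in the jenkins scripts.
_pyarrow_tests_enabled = os.environ.get('ENABLE_PYARROW_TESTS', 'false').lower() == 'true'

# Module-level mark: every test in this file is skipped unless the flag is set.
pytestmark = pytest.mark.skipif(
    not _pyarrow_tests_enabled,
    reason='pyarrow compatibility tests disabled; set ENABLE_PYARROW_TESTS=true to run')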

@res-life (Collaborator Author) commented Oct 25, 2023:
Filed a follow-up: #9533

Labels: test (Only impacts tests)
8 participants