Spark-3.5: make `where` sql case sensitive setting alterable in rewrite data files procedure #11439

ludlows · 2024-10-31T07:48:20Z

This PR makes the rewrite data files action (`RewriteDataFilesSparkAction`) honor the user's SQL case-sensitivity setting when it evaluates the `where` clause.
The implementation is simple:
we first read the case-sensitivity setting and store it in a field in the constructor of the action,
then we pass that value to the table scan.

related issue: #11438
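
The two steps described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in this PR: `Scan`, `RewriteAction`, `planWithFilterOn`, and the error message are hypothetical stand-ins that use no Iceberg or Spark classes, and the scan is assumed to default to case-sensitive matching.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the idea in this PR; no Iceberg or Spark classes are used.
public class CaseSensitivitySketch {

  /** Minimal scan model: column names plus a case-sensitivity flag. */
  static final class Scan {
    private final List<String> columns;
    private final boolean caseSensitive;

    Scan(List<String> columns, boolean caseSensitive) {
      this.columns = columns;
      this.caseSensitive = caseSensitive;
    }

    /** Returns a copy of this scan that uses the given case-sensitivity flag. */
    Scan caseSensitive(boolean newValue) {
      return new Scan(columns, newValue);
    }

    /** Resolves a filter column against the schema, or throws if it cannot be bound. */
    String bind(String filterColumn) {
      for (String column : columns) {
        boolean matches =
            caseSensitive
                ? column.equals(filterColumn)
                : column.equalsIgnoreCase(filterColumn);
        if (matches) {
          return column;
        }
      }
      // The message text is made up for this sketch.
      throw new IllegalArgumentException("Cannot find field '" + filterColumn + "'");
    }
  }

  /** Action model: stores the session setting in the constructor and forwards it to the scan. */
  static final class RewriteAction {
    private final List<String> columns;
    private final boolean caseSensitive;

    RewriteAction(List<String> columns, boolean sessionCaseSensitive) {
      this.columns = columns;
      this.caseSensitive = sessionCaseSensitive; // step 1: remember the setting
    }

    String planWithFilterOn(String filterColumn) {
      // Assumption: the scan starts out case sensitive unless told otherwise.
      Scan scan = new Scan(columns, true);
      // Step 2: pass the stored setting to the scan before binding the filter.
      return scan.caseSensitive(caseSensitive).bind(filterColumn);
    }
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("c1", "c2");

    // With the setting off, the upper-case name resolves to the schema column.
    System.out.println(new RewriteAction(schema, false).planWithFilterOn("C1"));

    // With the setting on, the same filter cannot be bound.
    try {
      new RewriteAction(schema, true).planWithFilterOn("C1");
    } catch (IllegalArgumentException e) {
      System.out.println("binding failed: " + e.getMessage());
    }
  }
}
```

In this sketch, dropping the `caseSensitive(...)` call in `planWithFilterOn` would make the filter on `C1` fail regardless of the session setting, which is the kind of behavior the related issue describes.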

singhpk234

@ludlows can you please also add a UT for it to future-proof it?

szehon-ho

Yes, this looks reasonable to me as well. I agree with @singhpk234 about the UT.

ludlows · 2024-11-05T11:05:26Z

Hi @szehon-ho,
please review the test cases when you have time.
One possible problem: the exception type here is IllegalArgumentException rather than the ValidationException mentioned in the issue.

ludlows · 2024-11-05T11:08:32Z

I think I typed the wrong Iceberg version in issue #11438.

...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java

huaxingao · 2024-11-06T04:24:57Z

@ludlows
Thanks for the PR. I have a couple of questions:

1. It seems to me that the tests don't really test the changes in this PR; they would pass even without the fix. I think we should add some tests that would fail without the fix but can pass with it.
2. Do we want to make the Spark SQL configuration spark.sql.caseSensitive apply to Iceberg stored procedure parameters? If so, we probably should apply spark.sql.caseSensitive to all Iceberg stored procedure parameters. Are there other Iceberg stored procedure parameters that should also honor spark.sql.caseSensitive?

ludlows · 2024-11-06T09:42:44Z

> It seems to me that the tests don't really test the changes in this PR; they would pass even without the fix. I think we should add some tests that would fail without the fix but can pass with it.
> Do we want to make the Spark SQL configuration spark.sql.caseSensitive apply to Iceberg stored procedure parameters? If so, we probably should apply spark.sql.caseSensitive to all Iceberg stored procedure parameters. Are there other Iceberg stored procedure parameters that should also honor spark.sql.caseSensitive?

Hi @huaxingao,
thanks for the questions above.
About the second one: as a data engineer, I have had users ask me why not every part of the procedure is case insensitive even though they have set spark.sql.caseSensitive to false. Since the procedure is triggered at the SQL level, why are its parameters not affected by the setting?

I think it is reasonable to apply the SQL case-sensitivity setting to all procedures, but we could take this PR as the starting point.

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java

huaxingao · 2024-11-29T06:26:07Z

@ludlows Thanks for the quick fix. Can we have a test that fails without the fix but passes with it? It seems that all your current tests pass even without the fix.

ludlows · 2024-11-30T08:09:35Z

@huaxingao I think the test method testFilterCaseSensitivityBeforeChange() (leading to validation exception) has shown the bug exists before the PR.

huaxingao · 2024-12-01T08:15:29Z

@ludlows I think you can simply reproduce the problem by something like

    createTable();
    insertData(10);
    sql("SET %s=false", SQLConf.CASE_SENSITIVE().key());
    sql("CALL %s.system.rewrite_data_files(table=>'%s', where=>'C1 > 0')", catalogName, tableIdent);

ludlows · 2024-12-01T13:22:47Z

@huaxingao thanks for the comment.
But I don't think the problem will be raised, since the bug has been fixed by this PR.
Please check the test code below:

    @TestTemplate
    public void testFilterCaseSensitivityAfterChange() {
      createTable();
      insertData(10);
      sql("set spark.sql.caseSensitive=false");
      assertEquals(
          "Should have done nothing but passed the schema validation, since no files are present",
          ImmutableList.of(row(0, 0, 0L, 0)),
          sql(
              "CALL %s.system.rewrite_data_files(table=>'%s', where=>'C1 > 90000000')",
              catalogName, tableIdent));
    }

The test case above passes.

huaxingao · 2024-12-01T18:28:57Z

@ludlows Thanks for the quick reply. I know my example will pass with the PR's fix. However, the problem will arise without the fix. We need a simple test that fails without the fix and passes with it. A straightforward test like my example should suffice, with minimal changes. My goal is to keep the test as simple as possible.

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java

huaxingao

LGTM

szehon-ho · 2024-12-02T21:59:14Z

Thanks @ludlows, and also @huaxingao, @anuragmantri, and @singhpk234 for the reviews.

fix caseSensitive bug in where sql.

1ddd001

github-actions bot added the spark label Oct 31, 2024

add one space

9381e01

singhpk234 reviewed Oct 31, 2024

szehon-ho reviewed Oct 31, 2024

ludlows added 7 commits November 4, 2024 15:32

add the test case

5a2d344

make the call procedure_name recognizable

1c4034b

revert build.gradle

41861f3

move test cases to the extensions part

9f23f76

Merge branch 'apache:main' into main

3344a4e

fixed the bug

2777ca1

caseSensitive test case

6f084f5

anuragmantri approved these changes Nov 4, 2024

ludlows added 2 commits November 5, 2024 08:48

add more tests on truncate partition table

89fd569

change the exception type to IllegalArgumentException

714876e

huaxingao reviewed Nov 6, 2024

...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java (outdated)

huaxingao reviewed Nov 6, 2024

...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java

huaxingao reviewed Nov 27, 2024

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java (outdated)

use SparkUtil to fetch caseSensitive value

9127480

ludlows added 6 commits November 30, 2024 09:05

more test cases

ff18bf6

Merge branch 'main' into rewrite-data-where-sql

a3957ee

code spotlessApply

57dc449

change method name

e1f5f8f

change method argument name

850b1c3

move test cases to spark extensions

f9a129d

change exception class name

810cecb

ludlows added 2 commits December 2, 2024 10:15

only one test case

0e7d8fd

remove empty line

1688278

huaxingao reviewed Dec 2, 2024

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java (outdated)

remove unused method

9e76691

huaxingao approved these changes Dec 2, 2024

szehon-ho approved these changes Dec 2, 2024

szehon-ho merged commit d8326d8 into apache:main Dec 2, 2024
31 checks passed

ludlows deleted the rewrite-data-where-sql branch December 3, 2024 12:01

ludlows mentioned this pull request Dec 4, 2024

Spark 3.3,3.4: Make where clause case sensitive in rewrite data files #11696

Merged

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Spark 3.5: Make where clause case sensitive in rewrite data files (ap…

d7ef1f0

…ache#11439)
