
Spark: Bump Spark minor versions for 3.3 and 3.4 #9187

Merged: 2 commits into apache:main (Dec 6, 2023)

Conversation

ajantha-bhat
Member

Spark 3.4.2 was released yesterday with security and correctness fixes: https://spark.apache.org/news/spark-3-4-2-released.html

Spark 3.3.3 was released two months ago: https://spark.apache.org/news/spark-3-3-3-released.html
@ajantha-bhat changed the title from "Bump Spark minor versions for 3.3 and 3.4" to "Spark: Bump Spark minor versions for 3.3 and 3.4" (Dec 1, 2023)
```diff
@@ -77,8 +77,8 @@ scala-collection-compat = "2.11.0"
 slf4j = "1.7.36"
 snowflake-jdbc = "3.14.3"
 spark-hive32 = "3.2.2"
-spark-hive33 = "3.3.2"
-spark-hive34 = "3.4.1"
+spark-hive33 = "3.3.3"
+spark-hive34 = "3.4.2"
```
Member Author


Not sure why dependabot didn't update it.

@RussellSpitzer
Member

```
org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table '`spark_catalog`.`default`.`source_table`':
- Cannot safely cast 'id': string to int.
```

@ajantha-bhat
Member Author

```
org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table '`spark_catalog`.`default`.`source_table`':
- Cannot safely cast 'id': string to int.
```

I saw that, but I didn't know which exact change in Spark 3.4.2 caused it. I was occupied with other work; I will analyze this today.

@github-actions bot added the spark label (Dec 2, 2023)
@ajantha-bhat
Member Author

@RussellSpitzer and @aokolnychyi: I have spent some time on this.

The issue occurs only with the SparkCatalogConfig.SPARK catalog type, and only for tables created with the 'location' option.

I suspect a cleanup issue for tables created with 'location': after adding PURGE to dropTables() in these test cases, the tests pass.

I don't work closely with Spark, so I am not sure which change caused it (was it https://issues.apache.org/jira/browse/SPARK-43203?). Feel free to add your analysis.

```diff
@@ -77,7 +77,7 @@ public void setupTempDirs() {

   @After
   public void dropTables() {
-    sql("DROP TABLE IF EXISTS %s", sourceTableName);
+    sql("DROP TABLE IF EXISTS %s PURGE", sourceTableName);
```
Member Author


Many test cases were failing with the 3.4.2 version bump.

More details: #9187 (comment)

Member


Yes, the issue here is that non-Iceberg tables are being created in order to test conversion and addition of files from non-Iceberg tables into Iceberg tables. The Spark session catalog is probably now (correctly) treating them as external tables (not Hive-managed tables) and only dropping metadata instead of clearing out the whole table.

This is solely a Spark-side change and has no impact on any Iceberg code that I know of.
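A minimal sketch of that drop behavior, assuming a local Hive-enabled SparkSession (the table name and path are illustrative):

```java
import org.apache.spark.sql.SparkSession;

public class ExternalDropSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("external-drop-sketch")
        .master("local[1]")
        .enableHiveSupport()
        .getOrCreate();

    // An explicit LOCATION makes this an external (not Hive-managed) table.
    spark.sql("CREATE TABLE source_table (id INT) USING parquet LOCATION '/tmp/source_table'");
    spark.sql("INSERT INTO source_table VALUES (1), (2)");

    // On Spark 3.4.2 this removes only the catalog entry; the files under
    // /tmp/source_table stay on disk.
    spark.sql("DROP TABLE source_table");

    // The fix in this PR drops with PURGE instead, which removes the data
    // files as well:
    // spark.sql("DROP TABLE source_table PURGE");

    spark.stop();
  }
}
```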

@ajantha-bhat
Member Author

ping

@RussellSpitzer merged commit 70ec4e5 into apache:main (Dec 6, 2023)
45 checks passed
@RussellSpitzer
Member

Thanks @ajantha-bhat for the PR and @Fokko for review

lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
@manuzhang
Collaborator

@RussellSpitzer @ajantha-bhat Will this break Iceberg applications running on Spark 3.4.1 and Spark 3.4.0 due to https://issues.apache.org/jira/browse/SPARK-43203?

@ajantha-bhat
Member Author

ajantha-bhat commented Jan 22, 2024

I think the test case was failing for non-Iceberg tables managed by the Spark session catalog, so there should be no impact for Iceberg tables/users?

@chinnaraolalam

@ajantha-bhat @RussellSpitzer The test case was failing for non-Iceberg tables, but this should not affect non-Iceberg tables. This might be an issue.

I can see 2 cases:

CASE 1: Launching a spark-sql session with the default Spark session catalog (Iceberg tables will not work here) can manage non-Iceberg tables. When a non-Iceberg table such as a Parquet table is dropped, the data is purged and nothing is left on disk.

CASE 2: Launching a spark-sql session with the Iceberg-provided SparkSessionCatalog (where both Iceberg and non-Iceberg tables can be managed) works fine for Iceberg tables, but when operating on non-Iceberg tables such as a Parquet table, dropping the table does not purge the data, and the data is leaked until manual cleanup.

So the CASE 1 and CASE 2 behaviour is different; moreover, launching spark-sql with the Iceberg session catalog brings a behavioural change for non-Iceberg tables (this is an issue).

In CASE 2:

```sql
CREATE TABLE parquettable (id bigint, data string) USING parquet;
INSERT INTO parquettable VALUES (1, 'A'), (2, 'B'), (3, 'C');
SELECT id, data FROM parquettable WHERE length(data) = 1;
DROP TABLE parquettable;
-- This re-create fails with [LOCATION_ALREADY_EXISTS],
-- because the drop did not purge the table location:
CREATE TABLE parquettable (id bigint, data string) USING parquet;
```

Whereas in CASE 1 it passes without any error.

Please add your thoughts.
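A minimal sketch of the CASE 2 setup and failing sequence above; the catalog configuration follows Iceberg's documented SparkSessionCatalog usage, the app name and local master are illustrative, and the iceberg-spark-runtime jar matching your Spark/Scala version is assumed to be on the classpath:

```java
import org.apache.spark.sql.SparkSession;

public class SessionCatalogSketch {
  public static void main(String[] args) {
    // CASE 1 is this same builder without the two catalog configs:
    // the default Spark session catalog then handles all tables.
    SparkSession spark = SparkSession.builder()
        .appName("session-catalog-sketch")
        .master("local[1]")
        .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
        .config("spark.sql.catalog.spark_catalog.type", "hive")
        .enableHiveSupport()
        .getOrCreate();

    // The CASE 2 sequence from the comment above:
    spark.sql("CREATE TABLE parquettable (id bigint, data string) USING parquet");
    spark.sql("INSERT INTO parquettable VALUES (1, 'A'), (2, 'B'), (3, 'C')");
    spark.sql("DROP TABLE parquettable");

    // If the drop left the table directory behind, this re-create fails
    // with LOCATION_ALREADY_EXISTS:
    spark.sql("CREATE TABLE parquettable (id bigint, data string) USING parquet");

    spark.stop();
  }
}
```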

@manuzhang
Collaborator

@chinnaraolalam which versions of Iceberg and Spark are you using in test cases?

@RussellSpitzer
Member

In our internal builds we have always defaulted to purge off for Iceberg (even before there was an option), for safety. I prefer that behavior and don't really mind that it's different for non-Iceberg tables.

What is the suggestion here on what we should do? Also, should we start a new issue?
