Spark: Bump Spark minor versions for 3.3 and 3.4 #9187
Conversation
Spark 3.4.2 was released yesterday with security and correctness fixes: https://spark.apache.org/news/spark-3-4-2-released.html

Spark 3.3.3 was released two months ago: https://spark.apache.org/news/spark-3-3-3-released.html
```diff
@@ -77,8 +77,8 @@ scala-collection-compat = "2.11.0"
 slf4j = "1.7.36"
 snowflake-jdbc = "3.14.3"
 spark-hive32 = "3.2.2"
-spark-hive33 = "3.3.2"
-spark-hive34 = "3.4.1"
+spark-hive33 = "3.3.3"
+spark-hive34 = "3.4.2"
```
Not sure why dependabot didn't update it.
I saw that, but I didn't know which exact change in Spark 3.4.2 caused it. I was occupied with other work; I will analyze this today.
@RussellSpitzer and @aokolnychyi: I have spent some time on this. The issue seems limited to a cleanup problem for tables created with 'location'. I don't work closely with Spark, so I am not sure what caused it (was it https://issues.apache.org/jira/browse/SPARK-43203?). Feel free to add your analysis.
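For reference, a minimal sketch of the suspected failure mode (table name and path are hypothetical, not taken from the PR):

```sql
-- Hypothetical repro: a non-Iceberg table created with an explicit location.
CREATE TABLE src_tbl (id BIGINT, data STRING)
USING parquet
LOCATION '/tmp/src_tbl';

-- On Spark 3.4.2 a plain DROP TABLE appears to remove only the catalog
-- metadata for such tables, leaving the files under /tmp/src_tbl behind.
DROP TABLE src_tbl;
```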
```diff
@@ -77,7 +77,7 @@ public void setupTempDirs() {
 
   @After
   public void dropTables() {
-    sql("DROP TABLE IF EXISTS %s", sourceTableName);
+    sql("DROP TABLE IF EXISTS %s PURGE", sourceTableName);
```
Many test cases were failing with the 3.4.2 version bump.
More details: #9187 (comment)
Yes, the issue here is that non-Iceberg tables are being created in order to test conversion and addition of files from non-Iceberg tables into Iceberg tables. The Spark session catalog is probably now (correctly) treating them as external tables (not Hive-managed tables) and only dropping their metadata instead of clearing out the whole table.

This is solely a Spark-side change and has no impact on any Iceberg code that I know of.
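To illustrate the distinction (a minimal sketch with hypothetical table names; the PURGE form matches the test fix above):

```sql
-- Managed table: Spark owns the data, so DROP TABLE removes metadata and files.
CREATE TABLE managed_t (id BIGINT) USING parquet;
DROP TABLE managed_t;

-- External table (explicit LOCATION): DROP TABLE now removes only the
-- metadata, so the tests add PURGE to also delete the underlying files.
CREATE TABLE external_t (id BIGINT) USING parquet LOCATION '/tmp/external_t';
DROP TABLE external_t PURGE;
```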
ping

Thanks @ajantha-bhat for the PR and @Fokko for the review.

@RussellSpitzer @ajantha-bhat Will this break Iceberg applications running on Spark 3.4.1 and Spark 3.4.0 due to https://issues.apache.org/jira/browse/SPARK-43203?
I think the test case was failing for non-Iceberg tables managed by the Spark session catalog. So there should be no impact for Iceberg tables/users?
@ajantha-bhat @RussellSpitzer The test case was failing for a non-Iceberg table, but the change should not affect non-Iceberg tables. This might be an issue. I can see two cases:

CASE 1: Launching a spark-sql session with the session catalog (the Spark default; Iceberg tables will not work here), which can manage non-Iceberg tables. When a non-Iceberg table such as a Parquet table is dropped, the data is purged and no files are left on disk.

CASE 2: Launching a spark-sql session with the Iceberg-provided Spark session catalog (here both Iceberg and non-Iceberg tables can be managed) works fine for Iceberg tables. But when operating on non-Iceberg tables such as a Parquet table, dropping the table does not purge the data, and the data is leaked until it is cleaned up manually.

So the CASE 1 and CASE 2 behaviours differ; launching spark-sql with the Iceberg session catalog brings a behavioural change for non-Iceberg tables (which is an issue). In CASE 2, after dropping the table, recreating it can fail because the old data is still at the table location:

CREATE TABLE parquettable (id bigint, data string) USING parquet;

whereas in CASE 1 this passes and does not throw any error. Please add your thoughts.
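For context, the two setups being compared look roughly like this (a sketch using Iceberg's documented SparkSessionCatalog settings; the table name is hypothetical):

```sql
-- CASE 1: plain spark-sql with the default session catalog:
--   spark-sql
-- CASE 2: spark-sql with Iceberg's catalog wrapping the session catalog:
--   spark-sql \
--     --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--     --conf spark.sql.catalog.spark_catalog.type=hive

-- Same statements in both sessions; per the report above, only CASE 1
-- removes the Parquet files on drop.
CREATE TABLE parquettable (id BIGINT, data STRING) USING parquet;
DROP TABLE parquettable;
```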
@chinnaraolalam Which versions of Iceberg and Spark are you using in the test cases?
In our internal builds we have always defaulted to purge off for Iceberg (even before there was an option), for safety. I prefer that behavior and don't really mind that it's different for non-Iceberg tables. What is the suggestion here on what we should do? Also, should we start a new issue?