Can't delete from Spark DeltaLake table if it has a timestamp field #1478

Closed
bhuacret opened this issue Jun 20, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@bhuacret

Environment

Delta-rs version: 0.12.0
Binding: Rust

Environment: dev

  • Cloud provider: Azure
  • OS: Linux/Windows

Bug

What happened:
I'm trying to delete some data from a Delta table in Rust, using with_predicate:

let (table, _) = DeltaOps(table)
    .delete()
    .with_predicate(col("fechabloqueo").eq(cast(lit("2022-11-06"), Date32)))
    .await
    .unwrap();

but I'm getting: "Execution error: Failed to map column projection for field fechacreacion. Incompatible data types Timestamp(Nanosecond, None) and Timestamp(Microsecond, None)"

What you expected to happen:

Delete successfully with that condition.

How to reproduce it:

More details:
The table was created using the default INT96 outputTimestampType option.

If I rewrite the table using spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS"), it still fails:
"Execution error: Failed to map column projection for field fechacreacion. Incompatible data types Timestamp(Microsecond, Some(\"UTC\")) and Timestamp(Microsecond, None)"

@bhuacret bhuacret added the bug Something isn't working label Jun 20, 2023
@cmackenzie1 (Contributor)

Out of curiosity, is spark.sql.parquet.timestampNTZ.enabled set to true or false?

@bhuacret (Author)

Just tested with spark.sql.parquet.timestampNTZ.enabled set to both true and false on Spark 3.4.0; I keep getting the same error.

@Blajda (Collaborator) commented Jun 29, 2023

This is caused by ambiguity in how timestamps should be stored, per the serialization section of the protocol. The older version of the protocol specified that timestamp columns have no timezone, but this change modified them to be adjusted to UTC.

I did some local testing. Updating the code here to include the UTC timezone, and performing casts in the user-supplied query (e.g. cast(lit("2022-11-06"), arrow_schema::DataType::Timestamp(arrow_schema::TimeUnit::Microsecond, Some("UTC".into())))), fixes this issue. I'm just not certain whether this change is a breaking change.
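
As a stopgap, a minimal sketch of that workaround applied to the predicate from the original report. Hedged: import paths and the with_predicate signature vary across delta-rs/datafusion versions, and the column name and literal are simply the ones used above.

use arrow_schema::{DataType, TimeUnit};
use datafusion::prelude::{cast, col, lit};
use deltalake::DeltaOps;

// Cast the literal to a UTC-adjusted microsecond timestamp, matching the
// Timestamp(Microsecond, Some("UTC")) type read back from the parquet files,
// instead of casting to Date32 as in the original report.
let predicate = col("fechabloqueo").eq(cast(
    lit("2022-11-06"),
    DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())),
));
let (table, _metrics) = DeltaOps(table).delete().with_predicate(predicate).await?;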

related: delta-io/delta#643

@ion-elgreco (Collaborator) commented Aug 19, 2024

Resolved by #2615, which uses logical plans in delete; datafusion/arrow-rs can then do type coercion during parquet reading.
