branch schema affected by main table schema #9737
Comments
When fetching data from a branch, the schema associated with the branch should be used, not the table schema. But for operations like cherry-picking from a branch to main, it should resolve conflicts between main and the branch, possibly taking the main branch's table schema into account and reconciling the two. Let me know your thoughts @rdblue @nastra |
@namrathamyske I've opened #10055 to clarify which schema is being used when. |
@namrathamyske you can force reading with the snapshot id on a branch by using the time travel statement.
This uses the current snapshot and is equivalent to reading from the head of the branch with the snapshot schema. However, the branch's write schema will track the table schema. |
I just checked this workaround and it actually returns the latest snapshot id of the
iceberg/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java, lines 224 to 225 in 2a39af8 |
Thanks @danielcweeks @nastra ! |
@namrathamyske it was pointed out to me that workaround may not be working correctly for branches, which is something we might need to address. |
Looks like we are disabling the workaround from #10059. But there is one more issue with the above solution:
Do we want to create a no-op snapshot when a branch is created on the current snapshot? This was also suggested in #7075 (comment) |
The reason for #10059 is that we don't support time travel on branches themselves, because no history tracking is available on branches. The workaround that you can use is documented in #10055, where you can fetch the latest snapshot id of the given branch and then use that snapshot id in the |
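The workaround described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: it does not use the Iceberg or Spark APIs, and every name in it (`refs`, `snapshots`, `schemas`, `head_snapshot_id`, `read_at_snapshot`, the column names and the snapshot id 100) is hypothetical. It relies only on the fact that each snapshot records the id of the schema it was written with and that a branch ref points at a snapshot id.

```python
# Toy in-memory model; none of these names are Iceberg APIs.
# Schema 1 plays the role of the table's current schema after an ALTER on main.
schemas = {0: ["id", "data"], 1: ["id", "data", "new_col"]}
snapshots = {100: {"schema_id": 0, "rows": [{"id": 1, "data": "a"}]}}
refs = {"main": 100, "test_branch": 100}


def head_snapshot_id(refs, branch):
    """Step 1 of the workaround: look up the snapshot the branch points at."""
    return refs[branch]


def read_at_snapshot(snapshots, schemas, snapshot_id):
    """Step 2: read by snapshot id, projecting rows onto that snapshot's schema."""
    snapshot = snapshots[snapshot_id]
    columns = schemas[snapshot["schema_id"]]
    return [{c: row.get(c) for c in columns} for row in snapshot["rows"]]


snapshot_id = head_snapshot_id(refs, "test_branch")
rows = read_at_snapshot(snapshots, schemas, snapshot_id)
```

In this model the read goes through the snapshot's own schema, so the column added later on main does not show up in the result.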
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. |
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' |
Apache Iceberg version
main (development)
Query engine
None
Please describe the bug 🐞
Regarding PR #9131: the change reads as "Schema for a branch should return table schema".
Shouldn't the schema of a branch stay the same as when the branch was created, rather than moving to a future state after a schema change on the table, as the above change does? Isn't the point of branching to create a baseline from the state of the table's data and metadata at the time it was branched? Could you please help me understand the rationale behind this change?
Please consider this example:
Describe and Query the table & branch:
Alter the table - using the below statement to diverge the definition of the table:
Behavior before the above PR: [Please NOTE that the changes in the main branch DID NOT IMPACT the data and metadata on the branch, which looks like the desirable behavior for any branching concept]
Behavior after the above PR: [Please NOTE that a schema change in the main branch IMPACTED the data and metadata available on the branch, which feels like undesirable behavior]
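The difference between the two behaviors can be illustrated with a small toy model. This is a sketch under stated assumptions, not the Iceberg implementation: `TableMetadata`, `branch_schema`, the `use_table_schema` flag, the column names and the snapshot id are all hypothetical, chosen only to contrast "resolve the branch schema from its head snapshot" with "return the table's current schema".

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    """Hypothetical, simplified stand-in for table metadata."""
    schemas: dict            # schema id -> list of column names
    current_schema_id: int   # schema currently used by the table (main)
    snapshot_schema: dict    # snapshot id -> schema id it was written with
    refs: dict               # branch name -> head snapshot id


def branch_schema(meta, branch, use_table_schema):
    """Resolve the schema reported for a branch under one of two policies."""
    if use_table_schema:
        # Policy described as "after the PR": always the table's current schema.
        return meta.schemas[meta.current_schema_id]
    # Policy described as "before the PR": schema of the branch's head snapshot.
    head = meta.refs[branch]
    return meta.schemas[meta.snapshot_schema[head]]


meta = TableMetadata(
    schemas={0: ["id", "data"]},
    current_schema_id=0,
    snapshot_schema={100: 0},
    refs={"main": 100, "test_branch": 100},
)

# Simulate an ALTER TABLE on main after the branch exists:
# a new schema is registered and becomes current; no new snapshot is written.
meta.schemas[1] = ["id", "data", "new_col"]
meta.current_schema_id = 1

before = branch_schema(meta, "test_branch", use_table_schema=False)
after = branch_schema(meta, "test_branch", use_table_schema=True)
```

Under the first policy the branch keeps reporting the columns it had when it was created; under the second it picks up the column added on main, which is the behavior this issue questions.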
Unit test to replicate the issue:
@SreeramGarlapati @jackieo168
cc: @rdblue @nastra