
branch schema affected by main table schema #9737

Closed
namrathamyske opened this issue Feb 16, 2024 · 10 comments

namrathamyske commented Feb 16, 2024

Apache Iceberg version

main (development)

Query engine

None

Please describe the bug 🐞

Regarding PR #9131, the change reads: "Schema for a branch should return table schema."
Shouldn't the schema of a branch be the same as when the branch was created, rather than tracking future schema changes on the table? Isn't the concept of branching to create a baseline based on the state of the table's data and metadata as of when it was branched? Can you please help me understand the rationale behind this change?

Please consider this example:

-- create a table with a single column and insert a value
spark-sql (default)> create table t (s string);
spark-sql (default)> insert into t values ('foo');
-- create a branch, the schema is the same as the original table
spark-sql (default)> alter table t create branch b1;

Describe and Query the table & branch:

spark-sql (default)> describe default.t;
s                       string
spark-sql (default)> select * from default.t;
s
foo

spark-sql (default)> describe default.t.branch_b1;
s                       string
spark-sql (default)> select * from default.t.branch_b1;
s
foo

Alter the table using the statements below to diverge its definition:

spark-sql (default)> alter table t add column i int;
spark-sql (default)> alter table t drop column s;

spark-sql (default)> insert into t values (111);

Behavior before the above PR [note that the schema change on main DID NOT IMPACT the data and metadata on the branch, which looks like the desirable behavior for any branching concept]:

spark-sql (default)> describe default.t;
i                       int
spark-sql (default)> select * from default.t;
i
111

spark-sql (default)> describe default.t.branch_b1;
s                       string
spark-sql (default)> select * from default.t.branch_b1;
s
foo

Behavior after the above PR [note that a schema change on main IMPACTED the data and metadata available on the branch, which feels like undesirable behavior]:

spark-sql (default)> describe default.t;
i                       int
spark-sql (default)> select * from default.t;
i
111

spark-sql (default)> describe default.t.branch_b1;
i                       int
spark-sql (default)> select * from default.t.branch_b1;
i
--no-data--
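The empty branch result can be understood as a projection problem: the branch's data files only contain column `s`, but the read now projects the table's current schema onto them. A minimal, illustrative Python sketch (a model of the behavior, not Iceberg internals):

```python
# Illustrative model (not Iceberg code): rows are stored with the columns
# they were written with; a read projects the reader's schema onto each
# stored row, filling missing columns with None.

def project(rows, schema):
    """Project stored rows onto the reader's schema."""
    return [{col: row.get(col) for col in schema} for row in rows]

# Data committed on branch b1 before the schema change: column `s` only.
branch_rows = [{"s": "foo"}]

# Before the PR: branch reads used the branch snapshot's schema.
before = project(branch_rows, schema=["s"])
assert before == [{"s": "foo"}]   # 'foo' is visible

# After the PR: branch reads use the table's current schema (`i` only),
# so the stored `s` values are projected away and `i` is null.
after = project(branch_rows, schema=["i"])
assert after == [{"i": None}]     # the branch appears to have no data
```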

Unit test to replicate the issue:

@Test
  public void testSchemaChange() throws Exception {
    Assume.assumeFalse("Avro does not support metadata delete", fileFormat.equals("avro"));
    createAndInitUnpartitionedTable();

    sql("INSERT INTO TABLE %s VALUES (1, 'hr'), (2, 'hardware'), (null, 'hr')", tableName);
    createBranchIfNeeded();

    String sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
    spark.sql(sql).show();
    /**
     * +----+--------+
     * |  id|     dep|
     * +----+--------+
     * |NULL|      hr|
     * |   1|      hr|
     * |   2|hardware|
     * +----+--------+
     */
    // Metadata Delete
    Table table = Spark3Util.loadIcebergTable(spark, tableName);
    table.refresh();
    table.updateSchema().deleteColumn("dep").commit();
    sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
    spark.sql(sql).show();
    /**
     * Data loss on the branch: the read consumes the table's current schema
     * +----+
     * |  id|
     * +----+
     * |NULL|
     * |   1|
     * |   2|
     * +----+
     */
    sql = String.format("SELECT * FROM %s ORDER BY id", tableName);
    spark.sql(sql).show();
    /**
     * +----+
     * |  id|
     * +----+
     * |NULL|
     * |   1|
     * |   2|
     * +----+
     */
  }

@SreeramGarlapati @jackieo168

cc: @rdblue @nastra


namrathamyske commented Feb 19, 2024

When fetching data from a branch, the schema associated with the branch should be used, not the table's current schema. But for operations like cherry-picking from a branch to main, conflicts between main and the branch should be resolved, possibly by considering the main table schema and reconciling. Let me know your thoughts @rdblue @nastra


nastra commented Mar 28, 2024

@namrathamyske I've opened #10055 to clarify which schema is being used when.

danielcweeks commented:

@namrathamyske you can force reading a branch at its current snapshot by using a time travel statement:

select * from default.t.branch_b1 for timestamp as of now();

This will use the current snapshot and is equivalent to reading from the head of the branch with the snapshot's schema. However, the branch's write schema will track the table schema.


nastra commented Mar 28, 2024


I just checked this workaround and it actually returns the latest snapshot id of the main branch rather than the latest snapshot id of b1 because there's no logic in

long snapshotId = SnapshotUtil.snapshotIdAsOfTime(sparkTable.table(), timestampMillis);
return sparkTable.copyWithSnapshotId(snapshotId);
that takes the branch into account. I've opened #10058 to address that
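The missing piece can be sketched as an ancestry walk: instead of resolving the timestamp against the table's main history, resolve it against the chain of parents starting from the branch head. A hypothetical Python model (names like `Snapshot` and `refs` are illustrative, not Iceberg's actual API):

```python
# Hypothetical branch-aware "snapshot as of time": walk the ancestry of
# the branch head rather than the table's main history.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    parent_id: Optional[int]

snapshots = {
    1: Snapshot(1, 100, None),   # shared ancestor
    2: Snapshot(2, 200, 1),      # committed on main
    3: Snapshot(3, 150, 1),      # committed on branch b1
}
refs = {"main": 2, "b1": 3}      # ref name -> head snapshot id

def snapshot_as_of_time(branch, ts_ms):
    """Latest ancestor of the branch head with timestamp <= ts_ms."""
    sid = refs[branch]
    while sid is not None:
        snap = snapshots[sid]
        if snap.timestamp_ms <= ts_ms:
            return snap.snapshot_id
        sid = snap.parent_id
    raise ValueError("no snapshot on %s as of %d" % (branch, ts_ms))

# As of t=250, main resolves to snapshot 2 while b1 resolves to snapshot 3,
# instead of both resolving through main's history.
assert snapshot_as_of_time("main", 250) == 2
assert snapshot_as_of_time("b1", 250) == 3
```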

namrathamyske commented:

Thanks @danielcweeks @nastra !

danielcweeks commented:

@namrathamyske it was pointed out to me that the workaround may not be working correctly for branches, which is something we might need to address.


namrathamyske commented Mar 28, 2024

Looks like we are disabling the workaround in #10059.
A more future-proof fix is to track the snapshot's schemaId in the SnapshotRef object. For tagBuilder we may not pass a schemaId, but for branchBuilder we can.

But there is one more issue with the above solution:

snapshot1 is created at t1 and points to schema1
the schema progresses from schema1 to schema2, but snapshot1 still points to schema1
should branch1 be created from schema1 or schema2?

Do we want to create a no-op snapshot when a branch is created on the current snapshot?
The no-op snapshot would point to the latest schema, and the branch could then be created on top of it so that it points to the latest schema.

Also suggested in #7075 (comment)
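The schemaId-on-ref proposal can be sketched as follows. This is an illustrative Python model of the idea, not Iceberg's actual SnapshotRef API; the field names and the `read_schema` helper are hypothetical:

```python
# Sketch of the proposal: a branch ref additionally records the schema id
# it was created with, while a tag ref does not.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SnapshotRef:
    snapshot_id: int
    ref_type: str                   # "branch" or "tag"
    schema_id: Optional[int] = None # only set by the branch builder

def branch_ref(snapshot_id, schema_id):
    return SnapshotRef(snapshot_id, "branch", schema_id)

def tag_ref(snapshot_id):
    return SnapshotRef(snapshot_id, "tag")

schemas = {1: ["s"], 2: ["i"]}      # schema id -> columns

def read_schema(ref, current_schema_id):
    # A branch read uses the schema pinned at branch creation;
    # everything else falls back to the table's current schema.
    if ref.ref_type == "branch" and ref.schema_id is not None:
        return schemas[ref.schema_id]
    return schemas[current_schema_id]

b1 = branch_ref(snapshot_id=3, schema_id=1)
assert read_schema(b1, current_schema_id=2) == ["s"]       # pinned schema
assert read_schema(tag_ref(3), current_schema_id=2) == ["i"]
```

The open question from the comment above maps directly onto the `schema_id` argument passed to `branch_ref`: should it be the schema of the snapshot being branched (schema1) or the table's current schema (schema2)?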


nastra commented Mar 30, 2024

Looks like we are disabling the workaround from #10059.

The reason for #10059 is that we don't support time travel on branches themselves, because there is no history tracking available on branches.

The workaround you can use is documented in #10055: fetch the latest snapshot id of the given branch and then use that snapshot id in a VERSION AS OF statement: SELECT * FROM db.table VERSION AS OF 8109744798576441359
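The two-step workaround can be modeled in Python. This is an illustrative sketch, not Iceberg's API; the `refs` mapping and helper names are hypothetical stand-ins for looking up a branch head and reading pinned to a snapshot id:

```python
# Step 1: resolve the branch's current snapshot id (equivalent to looking
# it up in the table's refs metadata).
# Step 2: read pinned to that snapshot id (equivalent to VERSION AS OF),
# which uses the snapshot's own schema.

refs = {"main": 2, "b1": 3}                           # ref -> head snapshot id
snapshot_rows = {2: [{"i": 111}], 3: [{"s": "foo"}]}  # snapshot id -> rows

def latest_snapshot_id(branch):
    return refs[branch]

def read_version_as_of(snapshot_id):
    return snapshot_rows[snapshot_id]

sid = latest_snapshot_id("b1")
assert sid == 3
assert read_version_as_of(sid) == [{"s": "foo"}]  # branch data, old schema
```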

github-actions bot commented:

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 21, 2024

github-actions bot commented Nov 5, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 5, 2024