
branch schema affected by main table schema #9737

Closed
namrathamyske opened this issue Feb 16, 2024 · 10 comments

namrathamyske commented Feb 16, 2024

Apache Iceberg version

main (development)

Query engine

None

Please describe the bug 🐞

Regarding PR #9131, the change reads: "Schema for a branch should return table schema."
Shouldn't the schema of a branch be the same as when the branch was created, rather than tracking future schema changes on the table? Isn't the concept of branching to create a baseline based on the state of the table's data and metadata as of when it was branched? Can you please help me understand the rationale behind this change?

Please consider this example:

-- create a table with a single column and insert a value
spark-sql (default)> create table t (s string);
spark-sql (default)> insert into t values ('foo');
-- create a branch, the schema is the same as the original table
spark-sql (default)> alter table t create branch b1;

Describe and Query the table & branch:

spark-sql (default)> describe default.t;
s                       string
spark-sql (default)> select * from default.t;
s
foo

spark-sql (default)> describe default.t.branch_b1;
s                       string
spark-sql (default)> select * from default.t.branch_b1;
s
foo

Alter the table using the statements below to diverge its definition:

spark-sql (default)> alter table t add column i int;
spark-sql (default)> alter table t drop column s;

spark-sql (default)> insert into t values (111);

Behavior before the above PR [note that the schema change on main DID NOT IMPACT the data and metadata on the branch, which looks like the desirable behavior for any branching concept]:

spark-sql (default)> describe default.t;
i                       int
spark-sql (default)> select * from default.t;
i
111

spark-sql (default)> describe default.t.branch_b1;
s                       string
spark-sql (default)> select * from default.t.branch_b1;
s
foo

Behavior after the above PR [note that a schema change on main IMPACTED the data and metadata available on the branch, which feels like undesirable behavior]:

spark-sql (default)> describe default.t;
i                       int
spark-sql (default)> select * from default.t;
i
111

spark-sql (default)> describe default.t.branch_b1;
i                       int
spark-sql (default)> select * from default.t.branch_b1;
i
--no-data--
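The empty branch result can be understood as a projection problem: the branch's data files only contain column `s`, but the read now projects the table's current schema onto them. A minimal, illustrative Python sketch (a model of the behavior, not Iceberg internals):

```python
# Illustrative model (not Iceberg code): rows are stored with the columns
# they were written with; a read projects the reader's schema onto each
# stored row, filling missing columns with None.

def project(rows, schema):
    """Project stored rows onto the reader's schema."""
    return [{col: row.get(col) for col in schema} for row in rows]

# Data committed on branch b1 before the schema change: column `s` only.
branch_rows = [{"s": "foo"}]

# Before the PR: branch reads used the branch snapshot's schema.
before = project(branch_rows, schema=["s"])
assert before == [{"s": "foo"}]   # 'foo' is visible

# After the PR: branch reads use the table's current schema (`i` only),
# so the stored `s` values are projected away and `i` is null.
after = project(branch_rows, schema=["i"])
assert after == [{"i": None}]     # the branch appears to have no data
```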

Unit test to replicate the issue:

@Test
  public void testSchemaChange() throws Exception {
    Assume.assumeFalse("Avro does not support metadata delete", fileFormat.equals("avro"));
    createAndInitUnpartitionedTable();

    sql("INSERT INTO TABLE %s VALUES (1, 'hr'), (2, 'hardware'), (null, 'hr')", tableName);
    createBranchIfNeeded();

    String sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
    spark.sql(sql).show();
    /**
     * +----+--------+
     * |  id|     dep|
     * +----+--------+
     * |NULL|      hr|
     * |   1|      hr|
     * |   2|hardware|
     * +----+--------+
     */
    // Metadata Delete
    Table table = Spark3Util.loadIcebergTable(spark, tableName);
    table.refresh();
    table.updateSchema().deleteColumn("dep").commit();
    sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
    spark.sql(sql).show();
    /**
     * Data loss on the branch: the read consumes the table's current schema
     * +----+
     * |  id|
     * +----+
     * |NULL|
     * |   1|
     * |   2|
     * +----+
     */
    sql = String.format("SELECT * FROM %s ORDER BY id", tableName);
    spark.sql(sql).show();
    /**
     * +----+
     * |  id|
     * +----+
     * |NULL|
     * |   1|
     * |   2|
     * +----+
     */
  }

@SreeramGarlapati @jackieo168

cc: @rdblue @nastra


namrathamyske commented Feb 19, 2024

When fetching data from a branch, the schema associated with the branch should be used, not the table's current schema. But for operations like cherry-picking from a branch to main, conflicts between main and the branch should be resolved, possibly by considering the main table schema and reconciling. Let me know your thoughts @rdblue @nastra


nastra commented Mar 28, 2024

@namrathamyske I've opened #10055 to clarify which schema is being used when.

danielcweeks commented:

@namrathamyske you can force reading a branch at its current snapshot by using a time travel statement:

select * from default.t.branch_b1 for timestamp as of now();

This will use the current snapshot and is equivalent to reading from the head of the branch with the snapshot's schema. However, the branch's write schema will track the table schema.


nastra commented Mar 28, 2024


I just checked this workaround and it actually returns the latest snapshot id of the main branch rather than the latest snapshot id of b1 because there's no logic in

long snapshotId = SnapshotUtil.snapshotIdAsOfTime(sparkTable.table(), timestampMillis);
return sparkTable.copyWithSnapshotId(snapshotId);
that takes the branch into account. I've opened #10058 to address that
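The missing piece can be sketched as an ancestry walk: instead of resolving the timestamp against the table's main history, resolve it against the chain of parents starting from the branch head. A hypothetical Python model (names like `Snapshot` and `refs` are illustrative, not Iceberg's actual API):

```python
# Hypothetical branch-aware "snapshot as of time": walk the ancestry of
# the branch head rather than the table's main history.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    parent_id: Optional[int]

snapshots = {
    1: Snapshot(1, 100, None),   # shared ancestor
    2: Snapshot(2, 200, 1),      # committed on main
    3: Snapshot(3, 150, 1),      # committed on branch b1
}
refs = {"main": 2, "b1": 3}      # ref name -> head snapshot id

def snapshot_as_of_time(branch, ts_ms):
    """Latest ancestor of the branch head with timestamp <= ts_ms."""
    sid = refs[branch]
    while sid is not None:
        snap = snapshots[sid]
        if snap.timestamp_ms <= ts_ms:
            return snap.snapshot_id
        sid = snap.parent_id
    raise ValueError("no snapshot on %s as of %d" % (branch, ts_ms))

# As of t=250, main resolves to snapshot 2 while b1 resolves to snapshot 3,
# instead of both resolving through main's history.
assert snapshot_as_of_time("main", 250) == 2
assert snapshot_as_of_time("b1", 250) == 3
```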

namrathamyske commented:

Thanks @danielcweeks @nastra !

danielcweeks commented:

@namrathamyske it was pointed out to me that the workaround may not be working correctly for branches, which is something we might need to address.


namrathamyske commented Mar 28, 2024

Looks like we are disabling the workaround in #10059.
A more future-proof fix is to track the snapshot's schemaId in the SnapshotRef object. For tagBuilder we may not pass a schemaId, but for branchBuilder we can.

But there is one more issue with the above solution:

snapshot1 is created at t1 and points to schema1
the schema progresses from schema1 to schema2, but snapshot1 still points to schema1
should branch1 be created from schema1 or schema2?

Do we want to create a no-op snapshot when a branch is created on the current snapshot?
The no-op snapshot would point to the latest schema, and the branch could then be created on top of it so that it points to the latest schema.

Also suggested in #7075 (comment)
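The schemaId-on-ref proposal can be sketched as follows. This is an illustrative Python model of the idea, not Iceberg's actual SnapshotRef API; the field names and the `read_schema` helper are hypothetical:

```python
# Sketch of the proposal: a branch ref additionally records the schema id
# it was created with, while a tag ref does not.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SnapshotRef:
    snapshot_id: int
    ref_type: str                   # "branch" or "tag"
    schema_id: Optional[int] = None # only set by the branch builder

def branch_ref(snapshot_id, schema_id):
    return SnapshotRef(snapshot_id, "branch", schema_id)

def tag_ref(snapshot_id):
    return SnapshotRef(snapshot_id, "tag")

schemas = {1: ["s"], 2: ["i"]}      # schema id -> columns

def read_schema(ref, current_schema_id):
    # A branch read uses the schema pinned at branch creation;
    # everything else falls back to the table's current schema.
    if ref.ref_type == "branch" and ref.schema_id is not None:
        return schemas[ref.schema_id]
    return schemas[current_schema_id]

b1 = branch_ref(snapshot_id=3, schema_id=1)
assert read_schema(b1, current_schema_id=2) == ["s"]       # pinned schema
assert read_schema(tag_ref(3), current_schema_id=2) == ["i"]
```

The open question from the comment above maps directly onto the `schema_id` argument passed to `branch_ref`: should it be the schema of the snapshot being branched (schema1) or the table's current schema (schema2)?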


nastra commented Mar 30, 2024

Looks like we are disabling the workaround from #10059.

The reason for #10059 is that we don't support time travel on branches themselves, because there is no history tracking available on branches.

The workaround you can use is documented in #10055: fetch the latest snapshot id of the given branch and then use that snapshot id in a VERSION AS OF statement: SELECT * FROM db.table VERSION AS OF 8109744798576441359
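The two-step workaround can be modeled in Python. This is an illustrative sketch, not Iceberg's API; the `refs` mapping and helper names are hypothetical stand-ins for looking up a branch head and reading pinned to a snapshot id:

```python
# Step 1: resolve the branch's current snapshot id (equivalent to looking
# it up in the table's refs metadata).
# Step 2: read pinned to that snapshot id (equivalent to VERSION AS OF),
# which uses the snapshot's own schema.

refs = {"main": 2, "b1": 3}                           # ref -> head snapshot id
snapshot_rows = {2: [{"i": 111}], 3: [{"s": "foo"}]}  # snapshot id -> rows

def latest_snapshot_id(branch):
    return refs[branch]

def read_version_as_of(snapshot_id):
    return snapshot_rows[snapshot_id]

sid = latest_snapshot_id("b1")
assert sid == 3
assert read_version_as_of(sid) == [{"s": "foo"}]  # branch data, old schema
```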

github-actions bot commented:

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 21, 2024

github-actions bot commented Nov 5, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 5, 2024