
[Managed Iceberg] unbounded source #33504

Open · wants to merge 15 commits into master

Conversation

@ahmedabu98 (Contributor) commented Jan 6, 2025:

Unbounded (streaming) source for Managed Iceberg

Fixes #33092

@ahmedabu98 ahmedabu98 marked this pull request as draft January 6, 2025 18:16
@ahmedabu98 ahmedabu98 marked this pull request as ready for review January 30, 2025 21:09
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ahmedabu98 (author) commented:

R: @kennknowles
R: @regadas

Can y'all take a look? I still have to write some tests, but it's at a good spot for a first round of reviews. I ran a bunch of pipelines (w/Legacy DataflowRunner) at different scales and the throughput/scalability looks good.

github-actions bot commented Feb 3, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@kennknowles (Member) left a comment:

Overall, I think all the pieces are in the right place. Just a question about why an SDF is the way it is and a couple code-level comments.

This seems like something you want to test a lot of different ways before it gets into a release. Maybe get another set of eyes like @chamikaramj or @Abacn too. But I'm approving and leaving to your judgment.

Review thread on sdks/java/io/iceberg/bqms/build.gradle (outdated, resolved)
.setFromSnapshotExclusive(getFromSnapshotExclusive())
.setToSnapshot(getToSnapshot())
.build();
if (getTriggeringFrequency() != null

@kennknowles (Member) commented:

I'm not too convinced this should be what controls the incremental scan source. I think it might be best if the user very explicitly says they want to read unbounded rows, versus reading the table as a bounded data set.

@ahmedabu98 (author) replied:

Outlined the design here: https://s.apache.org/beam-iceberg-incremental-source

But specifically for this comment, here's what the decision tree looks like:

  • If none of the above options [frequency, to, from] are set, use the existing (old) bounded scan source.
  • If any of these options are set, use the new incremental scan source.
    • If triggering_frequency_seconds is set, use the unbounded implementation.
    • Otherwise, use the bounded implementation.

I think an explicit "streaming=true" parameter has its merits, but it can be unnecessary, since a triggering frequency is needed anyway to determine the poll interval.
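
To make that concrete, here's a minimal sketch of the decision tree; all names here are illustrative, not the PR's actual code:

// Minimal sketch of the source-selection decision tree described above.
// Class, enum, and parameter names are illustrative stand-ins.
final class SourceSelection {
  enum Source { BOUNDED_SCAN, BOUNDED_INCREMENTAL, UNBOUNDED_INCREMENTAL }

  static Source choose(Integer triggeringFrequencySeconds, Long fromSnapshot, Long toSnapshot) {
    // Setting any incremental option opts in to the new incremental scan source.
    boolean incrementalOptionSet =
        triggeringFrequencySeconds != null || fromSnapshot != null || toSnapshot != null;
    if (!incrementalOptionSet) {
      return Source.BOUNDED_SCAN; // existing (old) bounded scan source
    }
    // The triggering frequency doubles as the poll interval, so its presence
    // is what selects the unbounded (streaming) implementation.
    return (triggeringFrequencySeconds != null)
        ? Source.UNBOUNDED_INCREMENTAL
        : Source.BOUNDED_INCREMENTAL;
  }
}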

* <p>An SDF that takes a batch of {@link ReadTask}s. For each task, reads Iceberg {@link Record}s,
* and converts to Beam {@link Row}s.
*
* <p>The split granularity is set to the incoming batch size, i.e. the number of potential splits

@kennknowles (Member) commented:


Below, you simply read the tasks one by one in a loop, so it is rather the same as if each ReadTask were an element. So at the level of Beam's semantics, it doesn't unlock anything. And we do get splitting at that level automatically when reading from shuffle. I cannot recall - will there always be a shuffle upstream of this?

I haven't written an SDF like this, so I can see how it may be necessary to express this way. But are the batches going to be large and meaningful or are they also just arbitrary small sets of read tasks?

@ahmedabu98 (author) replied:

Mentioning this in https://s.apache.org/beam-iceberg-incremental-source

> I cannot recall - will there always be a shuffle upstream of this?

Yes, currently there is a GroupIntoBatches step which contains a shuffle.

I found that a Reshuffle/Redistribute into processing individual ReadTasks almost always leads to OOMs (and a stuck pipeline) for any reasonably large read, because the worker tries to buffer all files concurrently. We could suggest starting with a large number of workers, but I think that's not a great user experience; users would have to experiment a bit to figure that out.

Grouping into batches is much easier to work with because we're not buffering all files at once. For Dataflow, GiB will also send signals to increase key fanout based on backlog, which helps with dynamic worker autoscaling.
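
For illustration, here's a minimal self-contained sketch of that pattern using Beam's GroupIntoBatches; the element types, keys, and batch size are stand-ins, not the PR's actual tuning:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class BatchReadTasksExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-ins for keyed ReadTasks; the real source emits descriptors of
    // Iceberg data files to read, keyed for grouping.
    PCollection<KV<Integer, String>> tasks =
        p.apply(Create.of(KV.of(0, "data-file-a"), KV.of(0, "data-file-b")));

    // GroupIntoBatches bounds how many tasks a worker buffers at once, so the
    // pipeline never holds every file concurrently; on Dataflow it can also
    // signal key-fanout increases based on backlog.
    PCollection<KV<Integer, Iterable<String>>> batches =
        tasks.apply(GroupIntoBatches.<Integer, String>ofSize(100));

    p.run().waitUntilFinish();
  }
}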

@ahmedabu98 (author) added:

> are the batches going to be large and meaningful or are they also just arbitrary small sets of read tasks?

This depends on the amount of appended data within a given snapshot range. In most cases, we can expect a large set of read tasks at the beginning, as the pipeline reads everything already in the table.
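
For reference, a bare-bones sketch of the SDF shape the javadoc above describes: one element is a batch of tasks, and the restriction is an offset range over positions in the batch, so a runner can split between tasks. All names are illustrative, not the PR's actual code.

import java.util.List;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Illustrative sketch: the element is a batch of read tasks, and the
// restriction is an OffsetRange over the batch, so split granularity
// equals the batch size (one potential split per task).
class ReadTaskBatchFn extends DoFn<List<String>, String> {
  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element List<String> batch) {
    return new OffsetRange(0, batch.size());
  }

  @ProcessElement
  public void process(
      @Element List<String> batch,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out) {
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); i++) {
      // The real source would open the Iceberg data file for task i and emit
      // converted Beam Rows; here we just pass the task name through.
      out.output(batch.get((int) i));
    }
  }
}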

@kennknowles (Member) left a comment:

Wait, actually, I forgot that I want to have the discussion about the high-level toggle between the incremental scan source and the bounded source.

Successfully merging this pull request may close these issues.

[Feature Request]: {Managed IO Iceberg} - Allow users to run streaming reads