
[Managed Iceberg] unbounded source #33504

Open · wants to merge 15 commits into master

Conversation

@ahmedabu98 (Contributor) commented Jan 6, 2025:

Unbounded (streaming) source for Managed Iceberg

Fixes #33092

@ahmedabu98 ahmedabu98 marked this pull request as draft January 6, 2025 18:16
@ahmedabu98 ahmedabu98 marked this pull request as ready for review January 30, 2025 21:09
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ahmedabu98 (author) commented:

R: @kennknowles
R: @regadas

Can y'all take a look? I still have to write some tests, but it's at a good spot for a first round of reviews. I ran a bunch of pipelines (w/Legacy DataflowRunner) at different scales and the throughput/scalability looks good.

github-actions bot commented Feb 3, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@kennknowles (Member) left a comment:

Overall, I think all the pieces are in the right place. Just a question about why an SDF is the way it is and a couple code-level comments.

This seems like something you want to test a lot of different ways before it gets into a release. Maybe get another set of eyes like @chamikaramj or @Abacn too. But I'm approving and leaving to your judgment.

Review thread on sdks/java/io/iceberg/bqms/build.gradle (outdated, resolved)
.setFromSnapshotExclusive(getFromSnapshotExclusive())
.setToSnapshot(getToSnapshot())
.build();
if (getTriggeringFrequency() != null

@kennknowles (Member) commented:

I'm not too convinced this should be what controls the incremental scan source. I think it might be best if the user very explicitly says they want to read unbounded rows, versus reading the table as a bounded data set.

@ahmedabu98 (author) replied:

Outlined the design here: https://s.apache.org/beam-iceberg-incremental-source

But specifically for this comment, here's what the decision tree looks like:

  • If none of the above options [frequency, to, from] are set, use the existing (old) bounded scan source.
  • If any of these options are set, use the new incremental scan source.
    • If triggering_frequency_seconds is set, use the unbounded implementation.
    • Otherwise, use the bounded implementation.

I think an explicit "streaming=true" parameter has its merits, but it can be unnecessary, since a triggering frequency is needed anyway to determine the poll interval.
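
To make that concrete, here's a minimal sketch of the decision tree; all names here are illustrative, not the PR's actual code:

// Minimal sketch of the source-selection decision tree described above.
// Class, enum, and parameter names are illustrative stand-ins.
final class SourceSelection {
  enum Source { BOUNDED_SCAN, BOUNDED_INCREMENTAL, UNBOUNDED_INCREMENTAL }

  static Source choose(Integer triggeringFrequencySeconds, Long fromSnapshot, Long toSnapshot) {
    // Setting any incremental option opts in to the new incremental scan source.
    boolean incrementalOptionSet =
        triggeringFrequencySeconds != null || fromSnapshot != null || toSnapshot != null;
    if (!incrementalOptionSet) {
      return Source.BOUNDED_SCAN; // existing (old) bounded scan source
    }
    // The triggering frequency doubles as the poll interval, so its presence
    // is what selects the unbounded (streaming) implementation.
    return (triggeringFrequencySeconds != null)
        ? Source.UNBOUNDED_INCREMENTAL
        : Source.BOUNDED_INCREMENTAL;
  }
}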

* <p>An SDF that takes a batch of {@link ReadTask}s. For each task, reads Iceberg {@link Record}s,
* and converts to Beam {@link Row}s.
*
* <p>The split granularity is set to the incoming batch size, i.e. the number of potential splits

@kennknowles (Member) commented:


Below, you simply read the tasks one by one in a loop, so it is rather the same as if each ReadTask were an element. So at the level of Beam's semantics, it doesn't unlock anything. And we do get splitting at that level automatically when reading from shuffle. I cannot recall - will there always be a shuffle upstream of this?

I haven't written an SDF like this, so I can see how it may be necessary to express this way. But are the batches going to be large and meaningful or are they also just arbitrary small sets of read tasks?

@ahmedabu98 (author) replied:

Mentioning this in https://s.apache.org/beam-iceberg-incremental-source

> I cannot recall - will there always be a shuffle upstream of this?

Yes, currently there is a GroupIntoBatches step which contains a shuffle.

I found that a Reshuffle/Redistribute into processing individual ReadTasks almost always leads to OOMs (and a stuck pipeline) for any reasonably large read, because the worker tries to buffer all files concurrently. We could suggest starting with a large number of workers, but I think that's not a great user experience; users would have to experiment a bit to figure that out.

Grouping into batches is much easier to work with because we're not buffering all files at once. For Dataflow, GiB will also send signals to increase key fanout based on backlog, which helps with dynamic worker autoscaling.
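
For illustration, here's a minimal self-contained sketch of that pattern using Beam's GroupIntoBatches; the element types, keys, and batch size are stand-ins, not the PR's actual tuning:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class BatchReadTasksExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-ins for keyed ReadTasks; the real source emits descriptors of
    // Iceberg data files to read, keyed for grouping.
    PCollection<KV<Integer, String>> tasks =
        p.apply(Create.of(KV.of(0, "data-file-a"), KV.of(0, "data-file-b")));

    // GroupIntoBatches bounds how many tasks a worker buffers at once, so the
    // pipeline never holds every file concurrently; on Dataflow it can also
    // signal key-fanout increases based on backlog.
    PCollection<KV<Integer, Iterable<String>>> batches =
        tasks.apply(GroupIntoBatches.<Integer, String>ofSize(100));

    p.run().waitUntilFinish();
  }
}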

@ahmedabu98 (author) added:

> are the batches going to be large and meaningful or are they also just arbitrary small sets of read tasks?

This depends on the amount of appended data within a given snapshot range. In most cases, we can expect a large set of read tasks at the beginning, as the pipeline reads everything already in the table.
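
For reference, a bare-bones sketch of the SDF shape the javadoc above describes: one element is a batch of tasks, and the restriction is an offset range over positions in the batch, so a runner can split between tasks. All names are illustrative, not the PR's actual code.

import java.util.List;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Illustrative sketch: the element is a batch of read tasks, and the
// restriction is an OffsetRange over the batch, so split granularity
// equals the batch size (one potential split per task).
class ReadTaskBatchFn extends DoFn<List<String>, String> {
  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element List<String> batch) {
    return new OffsetRange(0, batch.size());
  }

  @ProcessElement
  public void process(
      @Element List<String> batch,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out) {
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); i++) {
      // The real source would open the Iceberg data file for task i and emit
      // converted Beam Rows; here we just pass the task name through.
      out.output(batch.get((int) i));
    }
  }
}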

@kennknowles (Member) left a comment:

Wait, actually, I forgot that I want to have the discussion about the high-level toggle between the incremental scan source and the bounded source.

Successfully merging this pull request may close these issues.

[Feature Request]: {Managed IO Iceberg} - Allow users to run streaming reads