-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Stream Executor with Emit on Window Close Semantics #51
Conversation
Co-authored-by: Eric Fu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter half of this document does not conform to Markdown format. (It requires 2 line breaks to open a new paragraph)
|
||
### GroupAgg | ||
|
||
If there is watermark in group key, the EMIT ON WINDOW CLOSE GroupAgg calculate the aggregation with the same logic with streamGroupAgg, But it only emit the complete part of the agg result. A simple idea is add a sortExecutor after a normal streamGroupAgg and we can find that the streamGroupAgg's result table and sortExecutor's stateTable exactly have the same schema and meaning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be unacceptable to me to use 2 caches in memory, which effectively means the memory is cut into 1/2, as well as the 2x IO cost of cache miss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the SortBufferCache can be omitted here. It doesn't worth to maintain a full row copy in memory just for emitting when a Window
event is received.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what you say makes sense here, because every record will just be read one time and then will be deleted soon. The cache is meaningless here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
It likes window join in Flink. | ||
When there are watermarks in the first join key of the both two sides. We can use two SortBuffers(table and cache) to implement a sortMergeJoin. SortBuffer’s `peek` method is used here. | ||
|
||
### IntervalJoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #32 (comment)
In short, Flink puts join key before time column, rather than 2 copies of data. I think we may start with it first so that hopefully it can reuse the source code of the normal version of interval join operator.
With emit on close semantics, it seems we can implement event time temporal join as well, because this semantics can ensure all the data we need are complete. |
It likes window join in Flink. | ||
When there are watermarks in the first join key of the both two sides. We can use two SortBuffers(table and cache) to implement a sortMergeJoin. SortBuffer’s `peek` method is used here. | ||
|
||
### IntervalJoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #32 (comment)
In short, Flink puts join key before time column, rather than 2 copies of data. I think we may start with it first so that hopefully it can reuse the source code of the normal version of interval join operator.
No description provided.