ORC-262: [C++] Support async I/O prefetch #2048
It is entirely up to the user whether to prefetch the whole ORC file, one or more columns in a single stripe, or a single column in one or more stripes. It is better to let the user invoke
@wgtmac That's great work. We could make further improvements to I/O latency hiding after it is merged.
I have just finished the initial review. Thanks @taiyang-li! Please see my inline comments. My main concern is usability: it requires the user to call preBuffer instead of automatically prefetching the required data.
Co-authored-by: Gang Wu <[email protected]>
Thanks for the changes! It generally looks good. I think the test cases need to be polished.
Co-authored-by: Gang Wu <[email protected]>
7941f29 to 0d56a7e
Thanks for the detailed reviews. I have improved the test cases.
LGTM. Thanks!
LGTM. Thank you @wgtmac @taiyang-li ~
+1, LGTM.
cc @williamhyun , too
Merged to main for Apache ORC 2.1.0 in January 2025. I added you to the Apache ORC contributor group and assigned ORC-262 to you, @taiyang-li. Also, I updated the ORC-1767 JIRA issue by assigning it to you. Thank you and welcome to the Apache ORC community again!
Thank you very much @dongjoon-hyun, I am very happy to join the Apache ORC Contributor Group. Apache Gluten relies heavily on this library, and I'm looking forward to contributing more in the future.
What changes were proposed in this pull request?
Support async I/O prefetch for the ORC C++ library. Closes https://issues.apache.org/jira/browse/ORC-262
Changes:
- InputStream::readAsync (default unimplemented). It reads I/O asynchronously within the specified range.
- ReadRangeCache to cache async I/O results. This borrows from a similar design in the Parquet reader in https://github.com/apache/arrow
- Reader::preBuffer to trigger I/O prefetch. In the specific implementation of ReaderImpl::preBuffer, the I/O ranges are calculated according to the selected stripes and columns, then these ranges are merged and sorted, and ReadRangeCache::cache is called to trigger the asynchronous I/O in the background, where the results wait to be used by the upper layer.
- Reader::releaseBuffer, which is used to release all cached I/O ranges before an offset.
Why are the changes needed?
Async I/O prefetch can hide I/O latency while reading ORC files, which improves the performance of scan operators in ClickHouse.
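To make the range handling described under Changes more concrete, here is a minimal sketch of sorting and merging I/O ranges and of releasing cached ranges before an offset. It is not the code from this PR: the ReadRange struct, the coalesce and releaseBefore functions, and the holeSizeLimit parameter are hypothetical names chosen for illustration, the asynchronous read itself is omitted, and "before an offset" is assumed to mean ranges that end at or before that offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration; not the ORC library's definition.
struct ReadRange {
  uint64_t offset;
  uint64_t length;
};

// Sort ranges by offset, then merge neighbours whose gap is at most
// holeSizeLimit bytes, so that fewer and larger reads are issued.
std::vector<ReadRange> coalesce(std::vector<ReadRange> ranges,
                                uint64_t holeSizeLimit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (r.length == 0) continue;  // nothing to read
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const uint64_t lastEnd = last.offset + last.length;
      if (r.offset <= lastEnd + holeSizeLimit) {
        // Overlapping or close enough: extend the previous range.
        const uint64_t end = std::max(lastEnd, r.offset + r.length);
        last.length = end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}

// Drop every range that ends at or before boundary (assumed semantics).
void releaseBefore(std::vector<ReadRange>& cached, uint64_t boundary) {
  cached.erase(std::remove_if(cached.begin(), cached.end(),
                              [boundary](const ReadRange& r) {
                                return r.offset + r.length <= boundary;
                              }),
               cached.end());
}
```

For example, coalescing the ranges {100, 50}, {0, 10}, {12, 8}, {150, 10} with a hole limit of 4 bytes yields two ranges, {0, 20} and {100, 60}; calling releaseBefore with a boundary of 20 then leaves only the second. A larger hole limit trades some wasted bytes for fewer read requests, which is usually the better trade on high-latency storage.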
How was this patch tested?
Pass the CIs.
Was this patch authored or co-authored using generative AI tooling?
No.