ORC-262: [C++] Support async I/O prefetch #2048

Closed
wants to merge 48 commits into from

Conversation

taiyang-li
Contributor

@taiyang-li taiyang-li commented Oct 9, 2024

What changes were proposed in this pull request?

Support async I/O prefetch in the ORC C++ library. Closes https://issues.apache.org/jira/browse/ORC-262

Changes:

  • Added a new interface, InputStream::readAsync (unimplemented by default), which reads the specified range asynchronously.
  • Added an I/O cache implementation, ReadRangeCache, to cache async I/O results. The design is borrowed from the Parquet reader in https://github.com/apache/arrow
  • Added the interface Reader::preBuffer to trigger I/O prefetch. ReaderImpl::preBuffer computes the I/O ranges for the selected stripes and columns, merges and sorts these ranges, and then calls ReadRangeCache::cache to start the asynchronous I/O in the background, so the data is waiting when the upper layer needs it.
  • Added the interface Reader::releaseBuffer, which releases all cached I/O ranges before a given offset.
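The range merging step described in the list above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: the `ReadRange` struct, the `coalesceRanges` function, and the `holeSizeLimit` and `rangeSizeLimit` parameters are hypothetical names chosen for illustration. The idea is to sort the byte ranges by offset and merge neighbours that are separated by a small gap, so that fewer and larger reads are issued.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for this sketch; not the ORC library's own definition.
struct ReadRange {
  uint64_t offset;
  uint64_t length;
};

// Sort ranges by offset, then merge a range into its predecessor when the gap
// between them is at most holeSizeLimit and the merged range does not exceed
// rangeSizeLimit. Ranges that cannot be merged are kept as they are.
std::vector<ReadRange> coalesceRanges(std::vector<ReadRange> ranges,
                                      uint64_t holeSizeLimit,
                                      uint64_t rangeSizeLimit) {
  assert(rangeSizeLimit > 0);

  // Empty ranges carry no data, so drop them up front.
  ranges.erase(std::remove_if(ranges.begin(), ranges.end(),
                              [](const ReadRange& r) { return r.length == 0; }),
               ranges.end());

  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });

  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const uint64_t lastEnd = last.offset + last.length;
      const uint64_t newEnd = std::max(lastEnd, r.offset + r.length);
      const bool closeEnough = r.offset <= lastEnd + holeSizeLimit;
      if (closeEnough && newEnd - last.offset <= rangeSizeLimit) {
        last.length = newEnd - last.offset;  // extend the previous range
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

Each merged range would then be handed to an asynchronous read. Reading the small holes between ranges wastes a few bytes but saves round trips, which usually matters more on remote storage.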

Why are the changes needed?

Async I/O prefetch can hide I/O latency when reading ORC files, which improves the performance of scan operators in ClickHouse.

How was this patch tested?

All CI checks pass.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CPP label Oct 9, 2024
@taiyang-li taiyang-li changed the title from "Support async io prefetch for orc c++ lib" to "ORC-262: [C++] Support async io prefetch for orc c++ lib" Oct 9, 2024
@ffacs
Contributor

ffacs commented Oct 11, 2024

Reader::preBuffer prefetches whole stripes as a unit, which might be too large. Users who don't want to prefetch the entire file in one shot have to know the structure of the file. Do you think it would be a good idea to make prefetch transparent to users and let the ORC reader prefetch data (e.g. 1 MB per column at a time) when appropriate?
What's more, we could make async I/O an option and expose a cache interface so that users can implement their own eviction policy.

@taiyang-li
Contributor Author

taiyang-li commented Oct 11, 2024

Reader::preBuffer prefetches whole stripes as a unit, which might be too large. Users who don't want to prefetch the entire file in one shot have to know the structure of the file. Do you think it would be a good idea to make prefetch transparent to users and let the ORC reader prefetch data (e.g. 1 MB per column at a time) when appropriate? What's more, we could make async I/O an option and expose a cache interface so that users can implement their own eviction policy.

It is entirely up to users whether to prefetch the whole ORC file, one or more columns in a single stripe, or a single column across one or more stripes. Reader::preBuffer already supports all of these options.

It is better to let users invoke Reader::preBuffer explicitly, because only the user knows which stripes and columns will be read. That way they can find the best opportunity to prefetch and hide I/O latency sufficiently. For example, the ORC prefetch implementation in ClickHouse relies on this PR: ClickHouse/ClickHouse#70534 (1.47x speedup). Besides, the Parquet reader in Apache Arrow has a similar design.
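To make the caller-driven pattern discussed here concrete, below is a minimal sketch of a prefetch cache built on `std::async`. It is an illustration only, not the `ReadRangeCache` from this PR: the `PrefetchCache` class, its `prefetch`, `read`, and `releaseBefore` methods, and the `ReadFn` callback are all hypothetical names. The caller starts background reads for the ranges it knows it will need, consumes them later, and releases everything before an offset once it has moved past it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <map>
#include <utility>
#include <vector>

using Buffer = std::vector<char>;
// Hypothetical blocking read callback: (offset, length) -> bytes.
using ReadFn = std::function<Buffer(uint64_t, uint64_t)>;

// Illustrative cache keyed by range offset; not thread-safe by itself.
class PrefetchCache {
 public:
  explicit PrefetchCache(ReadFn read) : read_(std::move(read)) {}

  // Start a background read for [offset, offset + length).
  void prefetch(uint64_t offset, uint64_t length) {
    assert(length > 0);
    Entry entry;
    entry.length = length;
    entry.data = std::async(std::launch::async, read_, offset, length).share();
    entries_[offset] = std::move(entry);
  }

  // Copy the bytes of an exactly matching cached range into out, blocking
  // until the background read has finished. Returns false on a cache miss.
  bool read(uint64_t offset, uint64_t length, Buffer& out) {
    auto it = entries_.find(offset);
    if (it == entries_.end() || it->second.length != length) {
      return false;
    }
    out = it->second.data.get();
    return true;
  }

  // Drop every cached range that ends at or before boundary.
  void releaseBefore(uint64_t boundary) {
    for (auto it = entries_.begin(); it != entries_.end();) {
      if (it->first + it->second.length <= boundary) {
        it = entries_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t size() const { return entries_.size(); }

 private:
  struct Entry {
    uint64_t length = 0;
    std::shared_future<Buffer> data;
  };
  ReadFn read_;
  std::map<uint64_t, Entry> entries_;
};
```

A scan operator could, for example, call `prefetch` for the ranges of stripe N+1 while decoding stripe N, and call `releaseBefore` with the end offset of stripe N once it is done with it. On some toolchains the program must be linked with thread support (for example `-pthread`) for `std::async` to work.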

@taiyang-li
Contributor Author

@ffacs @wgtmac any more comments? Thanks!

@wgtmac
Member

wgtmac commented Oct 16, 2024

Sorry that I'm a little bit overwhelmed these days. Will take a look when I get the chance.

BTW, @luffy-zh is working on exposing RowIndex positions: #2005. Perhaps there is an opportunity to go further and combine I/O prefetch with predicate pushdown.

@taiyang-li
Contributor Author

@wgtmac That's great work. We could further improve I/O latency hiding after it is merged.

Member

@wgtmac wgtmac left a comment

I have just finished the initial review. Thanks @taiyang-li! Please see my inline comments. My main concern is usability: the user is required to call preBuffer explicitly instead of the reader automatically prefetching the required data.

@taiyang-li taiyang-li marked this pull request as ready for review November 27, 2024 07:11
Member

@wgtmac wgtmac left a comment

Thanks for the changes! It generally looks good. I think the test cases need to be polished.

@taiyang-li
Contributor Author

Thanks for the changes! It generally looks good. I think the test cases need to be polished.

Thanks for the detailed reviews. I have improved the test cases.

Member

@wgtmac wgtmac left a comment

LGTM. Thanks!

@wgtmac wgtmac changed the title from "ORC-262: [C++] Support async io prefetch for orc c++ lib" to "ORC-262: [C++] Support async I/O prefetch" Nov 29, 2024
Contributor

@ffacs ffacs left a comment

LGTM. Thank you @wgtmac @taiyang-li ~

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

cc @williamhyun, too

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 2, 2024

Merged to main for Apache ORC 2.1.0 in January 2025.

I added you to the Apache ORC contributor group and assigned ORC-262 to you, @taiyang-li.

Also, I updated the ORC-1767 JIRA issue by assigning it to you.

Thank you and welcome to the Apache ORC community again!

@taiyang-li
Contributor Author

Thank you very much @dongjoon-hyun, I am very happy to join the Apache ORC Contributor Group. Apache Gluten relies heavily on this library, and I'm looking forward to contributing more in the future.
