-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[native] Add support for ORC reader #23037
base: master
Are you sure you want to change the base?
Conversation
55a8d5b
to
7325337
Compare
Hi @majetideepak @aditi-pandit could you please help review this PR? Thanks! |
@wypb can you add some end-to-end tests? Thanks! |
@wypb : Would be great to use ORC with the QueryRunners (https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/PrestoNativeQueryRunnerUtils.java) in an e2e test. The test should highlight differences of ORC wrt Parquet, demonstrate filter pushdown as well. Using ORC with Hive and as a format with Iceberg is perfect. |
Hi @majetideepak @aditi-pandit I added TPCH tests for ORC, including the Iceberg data source. The TPCDS test for ORC is not added because some types of Velox's ORC reader currently do not implement fast path, which will cause exceptions when reading data.
|
@wypb : Your code looks fine. When I search for ORC in the presto-native-execution directory I also see the following usage. https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestWriter.java#L71 needs a fix as well Please can you check about it. |
0d3570c
to
9615017
Compare
Good catch, thank you @aditi-pandit I've fixed it. |
@aditi-pandit I looked at the code again and found that this should not be removed. |
@@ -73,6 +73,7 @@ private static void createTpcdsCallCenter(QueryRunner queryRunner, Session sessi | |||
if (!queryRunner.tableExists(session, "call_center")) { | |||
switch (storageFormat) { | |||
case "PARQUET": | |||
case "ORC": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per https://orc.apache.org/docs/types.html ORC supports DATE type. The DWRF reader doesn't support DATE as a first-class and so we coerced all those columns to VARCHAR in tests. Do you have a plan for those ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DWRF does not support the DATE type, but Velox queries ORC's DATE type using SelectiveIntegerDirectColumnReader. My test shows that the DATE type data can be read correctly. So I don't think it is necessary to convert the DATE type to VARCHAR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recently added a parameter to createAllTables to not do the DATE -> VARCHAR casting for TPCH tables https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/NativeQueryRunnerUtils.java#L69. You can use it in your tests.
@wypb : Had a question about this point you raised... You are saying that HiveQueryRunner can't read TPC-DS tables, but handles. That seems odd. Did you look deeper into what TPC-H is doing different ? The main difference is that in TPC-H all date columns were exposed as VARCHAR. But wonder if there is anything else ? Would be great to see which particular column here is problematic. |
@wypb is this PR still being worked on? |
Hi @tdcmeehan sorry for the late reply. Yes, I'm still keeping an eye on this. I've been working on a few Velox PRs lately, so I haven't had time to work on this yet. I'll update this PR later this week. |
Let's add ORC as a supported file format in Supported Use Cases (we can also mention that Parquet is a supported format). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wypb. Have a question:
As per https://orc.apache.org/docs/types.html ORC supports DATE type. The DWRF reader doesn't support DATE as a first-class and so we coerced all those columns to VARCHAR in tests. Do you have a plan for those ?
@Override | ||
protected ExpectedQueryRunner createExpectedQueryRunner() throws Exception | ||
{ | ||
this.storageFormat = "ORC"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a member variable and the same value used in all methods, you can initialize it at the class level outside the methods and use this.storageFormat each place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already refactored, thank you.
Should this be mentioned in the doc, maybe in Supported Use Cases or Presto C++ Features? |
Hi @tdcmeehan, @aditi-pandit sorry for the late reply.
I was also curious about this question before, but I didn't check the reason. Today I checked why most of the TPCDS queries failed, while all the TPCH queries passed. I debugged the code and found that the integer fields of the TPCDS table (such as the For related code, see
For the TPCH table,
If we modify the implementation of
|
DWRF does not support the DATE type, but Velox queries ORC's DATE type using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local docs build, looks good. Thanks!
@aditi-pandit I added the orc tpcds test. The fast path problem I encountered before has been solved in facebookincubator/velox#10939. However, I have encountered a new problem now. When reading ORC Decimal with a decimal filter, the
|
@aditi-pandit I wanted to solve the error reported above today, and then I found that facebookincubator/velox#11067 has already solved it. If I want to add all the tests of TPCDS, I have to wait until that PR is solved. Or should we not add the tpcds test first? Do you have any suggestions? |
@wypb : Yes, facebookincubator/velox#11067 should fix your issue. Lets add the test to this PR so that its complete. Do we know how many queries fail ? We can disable them until the Velox PR is checked in which should be quite soon. |
@aditi-pandit 12 out of 102 SQL statements failed due to the above problems. I added the other SQL statements that could run successfully to the test. |
|
0d7b926
to
18c6209
Compare
@wypb : Thanks for iterating so quickly. Your PR is looking good. Please can you merge your commits into a single one. We can then submit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ORC Reader support and TPCDS iceberg query updates should belong to separate commits/PRs.
|
The motivation to keep TPCDS test updates minimal is that if due to any reason we need to revert a test, the ORC Reader support will not be impacted. |
Also, Native tables seem to be using the same test for Parquet and ORC. Why not do the same for Iceberg? |
@majetideepak : Didn't follow the question entirely... Can you give more specifics ? |
@majetideepak : Am a bit conflicted on this. The ORC Reader commit should submit the functional tests that establish the feature I feel. This work takes care of that. If there are specific tests that show issues (like the current Decimal one), then we should disable and fix them individually. |
Hi @majetideepak deepak, Do you mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wypb, @aditi-pandit I now see that doDeletes got refactored and is not new.
Makes sense to cover all the existing testing for ORC.
I just have one comment.
testTpcdsQ18(); | ||
testTpcdsQ19(); | ||
testTpcdsQ20(); | ||
// testTpcdsQ21(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to comment here given we have a check inside AbstractTestNativeTpcdsQueries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Nice catch. Yeah these can be avoided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wypb : Seems like there was a misunderstanding about this review comment.
Since testTpcdsQ21 has the following condition
if (!storageFormat.equals("ORC")) {
assertQuery(session, getTpcdsQuery("33"));
}
then it should pass in runAllQueries. We don't need to comment it in this function. We can uncomment the test call at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I will refactor the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak @aditi-pandit I have moved runAllQueries()
from TestPrestoNativeIcebergTpcdsQueriesOrcUsingThrift.java
and TestPrestoNativeIcebergTpcdsQueriesParquetUsingThrift.java
to AbstractTestNativeTpcdsQueries.java
.
684cdfc
to
99f7c7b
Compare
@wypb : Thanks for your quick turnaround. Seems like there was a misunderstanding about Deepak's comment. Please fix it. Else this PR is looking good for approval. |
Description
We have recently merged the PR for reading ORC statistics and implementing OrcReader based on DwrfReader on the velox side. Now it is time to add support for ORC reader it in Prestissimo.