feat: add compressed file support for ORCRecordReader #9884
Can we add a test case for this? You can refer to `ORCRecordReaderTest`. We can probably add `data.orc.gz` to the resources (we already have `data.orc`). Also, are we going to cover `ParquetAvroRecordReader` in this PR?
@@ -106,6 +110,19 @@ public void init(File dataFile, @Nullable Set<String> fieldsToRead, @Nullable Re
    _nextRowId = 0;
  }

  private File unzipIfRequired(File dataFile) throws IOException {
    if (dataFile.getName().endsWith(".gz")) {
I think the better approach is to add a helper function in `RecordReaderUtils` that identifies whether the file is gzipped, instead of depending on the extension. I have seen cases where people deal with gzipped files whose names do not end with `.gz`.
There are multiple approaches: https://stackoverflow.com/questions/30507653/how-to-check-whether-file-is-gzip-or-not-in-java
I think we can try to open the file with `GZIPInputStream` and check for the exception.
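For illustration, the "open with `GZIPInputStream` and check for the exception" idea could look roughly like the sketch below. The class name `GzipDetection` and the method name `isGzipped` are hypothetical and are not taken from this PR or from `RecordReaderUtils`; only standard `java.util.zip` and `java.io` calls are used.

```java
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipException;

// Hypothetical helper, not the actual Pinot implementation.
final class GzipDetection {
  private GzipDetection() {
  }

  /**
   * Returns true if the file starts with a valid gzip header, regardless of its name.
   * The GZIPInputStream constructor reads the header eagerly: it throws ZipException
   * when the magic bytes do not match and EOFException when the file is too short.
   */
  static boolean isGzipped(File file) throws IOException {
    try (InputStream fileIn = new FileInputStream(file);
        InputStream gzipIn = new GZIPInputStream(fileIn)) {
      return true;
    } catch (ZipException | EOFException e) {
      // Not a gzip stream (or an empty/truncated file); other I/O errors still propagate.
      return false;
    }
  }
}
```

Declaring both streams in the try-with-resources header ensures the underlying `FileInputStream` is closed even when the `GZIPInputStream` constructor fails.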
Hi @snleee, while implementing the solution I was a bit concerned about the code duplication, but was not sure if extracting
Regards, Eugene
Can you set up the Pinot Code Style and apply it to pass the lint test?
Please refer to the following:
https://docs.pinot.apache.org/developers/developers-and-contributors/code-setup
@@ -106,6 +111,18 @@ public void init(File dataFile, @Nullable Set<String> fieldsToRead, @Nullable Re
    _nextRowId = 0;
  }

  private File unpackIfRequired(File dataFile) throws IOException {
Let's move this to the `RecordReaderUtils` class.
Done!
I decided to add `extension` as an explicit function parameter since, as you mentioned above, the archive will not necessarily have a `.gz` suffix.
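A rough sketch of what such a helper might look like is given below. The signature, the class name `UnpackSketch`, and the reading of `extension` as the suffix for the unpacked temporary file are all assumptions for illustration; the actual method in `RecordReaderUtils` may differ. Detection here uses the two gzip magic bytes (`0x1f 0x8b`) rather than the file name.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch, not the code merged in this PR.
final class UnpackSketch {
  private UnpackSketch() {
  }

  /**
   * Returns the original file if it is not gzipped; otherwise decompresses it into
   * a temporary file that ends with the given extension (e.g. "orc") and returns that.
   */
  static File unpackIfRequired(File dataFile, String extension) throws IOException {
    boolean gzipped;
    try (InputStream in = new FileInputStream(dataFile)) {
      // gzip streams always start with the bytes 0x1f 0x8b; read() returns -1 at EOF.
      gzipped = in.read() == 0x1f && in.read() == 0x8b;
    }
    if (!gzipped) {
      return dataFile;
    }
    File target = File.createTempFile("unpacked-", "." + extension);
    target.deleteOnExit();
    try (InputStream in = new GZIPInputStream(new FileInputStream(dataFile))) {
      // createTempFile already created the file, so it has to be replaced.
      Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
    return target;
  }
}
```

With a helper of this shape, a reader's `init` method could call something like `unpackIfRequired(dataFile, "orc")` before opening the file, so both plain and gzipped inputs follow the same path afterwards.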
Let's try to run
LGTM otherwise. I triggered the test again. Let's see how this goes :)
@@ -17,9 +17,15 @@
 * specific language governing permissions and limitations
 * under the License.
 */

(nit) remove line
Codecov Report
@@ Coverage Diff @@
## master #9884 +/- ##
============================================
+ Coverage 68.65% 70.40% +1.75%
- Complexity 5049 5060 +11
============================================
Files 1973 1982 +9
Lines 106008 106490 +482
Branches 16060 16140 +80
============================================
+ Hits 72775 74970 +2195
+ Misses 28110 26279 -1831
- Partials 5123 5241 +118
This is WIP on #9847
The aim of this change is to support gz-compressed files for `ORCRecordReader` and `ParquetAvroRecordReader`, in order to achieve feature parity with other formats such as CSV and JSON.