collect file info like mtime, length while listing files for free #9466
This is a great enhancement! We should implement it for all PinotFS implementations.
pinot-spi/src/main/java/org/apache/pinot/spi/filesystem/FileInfo.java
I included S3 and Local in this PR and plan to extend the others in the next PR. Let me know if you want to have all of them here. I just wanted to iterate faster in small steps.
Codecov Report
@@ Coverage Diff @@
## master #9466 +/- ##
=============================================
+ Coverage 34.73% 68.40% +33.67%
- Complexity 194 4826 +4632
=============================================
Files 1910 1911 +1
Lines 101850 101895 +45
Branches 15452 15453 +1
=============================================
+ Hits 35379 69703 +34324
+ Misses 63483 27248 -36235
- Partials 2988 4944 +1956
LGTM
// TODO: Looks like S3PinotFS filters out directories, inconsistent with the other implementations.
// Only add files and not directories
It seems like this would break if we enable recursive file ingestion in the ingestion task config?
Can you elaborate on how it would break?
In S3 there is no folder concept under a bucket (the structure is flat), and that's probably why folders are not returned.
I think the question was about recursive listing, which actually works as expected.
The comment may be misleading. It means that the returned list of Strings contains only file paths, without subfolder paths; it does not mean that subfolders are skipped during recursive listing.
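To make the distinction concrete, here is a minimal in-memory sketch of listing over a flat key space. It is not the S3PinotFS code and does not use the AWS SDK; the class `FlatKeyListingSketch`, its `listFiles` method, and the sample keys are all hypothetical. It only illustrates that a recursive listing can reach objects under nested prefixes while never returning the prefixes themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a flat object store: "folders" exist only as key prefixes.
final class FlatKeyListingSketch {

  // Returns the object keys under the given prefix.
  // recursive = false: only keys directly under the prefix.
  // recursive = true: keys under nested prefixes as well.
  // Folder-like entries (keys ending in "/") are never returned in either mode.
  static List<String> listFiles(List<String> keys, String prefix, boolean recursive) {
    String normalized = prefix.endsWith("/") ? prefix : prefix + "/";
    List<String> result = new ArrayList<>();
    for (String key : keys) {
      if (!key.startsWith(normalized) || key.equals(normalized)) {
        continue; // outside the prefix, or the prefix marker itself
      }
      if (key.endsWith("/")) {
        continue; // zero-byte "folder" marker: only add files, not directories
      }
      String rest = key.substring(normalized.length());
      if (recursive || rest.indexOf('/') < 0) {
        result.add(key);
      }
    }
    return result;
  }
}
```

With the keys `data/a.csv`, `data/sub/` and `data/sub/b.csv`, a recursive listing of `data` returns both `.csv` keys but not `data/sub/`, which matches the behavior described in this comment.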
Yeah. Even though S3 is an object store, it does provide file-system-like APIs. Users might not know the difference, since to them these are all PinotFS implementations, so it is reasonable to expect the same behavior.
I think here we should either disallow recursive listing on S3 (explicitly throw an exception if that config is given) or do the recursion properly (I don't know how, or even whether that's possible).
Recursive listing is implemented in the visitFiles() method below.
This PR extends the PinotFS interface a bit so that one can get file info such as mtime and length for free while calling listFiles(). Otherwise one has to call length() and lastModificationTime() to get the info, with an extra round trip, which can be costly when the input folder is large.
The plan is to get quick feedback with this PR and extend the other implementations in follow-up PRs.
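As a rough illustration of the idea for a local file system, the sketch below collects path, mtime and length in the same traversal that lists the files, using only `java.nio.file`. The names `FileInfoSketch`, `LocalListingSketch` and `listFilesWithInfo` are hypothetical and are not the API added by this PR; the actual `FileInfo` class and method signatures may differ.

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

// Hypothetical value class; field names are illustrative only.
final class FileInfoSketch {
  final String path;
  final long lastModifiedMillis;
  final long length;
  final boolean directory;

  FileInfoSketch(String path, long lastModifiedMillis, long length, boolean directory) {
    this.path = path;
    this.lastModifiedMillis = lastModifiedMillis;
    this.length = length;
    this.directory = directory;
  }
}

final class LocalListingSketch {

  // Lists entries under root and records their metadata in the same pass.
  // walkFileTree hands the visitor the attributes it already read, so no
  // separate length or mtime lookup is issued per entry.
  static List<FileInfoSketch> listFilesWithInfo(Path root, boolean recursive) throws IOException {
    List<FileInfoSketch> result = new ArrayList<>();
    int maxDepth = recursive ? Integer.MAX_VALUE : 1;
    Files.walkFileTree(root, EnumSet.noneOf(FileVisitOption.class), maxDepth,
        new SimpleFileVisitor<Path>() {
          @Override
          public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
            if (!dir.equals(root)) {
              result.add(toInfo(dir, attrs)); // subdirectory entered during recursion
            }
            return FileVisitResult.CONTINUE;
          }

          @Override
          public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            // Also called for directories that sit exactly at maxDepth.
            result.add(toInfo(file, attrs));
            return FileVisitResult.CONTINUE;
          }
        });
    return result;
  }

  private static FileInfoSketch toInfo(Path path, BasicFileAttributes attrs) {
    long length = attrs.isDirectory() ? 0L : attrs.size();
    return new FileInfoSketch(path.toString(), attrs.lastModifiedTime().toMillis(),
        length, attrs.isDirectory());
  }
}
```

A caller would then read `length` and `lastModifiedMillis` directly from each returned entry instead of querying the file system again per file. For a remote store the same shape applies, provided the listing response already carries size and modification time.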