Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collect file info like mtime, length while listing files for free #9466

Merged
merged 6 commits into from
Sep 27, 2022

Conversation

klsince
Copy link
Contributor

@klsince klsince commented Sep 26, 2022

This PR extends the PinotFS interface a bit so that one can get file info like mtime and length for free while calling listFiles(), otherwise one has to call length(), lastModificationTime() to get the info with an extra round trip, which can be costly when the input folder is large.

Plan to get quick feedbacks with this PR, and extend the other implementations in following PRs.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great enhancement! We should implement it for all PinotFS

@klsince
Copy link
Contributor Author

klsince commented Sep 26, 2022

... we should implement it for all PinotFS

Included S3 and Local in this PR, and plan to extend the others in next PR. Lemme know if you want to have all of them here. I just wanted to iterate in small steps faster.

@codecov-commenter
Copy link

codecov-commenter commented Sep 27, 2022

Codecov Report

Merging #9466 (599e109) into master (3818397) will increase coverage by 33.67%.
The diff coverage is 90.74%.

@@              Coverage Diff              @@
##             master    #9466       +/-   ##
=============================================
+ Coverage     34.73%   68.40%   +33.67%     
- Complexity      194     4826     +4632     
=============================================
  Files          1910     1911        +1     
  Lines        101850   101895       +45     
  Branches      15452    15453        +1     
=============================================
+ Hits          35379    69703    +34324     
+ Misses        63483    27248    -36235     
- Partials       2988     4944     +1956     
Flag Coverage Δ
integration2 24.75% <0.00%> (+0.02%) ⬆️
unittests1 67.10% <89.65%> (?)
unittests2 15.51% <42.59%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../java/org/apache/pinot/spi/filesystem/PinotFS.java 50.00% <0.00%> (+50.00%) ⬆️
.../org/apache/pinot/spi/filesystem/FileMetadata.java 90.47% <90.47%> (ø)
.../org/apache/pinot/plugin/filesystem/S3PinotFS.java 41.69% <92.00%> (+2.98%) ⬆️
.../org/apache/pinot/spi/filesystem/LocalPinotFS.java 82.25% <100.00%> (+82.25%) ⬆️
...roller/helix/core/relocation/SegmentRelocator.java 13.51% <0.00%> (-21.63%) ⬇️
...mmon/assignment/InstanceAssignmentConfigUtils.java 7.31% <0.00%> (-4.88%) ⬇️
...ller/validation/OfflineSegmentIntervalChecker.java 25.84% <0.00%> (-2.25%) ⬇️
...nMaxValueBasedSelectionOrderByCombineOperator.java 70.07% <0.00%> (-2.19%) ⬇️
...r/validation/RealtimeSegmentValidationManager.java 71.62% <0.00%> (-1.36%) ⬇️
...ces/PinotSegmentUploadDownloadRestletResource.java 48.96% <0.00%> (-0.83%) ⬇️
... and 1026 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@klsince klsince force-pushed the get_file_info_on_listing_s3 branch 2 times, most recently from e22435a to 1a4e822 Compare September 27, 2022 18:58
@klsince klsince force-pushed the get_file_info_on_listing_s3 branch from 1a4e822 to 599e109 Compare September 27, 2022 20:21
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jackie-Jiang Jackie-Jiang merged commit 9f92392 into apache:master Sep 27, 2022
@klsince klsince deleted the get_file_info_on_listing_s3 branch September 27, 2022 21:42
Comment on lines +442 to +443
// TODO: Looks like S3PinotFS filters out directories, inconsistent with the other implementations.
// Only add files and not directories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this would break if we enable recursive file ingestion on the ingestion task config?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more on how it will break?
In S3, there is no folder concept under a bucket (flat structure), and that's probably the reason why folder is not returned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question was about recursive listing, which actually works as expected.

The comment may be misleading. But it means that the returned list of Strings are just file paths, w/o subfolders' paths; it doesn't mean subfolders are skipped during recursive listing.

Copy link
Contributor

@walterddr walterddr Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. even though S3 is an object store it does provide the file-system like APIs. Users might not know the differences since they are all PinotFS impl to users and thus it is reasonable to expect same behavior

i think for here, either we dont allow recursive on S3 (explicitly throw exception if that config is given); or we can do the recursive properly (i dont know how or even if that's possible though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recursive listing is implemented in visitFiles() method below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants