Add BaseDataset and FileSystemDataset classes #76
Thank you @zhenyu for the review so far! Since we have made some changes together, could I get another review when you can (especially for the new factory class and functions)?
wicker/core/config.py
Outdated
@@ -40,6 +40,7 @@ def from_json(cls, data: Dict[str, Any]) -> BotoS3Config:


@dataclasses.dataclass(frozen=True)
class WickerAwsS3Config:
    loaded: bool = False
I know this was not introduced by you, but I think WickerAwsS3Config should be a singleton rather than a plain class.
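One way to read the singleton suggestion above is a cached accessor that parses the config once and hands back the same frozen instance on every call. The following is only a minimal sketch under stated assumptions: `ExampleS3Config`, its fields, `_RAW_CONFIG`, and `get_s3_config` are hypothetical names for illustration and are not taken from Wicker.

```python
import dataclasses
import functools
import json
from typing import Any, Dict


@dataclasses.dataclass(frozen=True)
class ExampleS3Config:
    """Hypothetical stand-in for a config class; the field names are assumptions."""

    s3_datasets_path: str
    region: str

    @classmethod
    def from_json(cls, data: Dict[str, Any]) -> "ExampleS3Config":
        return cls(s3_datasets_path=data["s3_datasets_path"], region=data["region"])


# Hypothetical payload; a real loader would read this from a config file.
_RAW_CONFIG = '{"s3_datasets_path": "s3://example-bucket/datasets", "region": "us-west-2"}'


@functools.lru_cache(maxsize=None)
def get_s3_config() -> ExampleS3Config:
    """Parse the config on first use; later calls return the cached instance."""
    return ExampleS3Config.from_json(json.loads(_RAW_CONFIG))
```

With this shape, callers never construct the config directly, so no `loaded` flag is needed on the dataclass itself; the cache plays that role.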
@zhenyu I fixed our new builder function to accept a |
@zhenyu I think I have got it. Now you can have
There is still just the one |
    return None


def build_dataset(
👍 One minor suggestion, which I leave to you: from a testability/readability perspective, I would split this into two private helper functions, such as _build_s3_dataset and _build_filesystem_dataset.
Anyway, this is much better than the original S3Dataset interface. Thanks
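The split suggested above could look roughly like the sketch below. Only the three function names come from this thread; the parameters (`storage_kind`, `name`, `version`, `root_path`) and the dictionary return values are assumptions made for illustration, since the real `build_dataset` signature is not shown here.

```python
from typing import Any, Dict, Optional


def _build_s3_dataset(name: str, version: str) -> Dict[str, Any]:
    # Placeholder: real code would construct and return an S3Dataset here.
    return {"kind": "s3", "name": name, "version": version}


def _build_filesystem_dataset(name: str, version: str, root_path: str) -> Dict[str, Any]:
    # Placeholder: real code would construct and return a FileSystemDataset here.
    return {"kind": "filesystem", "name": name, "version": version, "root_path": root_path}


def build_dataset(
    storage_kind: str,
    name: str,
    version: str,
    root_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Dispatch to one private helper per storage backend (hypothetical signature)."""
    if storage_kind == "s3":
        return _build_s3_dataset(name, version)
    if storage_kind == "filesystem":
        if root_path is None:
            raise ValueError("root_path is required for filesystem datasets")
        return _build_filesystem_dataset(name, version, root_path)
    raise ValueError(f"unsupported storage kind: {storage_kind!r}")
```

Each helper can then be unit-tested on its own, and the public function stays a short dispatch table.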
@convexquad Thanks so much for this PR. The class design is much better now. We could improve it even further, but let us do that little by little; for this PR, I am fine with the class design as it is. Please make sure there is enough test coverage, since the CI pipeline is not very convincing right now.
@zhenyu thanks for all the reviews! I am running test cloud jobs to make sure everything is good with this. I might leave this PR open for a short while. |
@zhenyu could you give me an approving review for this PR? I have completed test training jobs with both FUSE-mounted local I cannot add a link to the training job since this is a public GitHub repository, but I have tested this PR in our ML repository on my branch |
Thanks @convexquad for this PR
Thanks a lot for this PR @convexquad! 🙇🏻♂️
Running test training jobs is in progress; they will be linked here.
What is the status on this front? 👀
@marccarre thanks for checking, I updated the PR description with the following:
Incorporate the changes from the reviews for #57 in this PR:
- Leave the AbstractDataset class alone so that it remains a pure interface.
- Add the BaseDataset class as a concrete implementation of AbstractDataset that is orthogonal to any filesystem or bucket-specific assumptions (it relies on the storage and column file reader instances for this logic).
- Update the S3Dataset class so that it is a subclass of BaseDataset (a little S3-specific logic is left in this class for backwards compatibility).
- Create the new FileSystemDataset class, which exists only so that its constructor arguments can be constrained to be compatible with local storage (e.g. the storage argument must have the type FileSystemDataStorage and the Arrow filesystem has to be a LocalFileSystem). However, I would be ok to remove this class if the reviewers feel it is unnecessary.

Testing:
- FileSystemDataset class and its new behavior to support local storage (together with BaseDataset).
- dev/abain/test_wicker_s3 with run ID 9eb6efad-6c64-4260-abec-a5d3c0c59779. The test training job loads many S3Dataset instances and completed successfully.

I have been able to make this PR totally orthogonal to #73. Although I think #73 has good changes for Wicker, this PR is the one we really need to enable reading Wicker datasets from local filesystems.

Note: One thing Zhenyu proposed in his review of Isaak's PR is to change the user experience so that users call a factory that returns instances of AbstractDataset. We want to leave this new user experience up to the platform team. However, the new factory methods should work well with the changes in this PR!
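The class layout described in the PR description above can be sketched as follows. These are simplified stand-ins that reuse the class names from the description, not Wicker's actual implementation: the `InMemoryStorage` helper, the `count`/`fetch` methods, and the single-argument constructors are assumptions made only to keep the example self-contained.

```python
import abc
from typing import Any, Dict, Iterable


class AbstractDataset(abc.ABC):
    """Pure interface: no storage or filesystem assumptions."""

    @abc.abstractmethod
    def __len__(self) -> int: ...

    @abc.abstractmethod
    def __getitem__(self, idx: int) -> Dict[str, Any]: ...


class InMemoryStorage:
    """Hypothetical storage stand-in; count/fetch are invented for this sketch."""

    def __init__(self, rows: Iterable[Dict[str, Any]]) -> None:
        self._rows = list(rows)

    def count(self) -> int:
        return len(self._rows)

    def fetch(self, idx: int) -> Dict[str, Any]:
        return self._rows[idx]


class FileSystemDataStorage(InMemoryStorage):
    """Stand-in sharing the name of the local-storage type from the description."""


class BaseDataset(AbstractDataset):
    """Concrete dataset that delegates all backend logic to its storage object."""

    def __init__(self, storage: InMemoryStorage) -> None:
        self._storage = storage

    def __len__(self) -> int:
        return self._storage.count()

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        return self._storage.fetch(idx)


class S3Dataset(BaseDataset):
    """Subclass where any remaining S3-specific logic would live."""


class FileSystemDataset(BaseDataset):
    """Exists only to constrain the constructor to local storage."""

    def __init__(self, storage: FileSystemDataStorage) -> None:
        if not isinstance(storage, FileSystemDataStorage):
            raise TypeError("FileSystemDataset requires a FileSystemDataStorage")
        super().__init__(storage)
```

In this sketch the only behavior FileSystemDataset adds is the constructor check, which mirrors the point in the description that the class could be removed without changing how data is read.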