Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BaseDataset and FileSystemDataset classes #76

Merged

Conversation

convexquad
Copy link
Collaborator

@convexquad convexquad commented Oct 10, 2024

Incorporate the changes from the reviews for #57 in this PR:

  • We leave the AbstractDataset class alone so that it remains a pure interface.
  • As suggested in the previous review, we create the BaseDataset class as a concrete implementation of AbstractDataset that is orthogonal to any filesystem or bucket-specific assumptions (it will rely on the storage and column file reader instances for this logic).

The S3Dataset class is updated so that it is a subclass of BaseDataset (there is still a little bit of S3-specific logic that is left in this class for backwards compatibility).

We create the new FileSystemDataset class, but it only exists so that its constructor arguments can be constrained to be compatible with local storage (e.g. the storage argument must have the type FileSystemDataStorage and the Arrow filesystem has to be a LocalFileSystem). However, I would be ok to remove this class if the reviewers feel it is unnecessary.

Testing:

  • Unit tests added to cover the new FileSystemDataset class and its new behavior to support local storage (together with BaseDataset).
  • Test training jobs: I cannot add a link to the training job since this is a public GitHub repository, but I have tested this PR in our ML repository on my branch dev/abain/test_wicker_s3 with run ID 9eb6efad-6c64-4260-abec-a5d3c0c59779. The test training job loads many S3Dataset instances. The test training job completed successfully.

I have been able to make this PR is totally orthogonal from #73 . Although I think #73 has good changes for Wicker, actually this PR is the one we really need to enable reading Wicker datasets from local filesystems.

Note: One thing Zhenyu proposed in his review of Isaak's PR is to change the user experience so that they use a factory to return instances of AbstractDataset. We want to leave this new user experience up to the platform team. However, the new factory methods should work well with the changes in this PR!

@convexquad convexquad added the enhancement New feature or request label Oct 10, 2024
@convexquad convexquad self-assigned this Oct 10, 2024
Alex Bain (Woven by Toyota added 2 commits October 10, 2024 11:16
@convexquad
Copy link
Collaborator Author

Thank you @zhenyu for the review so far! Since we have made some changes together, let me get another review when you can (especially for the new factory class and functions)!

@@ -40,6 +40,7 @@ def from_json(cls, data: Dict[str, Any]) -> BotoS3Config:

@dataclasses.dataclass(frozen=True)
class WickerAwsS3Config:
loaded: bool = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is not introduced by you, but I think the WickerAwsS3Config should be singleton instead of class.

@convexquad
Copy link
Collaborator Author

@zhenyu I fixed our new builder function to accept a dataset_config parameter to enable the user to easily specify the dataset type (and I updated the wickerconfig.json file to support FileSystemDataset configuration). Everything looks good. But now as you can see from my last comment I am wondering if the wickerconfig.json file should support multiple WickerAwsS3Config or WickerFileSystemConfig named entries, do you think this is a usecase or I am over-thinking it.

@convexquad
Copy link
Collaborator Author

convexquad commented Oct 22, 2024

@zhenyu I think I have got it. Now you can have wickerconfig.json like this:

{
  "aws_s3_config": {
    "s3_datasets_path": "s3://fake_data/",
    "region": "us-west-2",
    "boto_config": {
      "max_pool_connections":10,
      "read_timeout_s": 140,
      "connect_timeout_s": 140
    }
  },
  "filesystem_configs": [
    {
      "config_name": "filesystem_1",
      "root_datasets_path": "/mnt/bucket_1/"
    },
    {
      "config_name": "filesystem_2",
      "root_datasets_path": "/mnt/bucket_2/"
    }
  ],
  ... (other stuff)
}

There is still just the one aws_s3_config, but now you can have multiple filesystem configs for mounting multiple FileSystemDataset volumes into the training job. In the config, you have to give them a configuration_name property so that in the builder function you can know for which one you want to return a FileSystemDataset object.

return None


def build_dataset(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 . Minor thing lean to you is from the testable/readable perspective, I would think make 2 private sub function like _build_s3_dataset _build_filesystem_dataset.
Anyway, much better than the original S3Dataset interface. Thanks

@zhenyu
Copy link
Collaborator

zhenyu commented Oct 22, 2024

@convexquad Thanks so much for this PR. The class design is much better now although we could improve even further, let us do it little by little. For this PR, I am fine with the class design now. Please make sure enough test, the CI pipeline is not so convincing now.
Ping me for an approval, once you think the test enough. Thanks again

@convexquad
Copy link
Collaborator Author

@zhenyu thanks for all the reviews! I am running test cloud jobs to make sure everything is good with this. I might leave this PR open for a short while.

@convexquad
Copy link
Collaborator Author

convexquad commented Oct 29, 2024

@zhenyu could you give me an approving review for this PR? I have completed test training jobs with both FUSE-mounted local FileSystemDataset instances as well as S3Dataset instances from this branch.

I cannot add a link to the training job since this is a public GitHub repository, but I have tested this PR in our ML repository on my branch dev/abain/test_wicker_s3 with run ID 9eb6efad-6c64-4260-abec-a5d3c0c59779. The test training job loads many S3Dataset instances. The test training job completed successfully.

Copy link
Collaborator

@zhenyu zhenyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @convexquad for this PR

Copy link
Collaborator

@marccarre marccarre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this PR @convexquad! 🙇🏻‍♂️

Running test training jobs is in progress; they will be linked here.

What is the status on this front? 👀

@convexquad
Copy link
Collaborator Author

Thanks a lot for this PR @convexquad! 🙇🏻‍♂️

Running test training jobs is in progress; they will be linked here.

What is the status on this front? 👀

@marccarre thanks for checking, I updated the PR description with the following:

I cannot add a link to the training job since this is a public GitHub repository, but I have tested this PR in our ML repository on my branch dev/abain/test_wicker_s3 with run ID 9eb6efad-6c64-4260-abec-a5d3c0c59779. The test training job loads many S3Dataset instances. The test training job completed successfully.

@convexquad convexquad merged commit babcf9a into woven-planet:main Dec 11, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants