Add BaseDataset and FileSystemDataset classes #76

convexquad · 2024-10-10T17:08:22Z

Incorporate the changes from the reviews for #57 in this PR:

We leave the AbstractDataset class alone so that it remains a pure interface.
As suggested in the previous review, we create the BaseDataset class as a concrete implementation of AbstractDataset that is orthogonal to any filesystem or bucket-specific assumptions (it will rely on the storage and column file reader instances for this logic).

The S3Dataset class is updated so that it is a subclass of BaseDataset (there is still a little bit of S3-specific logic that is left in this class for backwards compatibility).

We create the new FileSystemDataset class, but it exists only so that its constructor arguments can be constrained to be compatible with local storage (e.g. the storage argument must have the type FileSystemDataStorage and the Arrow filesystem must be a LocalFileSystem). However, I would be OK with removing this class if the reviewers feel it is unnecessary.
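
A rough sketch of the class layout described above, for orientation only. The method names (`__len__`, `__getitem__`), the constructor signatures, and the stand-in `DataStorage` and `FileSystemDataStorage` classes are assumptions made for this example, not Wicker's actual definitions. The Arrow `LocalFileSystem` check is left out to keep the sketch dependency-free.

```python
import abc
from typing import Any, Dict, List


class AbstractDataset(abc.ABC):
    """Pure interface: no storage assumptions live here."""

    @abc.abstractmethod
    def __len__(self) -> int: ...

    @abc.abstractmethod
    def __getitem__(self, idx: int) -> Dict[str, Any]: ...


class DataStorage:
    """Hypothetical stand-in for a generic storage backend."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self.rows = rows


class FileSystemDataStorage(DataStorage):
    """Hypothetical stand-in for the local-filesystem storage type."""


class BaseDataset(AbstractDataset):
    """Concrete dataset that delegates all I/O to the storage instance."""

    def __init__(self, storage: DataStorage) -> None:
        self._storage = storage

    def __len__(self) -> int:
        return len(self._storage.rows)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        return self._storage.rows[idx]


class FileSystemDataset(BaseDataset):
    """Adds no behavior; it only narrows what the constructor accepts."""

    def __init__(self, storage: DataStorage) -> None:
        if not isinstance(storage, FileSystemDataStorage):
            raise TypeError("storage must be a FileSystemDataStorage")
        super().__init__(storage)
```

In this sketch, `FileSystemDataset` rejects a plain `DataStorage` at construction time, which is the only thing it contributes beyond `BaseDataset`.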

Testing:

Unit tests added to cover the new FileSystemDataset class and its new behavior to support local storage (together with BaseDataset).
Test training jobs: I cannot add a link to the training job since this is a public GitHub repository, but I have tested this PR in our ML repository on my branch dev/abain/test_wicker_s3 with run ID 9eb6efad-6c64-4260-abec-a5d3c0c59779. The test training job loads many S3Dataset instances. The test training job completed successfully.

I have been able to make this PR totally orthogonal to #73. Although I think #73 has good changes for Wicker, this PR is the one we really need to enable reading Wicker datasets from local filesystems.

Note: One thing Zhenyu proposed in his review of Isaak's PR is to change the user experience so that users obtain AbstractDataset instances from a factory. We want to leave this new user experience up to the platform team. However, the new factory methods should work well with the changes in this PR!

convexquad · 2024-10-21T19:38:53Z

Thank you @zhenyu for the review so far! Since we have made some changes together, please take another look when you can (especially at the new factory class and functions)!

zhenyu · 2024-10-22T04:43:13Z

wicker/core/config.py

@@ -40,6 +40,7 @@ def from_json(cls, data: Dict[str, Any]) -> BotoS3Config:

 @dataclasses.dataclass(frozen=True)
 class WickerAwsS3Config:
+    loaded: bool = False


I know you did not introduce this, but I think WickerAwsS3Config should be a singleton rather than a plain class.
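
One possible reading of the singleton suggestion is a cached accessor that always hands back the same frozen config instance. The sketch below is hypothetical: `AwsS3ConfigSketch`, `get_aws_s3_config`, and the hard-coded values are assumptions for illustration and are not part of Wicker.

```python
import dataclasses
import functools


@dataclasses.dataclass(frozen=True)
class AwsS3ConfigSketch:
    """Hypothetical, simplified stand-in for WickerAwsS3Config."""

    s3_datasets_path: str
    region: str


@functools.lru_cache(maxsize=None)
def get_aws_s3_config() -> AwsS3ConfigSketch:
    # A real loader would read wickerconfig.json; values are hard-coded here.
    return AwsS3ConfigSketch(s3_datasets_path="s3://fake_data/", region="us-west-2")
```

Because the accessor is cached, every caller receives the identical object, so the config is loaded at most once per process.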

convexquad · 2024-10-22T18:04:13Z

@zhenyu I fixed our new builder function to accept a dataset_config parameter so that the user can easily specify the dataset type (and I updated the wickerconfig.json file to support FileSystemDataset configuration). Everything looks good. But now, as you can see from my last comment, I am wondering whether the wickerconfig.json file should support multiple named WickerAwsS3Config or WickerFileSystemConfig entries. Do you think this is a real use case, or am I overthinking it?

convexquad · 2024-10-22T22:24:49Z

@zhenyu I think I have got it. Now you can have wickerconfig.json like this:

{
  "aws_s3_config": {
    "s3_datasets_path": "s3://fake_data/",
    "region": "us-west-2",
    "boto_config": {
      "max_pool_connections":10,
      "read_timeout_s": 140,
      "connect_timeout_s": 140
    }
  },
  "filesystem_configs": [
    {
      "config_name": "filesystem_1",
      "root_datasets_path": "/mnt/bucket_1/"
    },
    {
      "config_name": "filesystem_2",
      "root_datasets_path": "/mnt/bucket_2/"
    }
  ],
  ... (other stuff)
}

There is still just the one aws_s3_config, but now you can have multiple filesystem configs for mounting multiple FileSystemDataset volumes into the training job. In the config, you have to give each of them a config_name property so that the builder function knows which one to use when returning a FileSystemDataset object.
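
As a minimal sketch of how a named filesystem entry could be looked up from a config shaped like the one above: the helper `find_filesystem_config` is hypothetical and only illustrates the lookup by `config_name`; it is not the function used in this PR.

```python
import json
from typing import Any, Dict

# Trimmed copy of the example config shown above.
EXAMPLE_CONFIG = """
{
  "aws_s3_config": {"s3_datasets_path": "s3://fake_data/", "region": "us-west-2"},
  "filesystem_configs": [
    {"config_name": "filesystem_1", "root_datasets_path": "/mnt/bucket_1/"},
    {"config_name": "filesystem_2", "root_datasets_path": "/mnt/bucket_2/"}
  ]
}
"""


def find_filesystem_config(config: Dict[str, Any], config_name: str) -> Dict[str, Any]:
    """Return the filesystem entry whose config_name matches, else raise KeyError."""
    for entry in config.get("filesystem_configs", []):
        if entry["config_name"] == config_name:
            return entry
    raise KeyError(f"no filesystem config named {config_name!r}")


config = json.loads(EXAMPLE_CONFIG)
print(find_filesystem_config(config, "filesystem_2")["root_datasets_path"])
# prints "/mnt/bucket_2/"
```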

zhenyu · 2024-10-22T22:41:42Z

wicker/core/datasets.py

+    return None
+
+
+def build_dataset(


👍 One minor suggestion, which I will leave up to you: for testability and readability, I would split this into two private helper functions, such as _build_s3_dataset and _build_filesystem_dataset.
Anyway, this is much better than the original S3Dataset interface. Thanks.
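
The suggested split could look something like the sketch below. Only the names `build_dataset`, `_build_s3_dataset`, `_build_filesystem_dataset`, and the `dataset_config` parameter come from this conversation. The signatures, the two config classes, and the string return values (standing in for real dataset objects) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class S3ConfigSketch:
    """Hypothetical stand-in for the S3 dataset configuration."""

    s3_datasets_path: str


@dataclass(frozen=True)
class FileSystemConfigSketch:
    """Hypothetical stand-in for one named filesystem configuration."""

    config_name: str
    root_datasets_path: str


def _build_s3_dataset(name: str, config: S3ConfigSketch) -> str:
    # Placeholder: returns a description instead of a real S3Dataset.
    return f"S3Dataset({config.s3_datasets_path}{name})"


def _build_filesystem_dataset(name: str, config: FileSystemConfigSketch) -> str:
    # Placeholder: returns a description instead of a real FileSystemDataset.
    return f"FileSystemDataset({config.root_datasets_path}{name})"


def build_dataset(
    name: str, dataset_config: Union[S3ConfigSketch, FileSystemConfigSketch]
) -> str:
    """Dispatch on the config type; each branch can be unit-tested separately."""
    if isinstance(dataset_config, S3ConfigSketch):
        return _build_s3_dataset(name, dataset_config)
    if isinstance(dataset_config, FileSystemConfigSketch):
        return _build_filesystem_dataset(name, dataset_config)
    raise TypeError(f"unsupported dataset config: {type(dataset_config).__name__}")
```

Keeping the public function as a thin dispatcher means each private helper can be tested with its own config type, without exercising the other branch.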

zhenyu · 2024-10-22T22:45:15Z

@convexquad Thanks so much for this PR. The class design is much better now. We could improve it even further, but let us do that little by little. For this PR, I am fine with the class design as it is. Please make sure there is enough testing, since the CI pipeline alone is not very convincing right now.
Ping me for approval once you think the testing is sufficient. Thanks again.

convexquad · 2024-10-26T01:09:02Z

@zhenyu thanks for all the reviews! I am running test cloud jobs to make sure everything is good with this. I might leave this PR open for a short while.

convexquad · 2024-10-29T21:48:16Z

@zhenyu could you give me an approving review for this PR? I have completed test training jobs with both FUSE-mounted local FileSystemDataset instances as well as S3Dataset instances from this branch.

I cannot add a link to the training job since this is a public GitHub repository, but I have tested this PR in our ML repository on my branch dev/abain/test_wicker_s3 with run ID 9eb6efad-6c64-4260-abec-a5d3c0c59779. The test training job loads many S3Dataset instances. The test training job completed successfully.

zhenyu

Thanks @convexquad for this PR

marccarre

Thanks a lot for this PR @convexquad! 🙇🏻‍♂️

Running test training jobs is in progress; they will be linked here.

What is the status on this front? 👀

convexquad · 2024-12-11T02:33:04Z

> Thanks a lot for this PR @convexquad! 🙇🏻‍♂️
>
> > Running test training jobs is in progress; they will be linked here.
>
> What is the status on this front? 👀

@marccarre thanks for checking, I updated the PR description with the following:

I cannot add a link to the training job since this is a public GitHub repository, but I have tested this PR in our ML repository on my branch dev/abain/test_wicker_s3 with run ID 9eb6efad-6c64-4260-abec-a5d3c0c59779. The test training job loads many S3Dataset instances. The test training job completed successfully.

Add BaseDataset and FileSystemDataset classes

e922f19

convexquad added the enhancement New feature or request label Oct 10, 2024

convexquad self-assigned this Oct 10, 2024

Fix tests

82cf3f2

convexquad marked this pull request as ready for review October 10, 2024 18:08

convexquad requested review from aalavian, anantsimran, chrisochoatri, marccarre and pickles-bread-and-butter as code owners October 10, 2024 18:08

Alex Bain (Woven by Toyota) added 2 commits October 10, 2024 11:16

Fix tests

53b2a35

Type fix

88f32e4

zhenyu reviewed Oct 11, 2024

Add column_bytes_file_reader parameter to BaseDataset init

d341763

zhenyu reviewed Oct 21, 2024

Alex Bain (Woven by Toyota) added 2 commits October 21, 2024 12:08

Add DatasetFactory class to hide implementation details

601134a

Fix test

631a2b4

zhenyu reviewed Oct 21, 2024

Replace factory class with builder function

c84a0a0

zhenyu reviewed Oct 22, 2024

Alex Bain (Woven by Toyota) added 2 commits October 22, 2024 10:51

Fix builder function to accept a dataset configuration type and update tests

9ca953a

Fix type-check

5b9060e

Support config for one S3Dataset and multiple FileSystemDatasets

806c95b

convexquad mentioned this pull request Oct 22, 2024

Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions #57

Closed

zhenyu reviewed Oct 22, 2024

zhenyu approved these changes Oct 29, 2024

marccarre approved these changes Nov 18, 2024

Version bump

7351e28

convexquad merged commit babcf9a into woven-planet:main Dec 11, 2024
2 checks passed
