
added dataset_size attribute to minari datasets #158

Merged
merged 10 commits into Farama-Foundation:main on Nov 22, 2023

Conversation

shreyansjainn
Contributor

@shreyansjainn shreyansjainn commented Oct 29, 2023

Description

Added a dataset size attribute to the Minari dataset generation functions and the CLI. Created as a complementary PR to #143, since the code refactoring introduced a lot of conflicts.

Fixes # (issue), Depends on # (pull request)

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Screenshots

Please attach before and after screenshots of the change if applicable.
To upload images to a PR -- simply drag and drop or copy paste.

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have run pytest -v and no errors are present.
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@shreyansjainn
Contributor Author

Hey @balisujohn @younik, I've created this complementary PR for #143 since the code refactoring introduced a lot of conflicts, so I decided to redo the changes on the new structure from scratch.

Also, regarding the last change John requested (computing the dataset size on the fly for a local dataset when the size is not present): I haven't implemented it yet, as it would require introducing a flag for the _show_dataset_table method, which felt a bit cluttered. Let me know whether you'd like me to go ahead with that change.

Do let me know if any further changes are required.

@shreyansjainn shreyansjainn marked this pull request as draft October 29, 2023 13:30
@shreyansjainn shreyansjainn marked this pull request as ready for review October 29, 2023 13:30
Member

@younik younik left a comment


Thanks for this PR; the code looks good, but I have a high-level doubt about how we want this to work.

The two possibilities are:

  • Adding the size to the metadata, and just reading it from there.
  • Always "computing" it. It is not a real computation, as the size is already stored in the file metadata, but you have to iterate through the files.

For now you are doing the first option (however, in get_dataset_size you are also computing it for the cloud, which would never be used, as you call get_dataset_size only at dataset creation).

I am personally more inclined to the second, simply because it is more robust to dataset changes; however, it may slow down minari list local and minari list remote.
Can you maybe time the get_dataset_size method for datasets with different numbers of files? (You can set a low max_episode_buffer on DataCollectorV0 to create datasets with many files.)
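The "always compute" option described above could be sketched as follows. This is a hypothetical standalone helper, not the actual Minari implementation; the function name and the assumption that a local dataset is just a directory of files are illustrative only:

```python
import os


def compute_dataset_size_mb(dataset_path: str) -> float:
    """Sum the on-disk size of every file under a local dataset directory.

    Sketch of the "always compute" option: no size is stored in the
    dataset metadata; instead we walk the directory tree and add up the
    per-file sizes the filesystem already tracks.
    """
    total_bytes = 0
    for root, _dirs, files in os.walk(dataset_path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return round(total_bytes / 1024**2, 1)
```

The trade-off debated in this thread is visible here: the walk is robust to files being added or removed after creation, but its cost grows with the number of files, which is why timing it for many-file datasets was requested.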

minari/dataset/minari_dataset.py Outdated Show resolved Hide resolved
minari/dataset/minari_storage.py Outdated Show resolved Hide resolved
minari/dataset/minari_storage.py Outdated Show resolved Hide resolved
minari/dataset/minari_storage.py Outdated Show resolved Hide resolved
@shreyansjainn
Contributor Author

For now you are doing the first option (however, in get_dataset_size you are also computing it for the cloud, which would never be used, as you call get_dataset_size only at dataset creation)

Yes, that is something we can remove, I think. I included it in the initial iteration of the method, but the requirements and the implementation changed after that and I forgot to remove it. In practice we will mainly be computing the dataset size locally.

I am personally more inclined to the second, simply because it is more robust to dataset changes; however, it may slow down minari list local and minari list remote.

Yes, this is why we went ahead with the first approach, where the commands can just fetch the info from the metadata without any impact on performance.

Can you maybe time the get_dataset_size method for datasets with different numbers of files? (You can set a low max_episode_buffer on DataCollectorV0 to create datasets with many files.)

I fear that timing might be an issue for bigger datasets and return an incomplete dataset size? From what I understand, if someone adds more files to an existing dataset (I wasn't sure that was possible; I thought the only way to update a dataset is to create a new version from scratch), this method will capture that but the first won't?
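The timing experiment requested above could be run with a quick self-contained harness along these lines. Everything here (the directory layout, file naming, and helper name) is illustrative and not the actual Minari code; it only mimics a dataset split into many small files, as a low max_episode_buffer on DataCollectorV0 would produce:

```python
import os
import tempfile
import time


def time_size_scan(num_files: int, file_size: int = 4096) -> float:
    """Create `num_files` dummy files and time one pass that sums their sizes.

    A rough stand-in for timing get_dataset_size on datasets split into
    many files. Returns the elapsed wall-clock seconds for a single scan.
    """
    root = tempfile.mkdtemp()
    for i in range(num_files):
        with open(os.path.join(root, f"episode_{i}.hdf5"), "wb") as f:
            f.write(b"\0" * file_size)

    start = time.perf_counter()
    total = sum(
        os.path.getsize(os.path.join(root, name)) for name in os.listdir(root)
    )
    elapsed = time.perf_counter() - start

    assert total == num_files * file_size  # sanity check on the scan itself
    return elapsed


# Compare scan time as the file count grows by an order of magnitude each step.
for n in (10, 100, 1000):
    print(f"{n:5d} files: {time_size_scan(n):.4f} s")
```

On most filesystems the scan is metadata-only (no file contents are read), so even thousands of files should scan in well under a second, which is the kind of evidence the review is asking for.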

@shreyansjainn
Contributor Author

@younik, I have implemented some corrections. One thing left to fix is the way the dataset size is written into the metadata for create_dataset_from_collector_env, so give me some time to correct that; I will update the unit tests accordingly too.

@younik
Member

younik commented Nov 2, 2023

@younik, I have implemented some corrections. One thing left to fix is the way the dataset size is written into the metadata for create_dataset_from_collector_env, so give me some time to correct that; I will update the unit tests accordingly too.

Sounds good, let me know when you are ready for another review.

@shreyansjainn
Contributor Author

@younik, I have implemented the changes in both dataset creation methods and updated the unit tests accordingly so that they cover that behaviour too. Do let me know if any other change is needed.

minari/dataset/minari_storage.py Outdated Show resolved Hide resolved
@younik
Member

younik commented Nov 21, 2023

Hey @shreyansjainn, what's the status of this? Do you need any help from my side?

@shreyansjainn
Contributor Author

@younik, I will work on this today or by tomorrow. I was on a break for the Diwali festival in India, and after that the work rush kept me occupied; I will finish this ASAP.

@shreyansjainn shreyansjainn requested a review from younik November 21, 2023 17:36
Member

@younik younik left a comment


Looks good to me; just do a small cleanup on the test code as suggested, if you can.

tests/dataset/test_minari_storage.py Outdated Show resolved Hide resolved
tests/dataset/test_minari_storage.py Outdated Show resolved Hide resolved
tests/dataset/test_minari_storage.py Outdated Show resolved Hide resolved
tests/dataset/test_minari_storage.py Outdated Show resolved Hide resolved
tests/dataset/test_minari_storage.py Outdated Show resolved Hide resolved
@shreyansjainn shreyansjainn requested a review from younik November 22, 2023 04:20
Member

@younik younik left a comment


LGTM, thanks!

@younik younik merged commit ee53b4c into Farama-Foundation:main Nov 22, 2023
6 checks passed