added dataset_size attribute to minari datasets #158
Conversation
Hey @balisujohn @younik, I've created this complementary PR for #143, as there were a lot of conflicts due to the code refactoring, so I thought of doing the changes on the new structure from scratch. Let me know if any more changes are required. Also, regarding the last change that John requested, computing the dataset size on the fly for a local dataset if the size is not present, I haven't implemented that yet, as I would have to introduce a flag for it. Do let me know if any more changes are required in this.
Thanks for this PR; the code looks good, but I have a high-level doubt about how we want this to work.
The two possibilities are:
- adding the size to the metadata and simply reading it from there
- always "computing" it; it is not a real computation, since the size is already stored in the file metadata, but you do have to iterate through the files
For now you are doing the first option (however, in get_dataset_size you are also computing the size for the cloud, which would never be used since you only call get_dataset_size at dataset creation).
I am personally more inclined to the second, simply because it is more robust to dataset changes; however, it may slow down minari list local and minari list remote.
Can you maybe time the get_dataset_size method for datasets with different numbers of files? (You can set a low max_episode_buffer on DataCollectorV0 to create datasets with many files.)
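As a rough illustration of the second option (and of the timing question), here is a minimal sketch that sums on-disk file sizes and times the call; the helper name, the directory walk, and the example dataset path are assumptions for illustration, not Minari's actual API:

```python
import os
import time


def get_dataset_size_mb(dataset_dir):
    """Sum the on-disk size of every file under the dataset directory, in MB."""
    total_bytes = 0
    for root, _, files in os.walk(dataset_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return round(total_bytes / (1024 * 1024), 1)


# Rough timing to see how the cost scales with the number of files in the dataset.
path = os.path.expanduser("~/.minari/datasets/example-dataset-v0")  # hypothetical path
start = time.perf_counter()
print(f"{get_dataset_size_mb(path)} MB in {time.perf_counter() - start:.4f} s")
```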
Yes, that is something we can remove, I think. I included it in the initial iteration of the method, but after that the requirements and the implementation changed and I forgot to remove it. In practice we will mainly be computing the dataset size locally.
Yes, this is why we went ahead with the first approach, where users can just fetch the info from the metadata without any impact on performance.
I fear that computing it on the fly might hurt performance for bigger datasets and could return an incomplete dataset size? From what I understand, if someone adds more files to an existing dataset (I wasn't sure that was possible; I thought the only way to update a dataset is by creating a new version of it from scratch), the second method will capture that but the first won't?
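For reference, a minimal sketch of the first approach, persisting the computed size as an HDF5 attribute so later reads need no file walk; the attribute key and helper names are illustrative assumptions, not necessarily what the PR implements:

```python
import h5py


def write_dataset_size(h5_path, size_mb):
    """Store the size once, at dataset creation time."""
    with h5py.File(h5_path, "a") as f:
        f.attrs["dataset_size"] = size_mb


def read_dataset_size(h5_path):
    """Read the stored size back; returns None if it was never written."""
    with h5py.File(h5_path, "r") as f:
        return f.attrs.get("dataset_size")
```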
@younik, I have implemented some corrections. One thing left to correct is the way the dataset size is getting written in the metadata for
Sounds good, let me know when you are ready for another review.
@younik, I have implemented the changes in both of the create-dataset methods and changed the unit tests accordingly, so that they test this behavior too. Do let me know if any other change is needed.
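A rough shape for such a test, reusing the helpers sketched above; the main_data.hdf5 file name and the toy data are hypothetical stand-ins for the project's actual fixtures:

```python
import h5py
import pytest


def test_dataset_size_matches_recomputation(tmp_path):
    """The size stored in metadata should agree with a fresh recomputation from disk."""
    data_file = tmp_path / "main_data.hdf5"           # assumed data file name
    with h5py.File(data_file, "w") as f:              # build a tiny stand-in dataset
        f.create_dataset("observations", data=list(range(1000)))
    size_mb = get_dataset_size_mb(tmp_path)           # helper sketched earlier in the thread
    write_dataset_size(data_file, size_mb)
    assert read_dataset_size(data_file) == pytest.approx(size_mb)
```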
Hey @shreyansjainn, what's the status of this? Do you need any help from my side?
@younik, I will work on this today or tomorrow. I was on a break for the Diwali festival in India, and after that the work rush kept me occupied; I will finish this ASAP.
Looks good to me; just do the small cleanup of the test code as suggested.
LGTM, thanks!
Description
Added a dataset size attribute to the Minari dataset generation functions and the CLI. Created this as a complementary PR to #143, as the code refactoring had introduced a lot of conflicts.
Fixes # (issue), Depends on # (pull request)
Type of change
Screenshots
Checklist:
- I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
- I have run pytest -v and no errors are present.
- I have solved any possible warnings that pytest -v has generated that are related to my code, to the best of my knowledge.