Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added option to copy data when combining datasets #82

Merged
merged 5 commits into from
May 31, 2023

Conversation

Howuhh
Copy link
Contributor

@Howuhh Howuhh commented May 28, 2023

Description

Related to #56

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have run pytest -v and no errors are present.
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Howuhh Howuhh marked this pull request as draft May 28, 2023 14:21
@Howuhh
Copy link
Contributor Author

Howuhh commented May 28, 2023

a bit uglier than initially, but unfortunately because of the limitations of h5py you can't do it like in the first commit

@Howuhh Howuhh marked this pull request as ready for review May 28, 2023 14:43
@balisujohn balisujohn self-requested a review May 28, 2023 21:31
Copy link
Collaborator

@balisujohn balisujohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks good, and it seems to perform as intended, creating combined datasets that are not broken when the source datasets are deleted.

I think also it would be good to include a test that makes sure the true-copied combined dataset is still intact after deleting the source datasets.

In a future PR, I think the true copy behavior should be made the default for the combine_datasets() function, and also the default for the combine command in the cli, though it should still be possible to enable the external link combination behavior in the cli with the optional flag --external-link I think we should make there be a check to see if combined datasets depend on other datasets, because right now you can break combined datasets with no warning by deleting their dependency datasets using minari delete.

Curious to hear your thoughts on this.

@Howuhh
Copy link
Contributor Author

Howuhh commented May 30, 2023

@balisujohn yup, I will add the test for deletion. I think the default behavior should still be to create a link, so that it does not take up too much disk space. Usually it is still a combination of datasets for the convenience of working with them, and not to move it somewhere or delete old ones.

@Howuhh Howuhh requested a review from balisujohn May 30, 2023 17:07
@balisujohn
Copy link
Collaborator

LGTM :^)

@balisujohn balisujohn merged commit 302288c into Farama-Foundation:main May 31, 2023
@balisujohn
Copy link
Collaborator

@Howuhh Thanks for your thoughts on this. I can understand the argument against switching to true copy as the primary method, so I'll hold off on making that change for the time being. I will make a task to add am optional command line flag to enable true copy to the cli command minari combine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants