Shard Dataset at specific indices #7415

Open
nikonikolov opened this issue Feb 20, 2025 · 2 comments


@nikonikolov

I have a dataset of sequences, where each example in a sequence is a separate row in the dataset (similar to LeRobotDataset). When running Dataset.save_to_disk, how can I provide the indices at which the dataset may be sharded, so that no episode spans more than one shard? And when I then run Dataset.load_from_disk, how can I load just a subset of the shards to save memory and time on different ranks?

I guess an alternative would be: given a loaded Dataset, how can I run Dataset.shard so that no episode is split across shards?
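To make the question concrete, here is a minimal sketch of how episode-aligned shard boundaries could be computed in plain Python. It is not part of the `datasets` API: the function name `episode_aligned_shards` is hypothetical, and it assumes there is a per-row episode identifier and that all rows of an episode are stored contiguously.

```python
from typing import List, Sequence, Tuple


def episode_aligned_shards(
    episode_ids: Sequence[int], num_shards: int
) -> List[Tuple[int, int]]:
    """Return [start, stop) row ranges that never cut through an episode.

    Assumes rows belonging to the same episode are contiguous.
    May return fewer than `num_shards` ranges if there are too few episodes.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    n_rows = len(episode_ids)
    if n_rows == 0:
        return []

    # Row indices at which a new episode begins; these are the only legal cuts.
    starts = [
        i for i in range(n_rows)
        if i == 0 or episode_ids[i] != episode_ids[i - 1]
    ]

    cuts = [0]
    for k in range(1, num_shards):
        # Ideal cut for equally sized shards, snapped to the nearest episode start.
        target = k * n_rows / num_shards
        best = min(starts, key=lambda s: abs(s - target))
        if best > cuts[-1]:  # skip cuts that would create an empty shard
            cuts.append(best)
    cuts.append(n_rows)
    return list(zip(cuts[:-1], cuts[1:]))


# Hypothetical per-row episode ids for a 10-row dataset with 4 episodes.
example_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
example_ranges = episode_aligned_shards(example_ids, num_shards=3)
```

Each `(start, stop)` pair could then be passed to `Dataset.select(range(start, stop))` to obtain one episode-aligned piece of a loaded Dataset. Shard sizes are only approximately equal, since cuts are restricted to episode boundaries.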

@lhoestq
Member

lhoestq commented Feb 20, 2025

Hi! If it's an option, I'd suggest having one sequence per row instead.

Otherwise, you'd have to write your own save/load mechanism.
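For illustration, the "one sequence per row" layout suggested above could look roughly like the sketch below, which collapses per-step rows into one row per episode. The column names `episode_id` and `action` and the helper `rows_to_sequences` are hypothetical, and the sketch assumes the rows of an episode are contiguous.

```python
from itertools import groupby
from typing import Any, Dict, List


def rows_to_sequences(
    rows: List[Dict[str, Any]], key: str = "episode_id"
) -> List[Dict[str, Any]]:
    """Collapse contiguous rows sharing `key` into one row holding lists."""
    sequences = []
    for episode_id, group in groupby(rows, key=lambda r: r[key]):
        steps = list(group)
        seq: Dict[str, Any] = {key: episode_id}
        # Every other column becomes a list with one entry per step.
        for col in steps[0]:
            if col != key:
                seq[col] = [step[col] for step in steps]
        sequences.append(seq)
    return sequences


# Hypothetical per-step rows: two steps in episode 0, one step in episode 1.
step_rows = [
    {"episode_id": 0, "action": 1},
    {"episode_id": 0, "action": 2},
    {"episode_id": 1, "action": 3},
]
sequence_rows = rows_to_sequences(step_rows)
```

With this layout every row is a whole episode, so any row-based sharding keeps episodes intact by construction.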

@nikonikolov
Author

Saving one sequence per row is very difficult and heavy, and it makes all the optimizations pointless. What would a custom save/load mechanism look like?
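One possible shape for such a mechanism, offered only as a sketch and not as anything the maintainers proposed: save each episode-aligned row range to its own sub-directory, record the layout in a small JSON manifest, and let each rank read only the directories assigned to it. The manifest format, the directory naming, and the helpers `write_manifest` and `shard_dirs_for_rank` are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import List, Tuple


def write_manifest(root: str, ranges: List[Tuple[int, int]]) -> Path:
    """Record one sub-directory name and row range per shard in root/manifest.json."""
    root_path = Path(root)
    root_path.mkdir(parents=True, exist_ok=True)
    shards = [
        {"dir": f"shard_{i:05d}", "start": start, "stop": stop}
        for i, (start, stop) in enumerate(ranges)
    ]
    manifest_path = root_path / "manifest.json"
    manifest_path.write_text(json.dumps({"shards": shards}, indent=2))
    return manifest_path


def shard_dirs_for_rank(root: str, rank: int, world_size: int) -> List[Path]:
    """Return the shard directories assigned to `rank`, distributed round-robin."""
    manifest = json.loads((Path(root) / "manifest.json").read_text())
    return [
        Path(root) / shard["dir"]
        for i, shard in enumerate(manifest["shards"])
        if i % world_size == rank
    ]
```

On the saving side, each range could be written with `ds.select(range(start, stop)).save_to_disk(shard_dir)`. On the loading side, each rank could call `load_from_disk` on its own directories and combine the results with `concatenate_datasets`, so it never touches the other ranks' shards. Round-robin assignment is just one choice; balancing by row count would need the `start` and `stop` values stored in the manifest.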
