Shard Dataset at specific indices #7415

nikonikolov · 2025-02-20T10:43:10Z

I have a dataset of sequences, where each example in the sequence is a separate row in the dataset (similar to LeRobotDataset). When running Dataset.save_to_disk, how can I provide the indices at which the dataset may be sharded, so that no episode spans more than one shard? Consequently, when I run Dataset.load_from_disk, how can I load just a subset of the shards to save memory and time on different ranks?

I guess an alternative to this would be, given a loaded Dataset, how can I run Dataset.shard such that sharding doesn't split any episode across shards?


lhoestq · 2025-02-20T14:16:00Z

Hi! If it's an option, I'd suggest having one sequence per row instead.

Otherwise you'd have to make your own save/load mechanism.

nikonikolov · 2025-02-20T15:24:52Z

Saving one sequence per row is very difficult and heavy and makes all the optimizations pointless. What would a custom save/load mechanism look like?
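For illustration, one possible shape for such a mechanism is sketched below. It is not an official datasets feature and was not proposed in this thread. It assumes the rows of each episode are contiguous and that an integer column (hypothetically named episode_index here) identifies the episode. The helper names, the shard_00000 directory layout, and the shards.json manifest are all made up for the example. The only library calls used are Dataset.select, Dataset.save_to_disk, load_from_disk, and concatenate_datasets.

```python
import json
import os


def episode_aligned_ranges(episode_ids, num_shards):
    """Split rows into at most num_shards contiguous (start, end) ranges,
    cutting only at rows where the episode id changes."""
    n = len(episode_ids)
    if n == 0:
        return []
    # Row indices at which a new episode begins (row 0 always qualifies).
    starts = [0] + [i for i in range(1, n) if episode_ids[i] != episode_ids[i - 1]]
    cuts = [0]
    for k in range(1, num_shards):
        target = k * n / num_shards  # where an even split would cut
        cut = min(starts, key=lambda s: abs(s - target))  # snap to an episode start
        if cut > cuts[-1]:  # skip duplicate cuts when episodes are long
            cuts.append(cut)
    cuts.append(n)
    return list(zip(cuts[:-1], cuts[1:]))


def save_episode_shards(ds, out_dir, num_shards, episode_column="episode_index"):
    """Hypothetical helper: write each episode-aligned range as its own dataset directory."""
    ranges = episode_aligned_ranges(list(ds[episode_column]), num_shards)
    os.makedirs(out_dir, exist_ok=True)
    for i, (start, end) in enumerate(ranges):
        ds.select(range(start, end)).save_to_disk(os.path.join(out_dir, f"shard_{i:05d}"))
    # Manifest so that loaders know how many shards exist and which rows they hold.
    with open(os.path.join(out_dir, "shards.json"), "w") as f:
        json.dump(ranges, f)
    return ranges


def load_episode_shards(out_dir, rank, world_size):
    """Hypothetical helper: load only the shards assigned to this rank (round-robin)."""
    from datasets import concatenate_datasets, load_from_disk

    with open(os.path.join(out_dir, "shards.json")) as f:
        num_shards = len(json.load(f))
    mine = [
        load_from_disk(os.path.join(out_dir, f"shard_{i:05d}"))
        for i in range(rank, num_shards, world_size)
    ]
    # concatenate_datasets raises on an empty list, so each rank needs at least one shard.
    return concatenate_datasets(mine)


# Ten rows, four episodes, three shards.
print(episode_aligned_ranges([0, 0, 0, 1, 1, 2, 2, 2, 2, 3], 3))
```

Snapping each cut to the nearest episode start keeps the shards roughly balanced in rows while guaranteeing that no episode is split. If episodes are long relative to the shard size, fewer than num_shards shards may be produced, which is why the loader reads the shard count from the manifest rather than assuming it.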
