How to split large datasets #935
Comments
Not sure if this will help, but this reminds me of the BEP35 MEGA, which tries to deal with datasets of datasets: https://docs.google.com/document/d/10RpFFsG_ESj0orGIqJ0yK4DTpeoyrmd4RjX2ER-P31k/edit?usp=sharing To my knowledge, there is no dedicated
Could you provide some more detail about the situations in which performance issues arise? On what level are these difficulties? Only with specific software? Or on certain operating systems?
@Remi-Gau, very interesting BEP035, thanks! I think our question here is a lot simpler though. In BEP035, they need to track "new" information about the different studies included in their dataset. In our case, the only additional information would be the relationship between datasets that were split. A dedicated field in dataset description could help, but maybe it is not absolutely necessary, the info could simply be included in the readme file.
@sappelhoff, I'm afraid I don't have many practical details about this. In one case it was for version tracking with datalad, with tens of thousands of images in a single dataset. So the question is more: when researchers need to split the dataset across folders for efficiency reasons in their workflow, should we provide a mechanism to keep track of this? In any case, this is not a problem for the microscopy extension (hence the separate issue), but I wanted to share the question here to open the discussion.
Assuming you aren't running out of inodes and just need an extra level of hierarchy for logistical purposes, I might go with session. It's intended to be very flexible, though "I ran out of
If you already had sessions
In either case, you could make each session directory a datalad subdataset, to avoid git limits. Now you have a coherent view on the data, although a data host would need to be prepared to accept hierarchical datasets of this sort. That said, a reader/tool would need to know to treat groups of sessions as a unit. We don't really have a concept that is understood to mean "you can treat these two directories as if they were combined". If that is something that's generally useful, I might propose the entity
You could imagine doing this within That said, being able to associate multiple datasets with relations seems like a good idea. This seems like it maps onto the idea of
@effigies, thanks for the feedback! I like the
I like |
While working on the Microscopy BEP, it was brought to our attention that some very large microscopy datasets sometimes need to be split across different folders, for example because of limitations or performance issues with large files or a large number of files in a single repository.
I was wondering if this issue has come up in BIDS in the past and if there is an official mechanism for dealing with such situations?
Here is an example to illustrate my thoughts.
In this example, one subject (sub-01) has 2000 samples (sample-0001 to sample-2000), and each sample has 20 chunks (chunk-01 to chunk-20), as illustrated below:
Let’s say that the dataset needs to be split in 2. I would suggest splitting the dataset with the first 1000 samples in one dataset (dataset1) and samples 1001 to 2000 in another dataset (dataset2), as follows:
Would that splitting method make sense with BIDS?
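For concreteness, the proposed split can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names (dataset_for, chunk_prefix) and the exact directory layout are hypothetical, the path prefix deliberately omits any suffix or file extension, and nothing here is an official BIDS mechanism.

```python
from pathlib import PurePosixPath

N_SAMPLES = 2000  # sample-0001 ... sample-2000, as in the example above
N_CHUNKS = 20     # chunk-01 ... chunk-20
SPLIT_AT = 1000   # last sample index placed in dataset1


def dataset_for(sample_index: int) -> str:
    """Return the dataset that holds a given 1-based sample index."""
    if not 1 <= sample_index <= N_SAMPLES:
        raise ValueError(f"sample index out of range: {sample_index}")
    return "dataset1" if sample_index <= SPLIT_AT else "dataset2"


def chunk_prefix(sample_index: int, chunk_index: int) -> PurePosixPath:
    """Build a hypothetical path prefix for one chunk of one sample.

    The layout (dataset/sub-01/<entities>) is an assumption made for
    illustration; suffix and extension are intentionally left out.
    """
    if not 1 <= chunk_index <= N_CHUNKS:
        raise ValueError(f"chunk index out of range: {chunk_index}")
    name = f"sub-01_sample-{sample_index:04d}_chunk-{chunk_index:02d}"
    return PurePosixPath(dataset_for(sample_index)) / "sub-01" / name


# Group the sample labels by the dataset they would land in.
layout: dict[str, list[str]] = {}
for s in range(1, N_SAMPLES + 1):
    layout.setdefault(dataset_for(s), []).append(f"sample-{s:04d}")

for dataset, samples in layout.items():
    print(dataset, samples[0], "to", samples[-1], f"({len(samples)} samples)")
```

Both datasets keep the same subject label (sub-01), so a tool reading either one alone has no way of knowing that the other half exists. That is the gap the "link" question below is about.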
And in a case like this, is there a way to "link" the 2 datasets together, in dataset_description.json for example?
Thank you!