
How to split large datasets #935

Open
mariehbourget opened this issue Nov 19, 2021 · 5 comments
Labels
question Further information is requested

Comments

@mariehbourget
Collaborator

While working on the Microscopy BEP, it was brought to our attention that some very large microscopy datasets sometimes need to be split across different folders, for example because of limitations or performance issues caused by large files or a large number of files in a single repository.

I was wondering whether this issue has come up in BIDS before, and whether there is an official mechanism for handling such situations?

Here is an example to illustrate my thoughts.
In this example, one subject (sub-01) has 2000 samples (sample-0001 to sample-2000), and each sample has 20 chunks (chunk-01 to chunk-20), as illustrated below:

dataset
└── sub-01
    └── microscopy
        ├── sub-01_sample-0001_chunk-01_BF.tif
        ├── sub-01_sample-0001_chunk-02_BF.tif
        ├── ...
        ├── sub-01_sample-0001_chunk-20_BF.tif
        ├── ...
        ├── sub-01_sample-2000_chunk-01_BF.tif
        ├── sub-01_sample-2000_chunk-02_BF.tif
        ├── ...
        └── sub-01_sample-2000_chunk-20_BF.tif

Let’s say the dataset needs to be split in two. I would suggest putting the first 1000 samples in one dataset (dataset-01) and samples 1001 to 2000 in another (dataset-02), as follows:

dataset-01
└── sub-01
    └── microscopy
        ├── sub-01_sample-0001_chunk-01_BF.tif
        ├── sub-01_sample-0001_chunk-02_BF.tif
        ├── ...
        ├── sub-01_sample-0001_chunk-20_BF.tif
        ├── ...
        ├── sub-01_sample-1000_chunk-01_BF.tif
        ├── sub-01_sample-1000_chunk-02_BF.tif
        ├── ...
        └── sub-01_sample-1000_chunk-20_BF.tif

dataset-02
└── sub-01
    └── microscopy
        ├── sub-01_sample-1001_chunk-01_BF.tif
        ├── sub-01_sample-1001_chunk-02_BF.tif
        ├── ...
        ├── sub-01_sample-1001_chunk-20_BF.tif
        ├── ...
        ├── sub-01_sample-2000_chunk-01_BF.tif
        ├── sub-01_sample-2000_chunk-02_BF.tif
        ├── ...
        └── sub-01_sample-2000_chunk-20_BF.tif
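A split like this could be scripted. Here is a minimal sketch (the `split_dataset` helper is hypothetical, and it assumes the filename pattern and 1000-sample cutoff from the example above):

```python
# Hypothetical sketch: partition a flat microscopy folder into two datasets
# by sample index (samples 1-1000 -> dataset-01, samples 1001+ -> dataset-02).
import re
import shutil
from pathlib import Path

SAMPLE_RE = re.compile(r"sample-(\d+)")

def split_dataset(src: Path, dst1: Path, dst2: Path, cutoff: int = 1000) -> None:
    """Move each .tif under src into dst1 or dst2 based on its sample index,
    preserving the sub-*/microscopy/ hierarchy relative to src."""
    for tif in sorted(src.rglob("*.tif")):  # materialize the list before moving
        match = SAMPLE_RE.search(tif.name)
        if match is None:
            continue  # skip files without a sample entity
        target_root = dst1 if int(match.group(1)) <= cutoff else dst2
        target = target_root / tif.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(tif), target)
```

Top-level files such as dataset_description.json would still need to be copied into both halves by hand (or by an extra step).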

Would that splitting method make sense with BIDS?
And in a case like this, is there a way to "link" the two datasets together, in dataset_description.json for example?
Thank you!

@Remi-Gau
Collaborator

Not sure if this will help, but this reminds me of BEP035 (MEGA), which tries to deal with datasets of datasets: https://docs.google.com/document/d/10RpFFsG_ESj0orGIqJ0yK4DTpeoyrmd4RjX2ER-P31k/edit?usp=sharing

To my knowledge, there is no dedicated IsAssociatedWith field (or something in the same spirit) in the dataset description.

@sappelhoff
Member

> some very large microscopy datasets sometimes need to be split across different folders, for example because of limitations or performance issues caused by large files or a large number of files in a single repository.

Could you provide more detail on the situations in which these performance issues arise? At what level do the difficulties occur? Only with specific software, or on certain operating systems?

@mariehbourget
Collaborator Author

> Not sure if this will help, but this reminds me of BEP035 (MEGA), which tries to deal with datasets of datasets: https://docs.google.com/document/d/10RpFFsG_ESj0orGIqJ0yK4DTpeoyrmd4RjX2ER-P31k/edit?usp=sharing
>
> To my knowledge, there is no dedicated IsAssociatedWith field (or something in the same spirit) in the dataset description.

@Remi-Gau, very interesting BEP035, thanks! I think our question here is a lot simpler, though. In BEP035, they need to track "new" information about the different studies included in their dataset. In our case, the only additional information would be the relationship between the datasets that were split. A dedicated field in dataset_description.json could help, but it may not be absolutely necessary; the information could simply be included in the README file.

> Could you provide more detail on the situations in which these performance issues arise? At what level do the difficulties occur? Only with specific software, or on certain operating systems?

@sappelhoff, I'm afraid I don't have many practical details about this. In one case, it was for version tracking with DataLad, with tens of thousands of images in a single dataset. So the question is more: when researchers need to split a dataset across folders for efficiency reasons in their workflow, should we provide a mechanism to keep track of this?

In any case, this is not a problem for the microscopy extension (hence the separate issue) but I wanted to share the question here to open the discussion.
Thanks!

@effigies
Collaborator

Assuming you aren't running out of inodes and just need an extra level of hierarchy for logistical purposes, I might go with session. It's intended to be very flexible, though "I ran out of dirents" is rather stretching the original intent. Still, here would be one approach:

dataset/
  sub-01/
    ses-shard1/
      micr/
        sub-01_ses-shard1_sample-0001_chunk-01_BF.tif
        ...
    ses-shard2/
      micr/
        sub-01_ses-shard2_sample-1001_chunk-01_BF.tif
        ...

If you already had sessions 1 and 2, you could do something like:

dataset/
  sub-01/
    ses-1a/
    ses-1b/
    ses-2a/
    ses-2b/

In either case, you could make each session directory a DataLad subdataset, to avoid Git limits. Now you have a coherent view of the data, although a data host would need to be prepared to accept hierarchical datasets of this sort.

That said, a reader/tool would need to know to treat groups of sessions as a unit. We don't really have a concept that is understood to mean "you can treat these two directories as if they were combined". If that is something that's generally useful, I might propose the entity shard, which could be used to split a dataset at pretty much any level, such as:

  1. Across subjects:
dataset/
  shard-1/
    sub-001/
      micr/
        sub-001_sample-0001_chunk-01_BF.tif
        ...
  shard-2/
    sub-501/
      micr/
        sub-501_sample-1001_chunk-01_BF.tif
        ...
  2. Within subjects:
dataset/
  sub-01/
    shard-1/
      micr/
        sub-01_sample-0001_chunk-01_BF.tif
        ...
    shard-2/
      micr/
        sub-01_sample-1001_chunk-01_BF.tif
        ...
  sub-02/
    shard-1/
      micr/
        sub-02_sample-0001_chunk-01_BF.tif
        ...
    shard-2/
      micr/
        sub-02_sample-1001_chunk-01_BF.tif
        ...

You could imagine doing this within micr/ even, if you don't want to shard anat/, for instance.
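As a rough sketch of the "reader treats shards as a unit" idea: a tool could merge the files from all shard-* directories of a subject into one sorted listing. Both the shard entity and this helper are hypothetical, not existing BIDS tooling:

```python
# Hypothetical helper: present a subject's shard-* directories as one unit
# by merging their files into a single listing sorted by filename.
from pathlib import Path

def subject_files(dataset: Path, subject: str, pattern: str = "*.tif"):
    """Return all files matching pattern across the subject's shard-* folders,
    sorted by name so the combined view reads like one flat directory."""
    files = []
    for shard in sorted((dataset / subject).glob("shard-*")):
        files.extend(shard.rglob(pattern))
    return sorted(files, key=lambda p: p.name)
```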


That said, being able to associate multiple datasets with relations seems like a good idea. This seems like it maps onto the idea of Continues/isContinuedBy or perhaps hasPart/isPartOf.

@mariehbourget
Collaborator Author

@effigies, thanks for the feedback!

I like the session approach; although it stretches the intent, it could work quite well for these particular cases without adding a lot of complexity. And the sessions' relationships could be indicated in sessions.tsv.
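For instance, a sub-01_sessions.tsv for the sharded-sessions layout might record which samples each session holds (session_id is the only required column; the sample_range column here is purely hypothetical):

```
session_id	sample_range
ses-shard1	sample-0001 to sample-1000
ses-shard2	sample-1001 to sample-2000
```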

shard could work too. If I understand correctly, the session solution is the same as a shard entity within subjects. It seems to me that shard may add quite a bit of complexity to the BIDS schema, though, especially if it can be applied at any level. I may be wrong, but I thought that having a tool/reader combine sessions or even datasets would be easier than implementing a "changing" folder structure.

I like Continues/isContinuedBy as it is kind of what I was looking for in my example (multiple datasets, no subdatasets). But hasPart/isPartOf would be more suitable with the session/shard example (1 main dataset with subdatasets), provided that HasPart can list multiple subdatasets.
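As a sketch only: if such relation fields existed in dataset_description.json (they are not part of the current BIDS specification; the field names below simply mirror the DataCite relation types discussed above), the first split dataset might declare something like:

```json
{
  "Name": "Example microscopy dataset, part 1 (samples 0001-1000)",
  "BIDSVersion": "1.6.0",
  "IsContinuedBy": "dataset-02 (hypothetical identifier or URI)"
}
```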
