
How to split large datasets #935

Open
mariehbourget opened this issue Nov 19, 2021 · 5 comments
Labels
question Further information is requested

Comments

@mariehbourget
Collaborator

While working on the Microscopy BEP, it was brought to our attention that some very large microscopy datasets sometimes need to be split across different folders, for example because of limitations or performance issues caused by large files or a large number of files in a single repository.

I was wondering whether this issue has come up in BIDS before, and whether there is an official mechanism for handling such situations?

Here is an example to illustrate my thoughts.
In this example, one subject (sub-01) has 2000 samples (sample-0001 to sample-2000), and each sample has 20 chunks (chunk-01 to chunk-20), as illustrated below:

dataset
└── sub-01
    └── microscopy
        ├── sub-01_sample-0001_chunk-01_BF.tif
        ├── sub-01_sample-0001_chunk-02_BF.tif
        ├── ...
        ├── sub-01_sample-0001_chunk-20_BF.tif
        ├── ...
        ├── sub-01_sample-2000_chunk-01_BF.tif
        ├── sub-01_sample-2000_chunk-02_BF.tif
        ├── ...
        └── sub-01_sample-2000_chunk-20_BF.tif

Let’s say the dataset needs to be split in two. I would suggest putting the first 1000 samples in one dataset (dataset-01) and samples 1001 to 2000 in another (dataset-02), as follows:

dataset-01
└── sub-01
    └── microscopy
        ├── sub-01_sample-0001_chunk-01_BF.tif
        ├── sub-01_sample-0001_chunk-02_BF.tif
        ├── ...
        ├── sub-01_sample-0001_chunk-20_BF.tif
        ├── ...
        ├── sub-01_sample-1000_chunk-01_BF.tif
        ├── sub-01_sample-1000_chunk-02_BF.tif
        ├── ...
        └── sub-01_sample-1000_chunk-20_BF.tif

dataset-02
└── sub-01
    └── microscopy
        ├── sub-01_sample-1001_chunk-01_BF.tif
        ├── sub-01_sample-1001_chunk-02_BF.tif
        ├── ...
        ├── sub-01_sample-1001_chunk-20_BF.tif
        ├── ...
        ├── sub-01_sample-2000_chunk-01_BF.tif
        ├── sub-01_sample-2000_chunk-02_BF.tif
        ├── ...
        └── sub-01_sample-2000_chunk-20_BF.tif
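A split like this could be scripted. Here is a minimal sketch (the `split_dataset` helper is hypothetical, and it assumes the filename pattern and 1000-sample cutoff from the example above):

```python
# Hypothetical sketch: partition a flat microscopy folder into two datasets
# by sample index (samples 1-1000 -> dataset-01, samples 1001+ -> dataset-02).
import re
import shutil
from pathlib import Path

SAMPLE_RE = re.compile(r"sample-(\d+)")

def split_dataset(src: Path, dst1: Path, dst2: Path, cutoff: int = 1000) -> None:
    """Move each .tif under src into dst1 or dst2 based on its sample index,
    preserving the sub-*/microscopy/ hierarchy relative to src."""
    for tif in sorted(src.rglob("*.tif")):  # materialize the list before moving
        match = SAMPLE_RE.search(tif.name)
        if match is None:
            continue  # skip files without a sample entity
        target_root = dst1 if int(match.group(1)) <= cutoff else dst2
        target = target_root / tif.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(tif), target)
```

Top-level files such as dataset_description.json would still need to be copied into both halves by hand (or by an extra step).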

Would that splitting method make sense with BIDS?
And in a case like this, is there a way to "link" the two datasets together, in dataset_description.json for example?
Thank you!

@Remi-Gau
Collaborator

Not sure if this will help, but this reminds me of BEP035 (MEGA), which tries to deal with datasets of datasets: https://docs.google.com/document/d/10RpFFsG_ESj0orGIqJ0yK4DTpeoyrmd4RjX2ER-P31k/edit?usp=sharing

To my knowledge, there is no dedicated IsAssociatedWith field (or something in the same spirit) in the dataset description.

@sappelhoff
Member

> some very large microscopy datasets sometimes need to be split across different folders, for example because of limitations or performance issues caused by large files or a large number of files in a single repository.

Could you provide more detail on the situations in which these performance issues arise? At what level do the difficulties occur? Only with specific software, or on certain operating systems?

@mariehbourget
Collaborator Author

> Not sure if this will help, but this reminds me of BEP035 (MEGA), which tries to deal with datasets of datasets: https://docs.google.com/document/d/10RpFFsG_ESj0orGIqJ0yK4DTpeoyrmd4RjX2ER-P31k/edit?usp=sharing
>
> To my knowledge, there is no dedicated IsAssociatedWith field (or something in the same spirit) in the dataset description.

@Remi-Gau, very interesting BEP035, thanks! I think our question here is a lot simpler, though. In BEP035, they need to track "new" information about the different studies included in their dataset. In our case, the only additional information would be the relationship between the datasets that were split. A dedicated field in dataset_description.json could help, but it may not be absolutely necessary; the information could simply be included in the README file.

> Could you provide more detail on the situations in which these performance issues arise? At what level do the difficulties occur? Only with specific software, or on certain operating systems?

@sappelhoff, I'm afraid I don't have many practical details about this. In one case, it was for version tracking with DataLad, with tens of thousands of images in a single dataset. So the question is more: when researchers need to split a dataset across folders for efficiency reasons in their workflow, should we provide a mechanism to keep track of this?

In any case, this is not a problem for the microscopy extension (hence the separate issue) but I wanted to share the question here to open the discussion.
Thanks!

@effigies
Collaborator

Assuming you aren't running out of inodes and just need an extra level of hierarchy for logistical purposes, I might go with session. It's intended to be very flexible, though "I ran out of dirents" is rather stretching the original intent. Still, here would be one approach:

dataset/
  sub-01/
    ses-shard1/
      micr/
        sub-01_ses-shard1_sample-0001_chunk-01_BF.tif
        ...
    ses-shard2/
      micr/
        sub-01_ses-shard2_sample-1001_chunk-01_BF.tif
        ...

If you already had sessions 1 and 2, you could do something like:

dataset/
  sub-01/
    ses-1a/
    ses-1b/
    ses-2a/
    ses-2b/

In either case, you could make each session directory a DataLad subdataset, to avoid Git limits. Now you have a coherent view of the data, although a data host would need to be prepared to accept hierarchical datasets of this sort.

That said, a reader/tool would need to know to treat groups of sessions as a unit. We don't really have a concept that is understood to mean "you can treat these two directories as if they were combined". If that is something that's generally useful, I might propose the entity shard, which could be used to split a dataset at pretty much any level, such as:

  1. Across subjects:
dataset/
  shard-1/
    sub-001/
      micr/
        sub-001_sample-0001_chunk-01_BF.tif
        ...
  shard-2/
    sub-501/
      micr/
        sub-501_sample-1001_chunk-01_BF.tif
        ...
  2. Within subjects:
dataset/
  sub-01/
    shard-1/
      micr/
        sub-01_sample-0001_chunk-01_BF.tif
        ...
    shard-2/
      micr/
        sub-01_sample-1001_chunk-01_BF.tif
        ...
  sub-02/
    shard-1/
      micr/
        sub-02_sample-0001_chunk-01_BF.tif
        ...
    shard-2/
      micr/
        sub-02_sample-1001_chunk-01_BF.tif
        ...

You could imagine doing this within micr/ even, if you don't want to shard anat/, for instance.
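As a rough sketch of the "reader treats shards as a unit" idea: a tool could merge the files from all shard-* directories of a subject into one sorted listing. Both the shard entity and this helper are hypothetical, not existing BIDS tooling:

```python
# Hypothetical helper: present a subject's shard-* directories as one unit
# by merging their files into a single listing sorted by filename.
from pathlib import Path

def subject_files(dataset: Path, subject: str, pattern: str = "*.tif"):
    """Return all files matching pattern across the subject's shard-* folders,
    sorted by name so the combined view reads like one flat directory."""
    files = []
    for shard in sorted((dataset / subject).glob("shard-*")):
        files.extend(shard.rglob(pattern))
    return sorted(files, key=lambda p: p.name)
```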


That said, being able to associate multiple datasets with relations seems like a good idea. This seems like it maps onto the idea of Continues/isContinuedBy or perhaps hasPart/isPartOf.

@mariehbourget
Collaborator Author

@effigies, thanks for the feedback!

I like the session approach; although it stretches the intent, it could work quite well for these particular cases without adding a lot of complexity. And the sessions' relationships could be indicated in sessions.tsv.
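For instance, a sub-01_sessions.tsv for the sharded-sessions layout might record which samples each session holds (session_id is the only required column; the sample_range column here is purely hypothetical):

```
session_id	sample_range
ses-shard1	sample-0001 to sample-1000
ses-shard2	sample-1001 to sample-2000
```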

shard could work too. If I understand correctly, the session solution is the same as a shard entity within subjects. It seems to me that shard may add quite a bit of complexity to the BIDS schema, though, especially if it can be applied at any level. I may be wrong, but I thought that having a tool/reader combine sessions or even datasets would be easier than implementing a "changing" folder structure.

I like Continues/isContinuedBy as it is kind of what I was looking for in my example (multiple datasets, no subdatasets). But hasPart/isPartOf would be more suitable with the session/shard example (1 main dataset with subdatasets), provided that HasPart can list multiple subdatasets.
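As a sketch only: if such relation fields existed in dataset_description.json (they are not part of the current BIDS specification; the field names below simply mirror the DataCite relation types discussed above), the first split dataset might declare something like:

```json
{
  "Name": "Example microscopy dataset, part 1 (samples 0001-1000)",
  "BIDSVersion": "1.6.0",
  "IsContinuedBy": "dataset-02 (hypothetical identifier or URI)"
}
```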
