[BUG] lack of recommendations for datafiles that differ only by extension #1487

Open
Remi-Gau opened this issue May 4, 2023 · 5 comments · May be fixed by #1492
Labels
bug Something isn't working

Comments

@Remi-Gau
Collaborator

Remi-Gau commented May 4, 2023

Describe your problem in detail.

Note that this issue may apply to more datatypes in BIDS, but I have not checked systematically.

As far as I can tell, the specification does not mention that files cannot differ only by their extension.

For example, modifying the micr_SEM BIDS example so that the same data appears twice, differing only by extension:

/home/remi/github/bids/examples/micr_SEM
├── dataset_description.json
├── participants.json
├── participants.tsv
├── README
├── samples.json
├── samples.tsv
└── sub-01
    ├── ses-01
    │   └── micr
    │       ├── sub-01_ses-01_sample-A_photo.jpg  < -- data: file 1
    │       ├── sub-01_ses-01_sample-A_photo.json
    │       ├── sub-01_ses-01_sample-A_photo.tif  < -- data: file 2
    │       ├── sub-01_ses-01_sample-A_SEM.json
    │       └── sub-01_ses-01_sample-A_SEM.png
    ├── ses-02
    ├── sub-01_sessions.json
    └── sub-01_sessions.tsv

From my current reading of the spec, this could be valid.

The BIDS validator does not complain about this either, except to say that not all subjects have the same number of files.

I have mostly checked with picture files *_photo.* (eeg, meg, micr), but it also seems to be the case for eeg files:

bids/examples/eeg_ds000117/sub-01/eeg
├── sub-01_coordsystem.json
├── sub-01_electrodes.tsv
├── sub-01_task-facerecognition_run-1_eeg.eeg <--- duplicate data file with different extension
├── sub-01_task-facerecognition_run-1_eeg.fdt
├── sub-01_task-facerecognition_run-1_eeg.set
├── sub-01_task-facerecognition_run-1_events.tsv
...

Am I missing something, or should this type of potential data duplication be disallowed?

Describe what you expected.

I would expect an error, as in the case of .nii and .nii.gz, where the validator throws this error:

[ERR] NIfTI file exist with both '.nii' and '.nii.gz' extensions. (code: 74 - DUPLICATE_NIFTI_FILES)
                ./sub-Sub103/perf/sub-Sub103_asl.nii
                ./sub-Sub103/perf/sub-Sub103_asl.nii.gz
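
A minimal sketch of what a generalized version of this check could look like, in Python. This is not the validator's implementation: the function names and the `SIDECAR_EXTENSIONS` set are assumptions made for illustration. It treats everything after the first period as the extension and reports any filename stem that has more than one non-sidecar file. Note that such a naive check would also flag multi-file formats (for example .vhdr, .vmrk and .eeg), which would need separate handling.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: extensions treated as sidecars that may accompany a data file.
SIDECAR_EXTENSIONS = {".json", ".tsv"}


def split_extension(path):
    """Split a path's filename at its first period: 'a_asl.nii.gz' -> ('a_asl', '.nii.gz')."""
    name = PurePosixPath(path).name
    stem, dot, ext = name.partition(".")
    return stem, dot + ext


def find_extension_duplicates(paths):
    """Map 'directory/stem' to its sorted data extensions when there is more than one."""
    groups = defaultdict(set)
    for path in paths:
        stem, ext = split_extension(path)
        if ext and ext not in SIDECAR_EXTENSIONS:
            key = str(PurePosixPath(path).parent / stem)
            groups[key].add(ext)
    return {key: sorted(exts) for key, exts in groups.items() if len(exts) > 1}


if __name__ == "__main__":
    micr = "sub-01/ses-01/micr/"
    files = [
        micr + "sub-01_ses-01_sample-A_photo.jpg",
        micr + "sub-01_ses-01_sample-A_photo.json",
        micr + "sub-01_ses-01_sample-A_photo.tif",
        micr + "sub-01_ses-01_sample-A_SEM.json",
        micr + "sub-01_ses-01_sample-A_SEM.png",
    ]
    for stem, extensions in find_extension_duplicates(files).items():
        print(stem, extensions)
```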

BIDS specification section

No response

@Remi-Gau Remi-Gau added the bug Something isn't working label May 4, 2023
@Remi-Gau
Collaborator Author

Remi-Gau commented May 4, 2023

If this type of data duplication is to be disallowed, it may be a good thing to:

  • mention this in the spec: in the part where extensions are defined? Somewhere else?
  • improve the way files that allow several extensions are rendered by the filename pattern macros:

For example, the following rendering may suggest that all 3 files can co-exist in the same dataset

https://bids-specification.readthedocs.io/en/latest/modality-specific-files/electroencephalography.html#landmark-photos-_photoextension

Screenshot from 2023-05-04 21-48-08

Maybe it would be better to have something like:

sub-<label>[_ses-<label>][_acq-<label>]_photo.[tif|png|jpg]

@effigies
Collaborator

Okay, here's a proposal:

photo:
  suffixes:
    - photo
  extensions:
    - [.jpg, .png, .tif]
  datatypes:
    - eeg
    - ieeg
    - meg
    - nirs
  entities:
    subject: required
    session: optional
    acquisition: optional

photo__micr:
  $ref: rules.files.raw.photo.photo
  extensions:
    - [.jpg, .png, .tif]
    - .json
  datatypes:
    - micr
  entities:
    $ref: rules.files.raw.photo.photo.entities
    sample: required

Here, the extensions that are in a list together are "the same kind" and so mutually exclusive and distinguishable from supplementary entries, such as .json.

For NIfTI, we would do - [.nii, .nii.gz].
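
To illustrate how a checker might read such a rule, here is a hedged Python sketch. It assumes the `extensions` value has been loaded into a plain list in which a nested list is a mutually exclusive group and a bare string is a supplementary entry. The function name `check_exclusive_extensions` is hypothetical and not part of the BIDS schema tooling.

```python
def check_exclusive_extensions(extension_rule, found_extensions):
    """Return, for each nested group, the extensions present when more than one is found."""
    found = set(found_extensions)
    conflicts = []
    for entry in extension_rule:
        # A nested list marks extensions of "the same kind"; bare strings
        # (such as ".json") are supplementary and never conflict.
        if isinstance(entry, list):
            present = [ext for ext in entry if ext in found]
            if len(present) > 1:
                conflicts.append(present)
    return conflicts


if __name__ == "__main__":
    photo_micr_extensions = [[".jpg", ".png", ".tif"], ".json"]
    # The modified micr_SEM example: .jpg and .tif with one .json sidecar.
    print(check_exclusive_extensions(photo_micr_extensions, [".jpg", ".json", ".tif"]))
```

With this reading, a .jpg plus a .json passes, while a .jpg plus a .tif is reported as a conflict within one group.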

@Remi-Gau
Collaborator Author

BUT...

For EEG:

  • those work as a triplet that goes together: .vhdr, .vmrk, .eeg
  • and a .set file with an OPTIONAL .fdt

eeg:
  suffixes:
    - eeg
  extensions:
    - .json
    - .edf
    - .vhdr
    - .vmrk
    - .eeg
    - .set
    - .fdt
    - .bdf
  datatypes:
    - eeg
  entities:
    subject: required
    session: optional
    task: required
    acquisition: optional
    run: optional

@effigies
Collaborator

I think we could do something like:

extensions:
  - .json
  - [ .edf, .eeg, .set, .bdf ]
  - .vhdr
  - .vmrk
  - .fdt

And then just use a couple of checks to say that if any of .eeg, .vhdr or .vmrk exist, then they all exist. And if .fdt exists, then .set exists.
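
The two checks could be sketched roughly as follows in Python. This is only an illustration under the assumption that the extensions found for one filename stem are passed in as a collection of strings. `check_companions` and its messages are hypothetical names, not validator code.

```python
def check_companions(extensions):
    """Return a list of problems for the extensions found under one filename stem."""
    found = set(extensions)
    problems = []

    # If any of .eeg, .vhdr or .vmrk exist, then they must all exist.
    brainvision = {".vhdr", ".vmrk", ".eeg"}
    present = found & brainvision
    if present and present != brainvision:
        missing = sorted(brainvision - present)
        problems.append("BrainVision files incomplete, missing: " + ", ".join(missing))

    # If .fdt exists, then .set must exist (the reverse is not required).
    if ".fdt" in found and ".set" not in found:
        problems.append(".fdt found without its .set file")

    return problems


if __name__ == "__main__":
    # The eeg_ds000117 example above: a stray .eeg next to a .set/.fdt pair.
    print(check_companions([".eeg", ".fdt", ".set"]))
```

On the modified eeg_ds000117 example, the stray .eeg file is reported because its .vhdr and .vmrk companions are absent.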

@sappelhoff
Member

👍

and for:

  • eeg, vhdr, vmrk --> vhdr SHOULD be listed in scans.tsv
  • set, fdt --> set SHOULD be listed in scans.tsv

For file formats that are based on several files of different extensions, or a directory of files with different extensions (multi-file file formats), only that file SHOULD be listed that would also be passed to analysis software for reading the data. For example for BrainVision data (.vhdr, .vmrk, .eeg), only the .vhdr SHOULD be listed; for EEGLAB data (.set, .fdt), only the .set file SHOULD be listed; and for CTF data (.ds), the whole .ds directory SHOULD be listed, and not the individual files in that directory.

(see: https://bids-specification.readthedocs.io/en/latest/modality-agnostic-files.html#scans-file)
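
As a rough illustration of that recommendation, the following Python sketch picks the entries that would be listed in scans.tsv from a list of data file paths. The function name, the `COMPANION_EXTENSIONS` set and the file names inside the .ds directory are assumptions made for the example. Sidecar files are not handled here.

```python
from pathlib import PurePosixPath

# Assumption: files read indirectly through a .vhdr or .set file.
COMPANION_EXTENSIONS = {".vmrk", ".eeg", ".fdt"}


def scans_entries(paths):
    """Return the paths that SHOULD be listed, in first-seen order."""
    entries = []
    for path in paths:
        p = PurePosixPath(path)
        # Collapse anything inside a CTF .ds directory to the directory itself.
        ds_dirs = [parent for parent in p.parents if parent.suffix == ".ds"]
        if ds_dirs:
            p = ds_dirs[-1]
        elif p.suffix in COMPANION_EXTENSIONS:
            continue
        entry = str(p)
        if entry not in entries:
            entries.append(entry)
    return entries


if __name__ == "__main__":
    files = [
        "eeg/sub-01_task-a_eeg.vhdr",
        "eeg/sub-01_task-a_eeg.vmrk",
        "eeg/sub-01_task-a_eeg.eeg",
        "eeg/sub-02_task-a_eeg.set",
        "eeg/sub-02_task-a_eeg.fdt",
        "meg/sub-03_task-a_meg.ds/hypothetical.res4",
        "meg/sub-03_task-a_meg.ds/hypothetical.meg4",
    ]
    for entry in scans_entries(files):
        print(entry)
```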

3 participants