Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EEG] extract archive script #970

Merged
merged 2 commits into from
Jun 19, 2023

Conversation

laemtl
Copy link
Contributor

@laemtl laemtl commented May 4, 2023

This PR requires the mysql patch in LORIS #8685

Test cases:

  • With/without upload_id(electrophysiology_uploader): extract_eeg_bids_archive.py -p <profile> -u <upload_id>
  • With a S3/regular path for EEGUploadIncomingPath
  • With/without a value for EEGS3DataPath

@laemtl laemtl changed the base branch from main to 24.1-release May 4, 2023 14:43
@laemtl laemtl changed the title Eeg decompressing script 24 [EEG] extract archive script May 4, 2023
@laemtl laemtl force-pushed the eeg-decompressing-script-24 branch from 4f547c1 to dc5855c Compare May 4, 2023 15:00
Copy link
Collaborator

@cmadjar cmadjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laemtl see my initial review.

A few general comments (in addition to the ones in line with the code changes):

  • path should be assembly_bids not bids_assembly (if you want to copy the data with the MRI BIDS structure)
  • I think the notifications will be show under the imaging uploader and not the EEG uploader (see comment in the write_to_notification_table function of Imaging IO)
  • what happens if people do not have the Config EEGS3DataPath? Maybe you could add an if statement early in the script to check if the config exists, if not, exit with error?

I did not test it yet, wanted to do the review first. Let me know your thoughts.

python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/extract_eeg_bids_archive.py Show resolved Hide resolved
python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/extract_eeg_bids_archive.py Outdated Show resolved Hide resolved
python/lib/imaging_io.py Show resolved Hide resolved
@cmadjar cmadjar added this to the 24.1.11 milestone May 4, 2023
@laemtl laemtl force-pushed the eeg-decompressing-script-24 branch from 94ec764 to cab2357 Compare May 4, 2023 21:15
@laemtl
Copy link
Contributor Author

laemtl commented May 5, 2023

what happens if people do not have the Config EEGS3DataPath? Maybe you could add an if statement early in the script to check if the config exists, if not, exit with error?

The script checks if the config exists and is set otherwise {dataDirBasepath}/assembly_bids is used.

@cmadjar
Copy link
Collaborator

cmadjar commented May 5, 2023

@laemtl I encountered a few issues when testing. See below:

  • When I try running the pipeline with the file on the S3 bucket, I get the following error:
(loris-mri-py39) lorisadmin@cecile-dev:/opt/loris/bin/mri/python$ python extract_eeg_bids_archive.py -p database_config.py -u 120
Downloading s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz from https://s3.msi.umn.edu/loris_demo_test_cmadjar to /data/tmp/extract_eeg_bids_archive_2023-05-05_12h50m52s_6w8_g9w6/PIUMN0013_967379_V03_bids.tar.gz

[ERROR   ] s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz could not be downloaded from S3 bucket. Error was
s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz download failure = An error occurred (404) when calling the HeadObject operation: Not Found

However, the file exists in my s3:

(loris-mri-py39) lorisadmin@cecile-dev:/opt/loris/bin/mri/python$ s3cmd -c ~/s3config_files/msi_cecile_sandbox_s3.s3cfg ls s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz
2023-05-05 12:45 725.6496305465698M  s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz

Could it be because the default bucket is pointing to the S3 bucket with the assembly_bids files? Do I need to add a config in my database_config.py file?

  • Status 'Extracted' does not exist in my electrophysiology_uploader table.**

Is there a patch somewhere to add the enum? Also, in general, we try to avoid enum in LORIS so we are not depending on doing a major or minor release for a data change. Instead, it might be best to create a separate table with the list of status and use the ID of the statuses in electrophysiology_uploader. Then later on, if you want to add a new status, this is just an INSERT statement instead of an ALTER TABLE.

Executing query:
	UPDATE electrophysiology_uploader SET Status = 'Extracted' WHERE UploadLocation = %s
With arguments:
	('PIUMN0013_967379_V03_bids.tar.gz',)

Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/lib/database.py", line 213, in update
    cursor.execute(query, args)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
MySQLdb.DataError: (1265, "Data truncated for column 'Status' at row 120")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/extract_eeg_bids_archive.py", line 217, in <module>
    main()
  File "/opt/loris/bin/mri/python/extract_eeg_bids_archive.py", line 210, in main
    db.update(
  File "/opt/loris/bin/mri/python/lib/database.py", line 216, in update
    raise Exception("Update query failure: " + format(err))
Exception: Update query failure: (1265, "Data truncated for column 'Status' at row 120")
  • Extract archive using PSCID and not CandID

Once I modified the ENUM status column, I rerun the script and the extraction happens, however, it extracts it with the PSCID instead of the CandID which will be an issue for HBCD as all data are under CandID. Is there a config to use the CandID when extracting instead of the PSCID? If so, need to document that for deployment

  • No error thrown when use a bad URL for EEG S3 data path (wrong bucket name)

The script still prints:

Uploading s3://loris_demo_f/sub-PIUMN0013/ses-V03/eeg/sub-PIUMN0013_ses-V03_task-MMN_acq-eeg_eeg.set to https://s3.msi.umn.edu/loris_demo_test_cmadjar
Uploading s3://loris_demo_f/sub-PIUMN0013/ses-V03/eeg/sub-PIUMN0013_ses-V03_task-MMN_acq-eeg_channels.tsv to https://s3.msi.umn.edu/loris_demo_test_cmadjar

However, nothing get uploaded since s3://loris_demo_f/ does not exist.

Also, seems like the message Uploading <wrong bucket name> to <default bucket name> is misleading since in theory, it uploads the content of the tmp directory where the data was extracted into the S3 in this case.

  • Need to change permission of the extract script to be executable

@cmadjar
Copy link
Collaborator

cmadjar commented May 5, 2023

FYI: list of tests performed:

  • run extract_eeg_bids_archive.py with UploadID option
  • run extract_eeg_bids_archive.py without UploadID option
  • run extract_eeg_bids_archive.py with regular path in EEGUploadIncomingPath
  • run extract_eeg_bids_archive.py with S3 path in EEGUploadIncomingPath - FAILURE, see above point 1
  • run extract_eeg_bids_archive.py without Config EEGS3DataPath
  • run extract_eeg_bids_archive.py with Config set to an existing S3 URL EEGS3DataPath
  • run extract_eeg_bids_archive.py with Config set to a bad S3 URL EEGS3DataPath - FAILURE, see point 4 above
  • run extract_eeg_bids_archive.py on an UploadID that was already extracted -> print archive already extracted
  • CAVEAT for HBCD, extract data with PSCID and not CandID, see above point 3

@cmadjar
Copy link
Collaborator

cmadjar commented May 5, 2023

@laemtl I just remembered about the permissions of the extract script. See point 5 in #970 (comment)

@laemtl
Copy link
Contributor Author

laemtl commented Jun 16, 2023

Once I modified the ENUM status column, I rerun the script and the extraction happens, however, it extracts it with the PSCID instead of the CandID which will be an issue for HBCD as all data are under CandID. Is there a config to use the CandID when extracting instead of the PSCID? If so, need to document that for deployment

Extracting using the CandID will be possible if the BIDS dataset is set to use the CandId as the BIDs subject_id. The dataset is probably not setup this way. To cover the conversion of subject-id for the previously uploaded dataset, another script can be created.

@laemtl
Copy link
Contributor Author

laemtl commented Jun 19, 2023

Is there a patch somewhere to add the enum? Also, in general, we try to avoid enum in LORIS so we are not depending on doing a major or minor release for a data change. Instead, it might be best to create a separate table with the list of status and use the ID of the statuses in electrophysiology_uploader. Then later on, if you want to add a new status, this is just an INSERT statement instead of an ALTER TABLE.

Yes, here is the link to the PR for the patch: https://github.com/aces/HBCD/pull/895
I will keep this at an enum for now since it makes it easier to handle DEFAULT values. We can rediscuss later about that :)

@laemtl laemtl force-pushed the eeg-decompressing-script-24 branch from cab2357 to 6fc4c4f Compare June 19, 2023 19:00
@laemtl laemtl merged commit fab7d1b into aces:24.1-release Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants