[EEG] extract archive script #970

laemtl · 2023-05-04T14:42:15Z

This PR requires the mysql patch in LORIS #8685

Test cases:

With/without upload_id(electrophysiology_uploader): extract_eeg_bids_archive.py -p <profile> -u <upload_id>
With a S3/regular path for EEGUploadIncomingPath
With/without a value for EEGS3DataPath

cmadjar

@laemtl see my initial review.

A few general comments (in addition to the ones in line with the code changes):

path should be assembly_bids not bids_assembly (if you want to copy the data with the MRI BIDS structure)
I think the notifications will be show under the imaging uploader and not the EEG uploader (see comment in the write_to_notification_table function of Imaging IO)
what happens if people do not have the Config EEGS3DataPath? Maybe you could add an if statement early in the script to check if the config exists, if not, exit with error?

I did not test it yet, wanted to do the review first. Let me know your thoughts.

python/extract_eeg_bids_archive.py

python/lib/imaging_io.py

laemtl · 2023-05-05T03:25:55Z

what happens if people do not have the Config EEGS3DataPath? Maybe you could add an if statement early in the script to check if the config exists, if not, exit with error?

The script checks if the config exists and is set otherwise {dataDirBasepath}/assembly_bids is used.

cmadjar · 2023-05-05T13:13:06Z

@laemtl I encountered a few issues when testing. See below:

When I try running the pipeline with the file on the S3 bucket, I get the following error:

(loris-mri-py39) lorisadmin@cecile-dev:/opt/loris/bin/mri/python$ python extract_eeg_bids_archive.py -p database_config.py -u 120
Downloading s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz from https://s3.msi.umn.edu/loris_demo_test_cmadjar to /data/tmp/extract_eeg_bids_archive_2023-05-05_12h50m52s_6w8_g9w6/PIUMN0013_967379_V03_bids.tar.gz

[ERROR   ] s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz could not be downloaded from S3 bucket. Error was
s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz download failure = An error occurred (404) when calling the HeadObject operation: Not Found

However, the file exists in my s3:

(loris-mri-py39) lorisadmin@cecile-dev:/opt/loris/bin/mri/python$ s3cmd -c ~/s3config_files/msi_cecile_sandbox_s3.s3cfg ls s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz
2023-05-05 12:45 725.6496305465698M  s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz

Could it be because the default bucket is pointing to the S3 bucket with the assembly_bids files? Do I need to add a config in my database_config.py file?

Status 'Extracted' does not exist in my electrophysiology_uploader table.**

Is there a patch somewhere to add the enum? Also, in general, we try to avoid enum in LORIS so we are not depending on doing a major or minor release for a data change. Instead, it might be best to create a separate table with the list of status and use the ID of the statuses in electrophysiology_uploader. Then later on, if you want to add a new status, this is just an INSERT statement instead of an ALTER TABLE.

Executing query:
	UPDATE electrophysiology_uploader SET Status = 'Extracted' WHERE UploadLocation = %s
With arguments:
	('PIUMN0013_967379_V03_bids.tar.gz',)

Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/lib/database.py", line 213, in update
    cursor.execute(query, args)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
MySQLdb.DataError: (1265, "Data truncated for column 'Status' at row 120")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/extract_eeg_bids_archive.py", line 217, in <module>
    main()
  File "/opt/loris/bin/mri/python/extract_eeg_bids_archive.py", line 210, in main
    db.update(
  File "/opt/loris/bin/mri/python/lib/database.py", line 216, in update
    raise Exception("Update query failure: " + format(err))
Exception: Update query failure: (1265, "Data truncated for column 'Status' at row 120")

Extract archive using PSCID and not CandID

Once I modified the ENUM status column, I rerun the script and the extraction happens, however, it extracts it with the PSCID instead of the CandID which will be an issue for HBCD as all data are under CandID. Is there a config to use the CandID when extracting instead of the PSCID? If so, need to document that for deployment

No error thrown when use a bad URL for EEG S3 data path (wrong bucket name)

The script still prints:

Uploading s3://loris_demo_f/sub-PIUMN0013/ses-V03/eeg/sub-PIUMN0013_ses-V03_task-MMN_acq-eeg_eeg.set to https://s3.msi.umn.edu/loris_demo_test_cmadjar
Uploading s3://loris_demo_f/sub-PIUMN0013/ses-V03/eeg/sub-PIUMN0013_ses-V03_task-MMN_acq-eeg_channels.tsv to https://s3.msi.umn.edu/loris_demo_test_cmadjar

However, nothing get uploaded since s3://loris_demo_f/ does not exist.

Also, seems like the message Uploading <wrong bucket name> to <default bucket name> is misleading since in theory, it uploads the content of the tmp directory where the data was extracted into the S3 in this case.

Need to change permission of the extract script to be executable

cmadjar · 2023-05-05T13:23:11Z

FYI: list of tests performed:

run extract_eeg_bids_archive.py with UploadID option
run extract_eeg_bids_archive.py without UploadID option
run extract_eeg_bids_archive.py with regular path in EEGUploadIncomingPath
run extract_eeg_bids_archive.py with S3 path in EEGUploadIncomingPath - FAILURE, see above point 1
run extract_eeg_bids_archive.py without Config EEGS3DataPath
run extract_eeg_bids_archive.py with Config set to an existing S3 URL EEGS3DataPath
run extract_eeg_bids_archive.py with Config set to a bad S3 URL EEGS3DataPath - FAILURE, see point 4 above
run extract_eeg_bids_archive.py on an UploadID that was already extracted -> print archive already extracted
CAVEAT for HBCD, extract data with PSCID and not CandID, see above point 3

cmadjar · 2023-05-05T15:34:41Z

@laemtl I just remembered about the permissions of the extract script. See point 5 in #970 (comment)

laemtl · 2023-06-16T20:53:50Z

Once I modified the ENUM status column, I rerun the script and the extraction happens, however, it extracts it with the PSCID instead of the CandID which will be an issue for HBCD as all data are under CandID. Is there a config to use the CandID when extracting instead of the PSCID? If so, need to document that for deployment

Extracting using the CandID will be possible if the BIDS dataset is set to use the CandId as the BIDs subject_id. The dataset is probably not setup this way. To cover the conversion of subject-id for the previously uploaded dataset, another script can be created.

laemtl · 2023-06-19T17:07:02Z

Is there a patch somewhere to add the enum? Also, in general, we try to avoid enum in LORIS so we are not depending on doing a major or minor release for a data change. Instead, it might be best to create a separate table with the list of status and use the ID of the statuses in electrophysiology_uploader. Then later on, if you want to add a new status, this is just an INSERT statement instead of an ALTER TABLE.

Yes, here is the link to the PR for the patch: https://github.com/aces/HBCD/pull/895
I will keep this at an enum for now since it makes it easier to handle DEFAULT values. We can rediscuss later about that :)

laemtl changed the base branch from main to 24.1-release May 4, 2023 14:43

laemtl changed the title ~~Eeg decompressing script 24~~ [EEG] extract archive script May 4, 2023

[EEG] uncompress archive script

dc5855c

laemtl force-pushed the eeg-decompressing-script-24 branch from 4f547c1 to dc5855c Compare May 4, 2023 15:00

cmadjar requested changes May 4, 2023

View reviewed changes

cmadjar added this to the 24.1.11 milestone May 4, 2023

laemtl force-pushed the eeg-decompressing-script-24 branch from 94ec764 to cab2357 Compare May 4, 2023 21:15

Review comments

6fc4c4f

laemtl force-pushed the eeg-decompressing-script-24 branch from cab2357 to 6fc4c4f Compare June 19, 2023 19:00

laemtl merged commit fab7d1b into aces:24.1-release Jun 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EEG] extract archive script #970

[EEG] extract archive script #970

laemtl commented May 4, 2023 •

edited

Loading

cmadjar left a comment

laemtl commented May 5, 2023 •

edited

Loading

cmadjar commented May 5, 2023 •

edited by laemtl

Loading

cmadjar commented May 5, 2023

cmadjar commented May 5, 2023

laemtl commented Jun 16, 2023 •

edited

Loading

laemtl commented Jun 19, 2023 •

edited

Loading

[EEG] extract archive script #970

[EEG] extract archive script #970

Conversation

laemtl commented May 4, 2023 • edited Loading

cmadjar left a comment

Choose a reason for hiding this comment

laemtl commented May 5, 2023 • edited Loading

cmadjar commented May 5, 2023 • edited by laemtl Loading

cmadjar commented May 5, 2023

cmadjar commented May 5, 2023

laemtl commented Jun 16, 2023 • edited Loading

laemtl commented Jun 19, 2023 • edited Loading

laemtl commented May 4, 2023 •

edited

Loading

laemtl commented May 5, 2023 •

edited

Loading

cmadjar commented May 5, 2023 •

edited by laemtl

Loading

laemtl commented Jun 16, 2023 •

edited

Loading

laemtl commented Jun 19, 2023 •

edited

Loading