Running Auto3d with instance22 works with all networks. When I duplicated the entries in the data JSON to simulate a larger dataset, all networks still worked except SwinUNETR (`swinunetr`).
To Reproduce
1 - Use Auto3d with instance22 and the attached dataset.json. I changed its extension to .txt because .json files could not be uploaded.
2 - Run the script below, which triggers only swinunetr.
train_1_node(){
FOLDER="/workspace/${WORK_DIR}/${MODEL}_${FOLD}"
rm -r $FOLDER/model_fold$FOLD
CONF_FOLDER=${FOLDER}"/configs"
rm ${FOLDER}/${MODEL}.log
(time \
torchrun --nnodes=1 --nproc_per_node=8 \
${SCRIPT} run \
--config_file "['${CONF_FOLDER}/hyper_parameters.yaml','${CONF_FOLDER}/network.yaml','${CONF_FOLDER}/transforms_train.yaml','${CONF_FOLDER}/transforms_validate.yaml']" \
$EXTRA_PRAMS ) 2>&1 | tee -i -p ${FOLDER}/${MODEL}.log
}
swinunetr(){
MODEL="swinunetr"
SCRIPT="-m ${WORK_DIR}.${MODEL}_${FOLD}.scripts.train"
## the new parameters make it run for 20,000 epochs !! force it to 1,500
EXTRA_PRAMS=" --num_images_per_batch 16"
EXTRA_PRAMS=$EXTRA_PRAMS" --num_patches_per_image 1"
EXTRA_PRAMS=$EXTRA_PRAMS" --num_iterations 1500"
EXTRA_PRAMS=$EXTRA_PRAMS" --num_iterations_per_validation 100"
EXTRA_PRAMS=$EXTRA_PRAMS" --num_sw_batch_size 36"
train_1_node
}
swinunetr
Error
epoch 8/210
learning rate is set to 0.0001
[2022-11-29 21:44:18] 1/7, train_loss: 0.4237
[2022-11-29 21:44:19] 2/7, train_loss: 0.4575
2022-11-29 21:44:25,647 - > collate dict key "image" out of 4 keys
2022-11-29 21:44:25,701 - >> collate/stack a list of tensors
2022-11-29 21:44:25,705 - >> E: stack expects each tensor to be equal size, but got [1, 96, 96, 64] at entry 0 and [1, 96, 95, 64] at entry 10, shape [(1, 96, 96, 64), (
1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 95, 64),
(1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64)] in collate([tensor([[[[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.16601867, 0.11132774, 0.97981832, -12.53823159],
[ 0. , 0. , 0. , 1. ]])},
id: 140606314046512,
orig_size: (96, 96, 64)},
id: 140604144127376,
orig_size: (96, 96, 64)},
id: 140604144127184,
orig_size: (96, 96, 64)}]
Is batch?: False] ... )
2022-12-06 20:32:04,170 - > collate dict key "label" out of 4 keys
2022-12-06 20:32:04,219 - >> collate/stack a list of tensors
Expected behavior
As the error log shows, training actually starts and runs for anywhere from 1 to 10 epochs before it errors out. I expected it to continue running.
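To make the shape mismatch in the log easier to reason about, here is a minimal sketch in plain Python. It takes the 16 patch shapes reported by the collate error, finds the entry that differs from the rest, and computes how much padding would bring it back to the expected size. The helper names `find_odd_shapes` and `symmetric_pad` are hypothetical and are not part of MONAI or PyTorch. Padding the short patch is only an assumed workaround; the log alone does not establish why one crop came out as 95 voxels instead of 96.

```python
from collections import Counter


def find_odd_shapes(shapes):
    """Return (index, shape) pairs that differ from the most common shape."""
    expected, _ = Counter(shapes).most_common(1)[0]
    return [(i, s) for i, s in enumerate(shapes) if s != expected]


def symmetric_pad(shape, target):
    """Per-dimension (before, after) padding that grows `shape` to `target`.

    Dimensions that are already large enough get (0, 0); an odd amount of
    missing voxels puts the extra voxel on the "after" side.
    """
    pads = []
    for size, want in zip(shape, target):
        missing = max(want - size, 0)
        pads.append((missing // 2, missing - missing // 2))
    return pads


# Shapes copied from the error log: 16 patches, entry 10 is one voxel short.
expected_shape = (1, 96, 96, 64)
shapes = [expected_shape] * 16
shapes[10] = (1, 96, 95, 64)

odd = find_odd_shapes(shapes)
print(odd)  # [(10, (1, 96, 95, 64))]
print(symmetric_pad(odd[0][1], expected_shape))  # [(0, 0), (0, 0), (0, 1), (0, 0)]
```

A check like this could run on the patch shapes just before collation to log which sample produced the short crop, which would help tell whether the duplicated dataset entries or the cropping step is responsible.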