
Auto3D Swinunet fails with Instance22 dataset #5742

Closed

AHarouni opened this issue Dec 14, 2022 · 3 comments

@AHarouni

Running Auto3D with Instance22 works with all networks. When I duplicated the data JSON to simulate a larger dataset, all networks still worked except SwinUNETR.

To Reproduce
1 - Use Auto3D with Instance22 and the attached dataset.json. I changed the extension to .txt since .json is not supported for uploading; a sketch of how I duplicated the entries follows the script below.
2 - Run the script below to trigger only swinunetr.

# Launch a single-node, 8-GPU training run for the currently selected model/fold.
train_1_node(){
    FOLDER="/workspace/${WORK_DIR}/${MODEL}_${FOLD}"
    rm -r "${FOLDER}/model_fold${FOLD}"
    CONF_FOLDER="${FOLDER}/configs"
    rm "${FOLDER}/${MODEL}.log"

    (time \
    torchrun --nnodes=1 --nproc_per_node=8 \
        ${SCRIPT} run \
        --config_file "['${CONF_FOLDER}/hyper_parameters.yaml','${CONF_FOLDER}/network.yaml','${CONF_FOLDER}/transforms_train.yaml','${CONF_FOLDER}/transforms_validate.yaml']" \
        ${EXTRA_PARAMS} ) 2>&1 | tee -i -p "${FOLDER}/${MODEL}.log"
}

swinunetr(){
    MODEL="swinunetr"
    SCRIPT="-m ${WORK_DIR}.${MODEL}_${FOLD}.scripts.train"
    ## new parameters make it run for 20,000 epochs!! force it to 1,500 iterations
    EXTRA_PARAMS=" --num_images_per_batch 16"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_patches_per_image 1"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_iterations 1500"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_iterations_per_validation 100"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_sw_batch_size 36"
    train_1_node
}

swinunetr

swinunetr
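
For reference, a minimal sketch of how I duplicated the datalist entries (assuming an MSD-style dataset.json with a "training" list; the factor and file names are illustrative, not the exact script used):

import json

# Load the original datalist and repeat every training entry to simulate
# a larger dataset.
with open("dataset.json") as f:
    data = json.load(f)
data["training"] = data["training"] * 4
with open("dataset_duplicated.json", "w") as f:
    json.dump(data, f, indent=2)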

Error

epoch 8/210
learning rate is set to 0.0001
[2022-11-29 21:44:18] 1/7, train_loss: 0.4237
[2022-11-29 21:44:19] 2/7, train_loss: 0.4575
2022-11-29 21:44:25,647 - > collate dict key "image" out of 4 keys
2022-11-29 21:44:25,701 - >> collate/stack a list of tensors
2022-11-29 21:44:25,705 - >> E: stack expects each tensor to be equal size, but got [1, 96, 96, 64] at entry 0 and [1, 96, 95, 64] at entry 10, shape [(1, 96, 96, 64), (
1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 95, 64), 
(1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64)] in collate([tensor([[[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],

[  0.16601867,   0.11132774,   0.97981832, -12.53823159],
       [  0.        ,   0.        ,   0.        ,   1.        ]])},
                                id: 140606314046512,
                                orig_size: (96, 96, 64)},
                  id: 140604144127376,
                  orig_size: (96, 96, 64)},
    id: 140604144127184,
    orig_size: (96, 96, 64)}]
Is batch?: False] ... )
2022-12-06 20:32:04,170 - > collate dict key "label" out of 4 keys
2022-12-06 20:32:04,219 - >> collate/stack a list of tensors

Expected behavior
As you can see from the error log, it actually starts training for 1 (sometimes up to 10) epochs and then errors out. Expected it to continue running to completion.

@tangy5
Contributor

tangy5 commented Dec 14, 2022

Thanks, investigating this now.

@Nic-Ma
Contributor

Nic-Ma commented Dec 15, 2022

Hi @tangy5 ,

Seems like the error is because the images have different shapes ((1, 96, 96, 64) vs (1, 96, 95, 64)) when stacking them?

Thanks.
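
If the random crops can legitimately come out one voxel short, one possible workaround (a sketch only, not necessarily how Auto3D handles it) is MONAI's pad_list_data_collate, which pads every item in a batch to the largest shape before stacking; the toy dataset below stands in for the real pipeline:

import torch
from monai.data import DataLoader, Dataset, pad_list_data_collate

# Two patches whose spatial sizes differ by one voxel, as in the log above.
items = [
    {"image": torch.zeros(1, 96, 96, 64), "label": torch.zeros(1, 96, 96, 64)},
    {"image": torch.zeros(1, 96, 95, 64), "label": torch.zeros(1, 96, 95, 64)},
]

# pad_list_data_collate pads each tensor to the per-batch maximum shape,
# avoiding the "stack expects each tensor to be equal size" error.
loader = DataLoader(Dataset(items), batch_size=2, collate_fn=pad_list_data_collate)
batch = next(iter(loader))
print(batch["image"].shape)  # torch.Size([2, 1, 96, 96, 64])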

@Nic-Ma Nic-Ma added the question Further information is requested label Dec 15, 2022
@wyli
Contributor

wyli commented Apr 19, 2023

Probably addressed by #5950; please feel free to reopen if you see the same with a recent version.

@wyli wyli closed this as completed Apr 19, 2023