Have you encountered a memory leak when running the model? #9
Yes, I also hit the memory leak. The 1 * 1 convolution version (branch adjusted) may use less memory. I think the leak comes from my implementation of local PiCANet, which turns an H * W * C tensor into H * W patches of size 14 * 14 * C. If you have a better idea for implementing local PiCANet, please comment here or open a pull request. Since I am not the author of the paper, this code is not the best implementation. I'm sorry for that. |
Yes, I noticed the batch size option; it is weird. I have no better idea so far, but I hope we can discuss it further. This week I will go through the authors' Caffe code to see how the PyTorch and Caffe implementations differ, and dig deeper into local PiCANet and global PiCANet. |
Can you give me the link to the Caffe implementation? I didn't know about it. Thanks. |
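To make the cost of the patch layout described above concrete, here is a rough back-of-the-envelope sketch. The layer size (56 * 56 * 128) and the float32 element size are assumptions chosen only for illustration, not values taken from the repository.

```python
def patch_tensor_bytes(h, w, c, k=14, bytes_per_element=4):
    """Bytes needed to hold H*W patches of size k*k*C (float32 by default)."""
    return h * w * k * k * c * bytes_per_element


def feature_map_bytes(h, w, c, bytes_per_element=4):
    """Bytes needed for the original H*W*C feature map."""
    return h * w * c * bytes_per_element


# Hypothetical layer size, chosen only for illustration.
h, w, c = 56, 56, 128
original = feature_map_bytes(h, w, c)
patches = patch_tensor_bytes(h, w, c)
ratio = patches // original
print(f"feature map: {original / 2**20:.1f} MiB")
print(f"patches:     {patches / 2**20:.1f} MiB ({ratio}x larger)")
```

Whatever the layer size, materializing every 14 * 14 neighbourhood multiplies the memory of that tensor by 14 * 14 = 196 per sample, which is consistent with the small batch sizes discussed in this thread.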
@Ugness https://github.com/nian-liu/PiCANet, with deeplab caffe version. |
Thanks a lot. |
@Ugness I changed the PiCANet config from 'GGLLL' to 'GGGGG' and to 'LLLLL'; both have the memory leak problem when running network.py. Have you seen this before? I also found something interesting in the authors' Caffe code: they seem to have implemented their own attpooling function in their proto cpp, which supports their global or local attention function, like conv3d. Can you give me a hint about how you think about the conv3d processing? |
I think that would not work with 'GGGGG' or 'LLLLL'. I only tested with 'GGLLL', and other options may cause tensor dimension errors. I will check the proto cpp ASAP. |
How does Conv3d work?
Assumption
What's the difference between convolution and the 'PiCA' process?
PiCA process with Conv3d (main idea of the method)
I used the same idea for local PiCANet. |
X_X |
@Ugness I do not think they use a loop to implement PiCANet. They use im2col and col2im, which are torch.nn.Unfold and torch.nn.Fold in PyTorch. I suppose Conv3d can be translated into a combination of several im2col + matrix multiplication + col2im steps, but I am still confused about how to implement this and am still working on it. |
Thanks. I am also trying to convert the conv3d operation into a combination of matrix multiplications. |
@Sucran I think I can improve my model soon. There was no function like torch.nn.Fold in PyTorch 0.4.0 when I started this project. Now I have found the function that I need. Thanks. |
Oh, really? Amazing! @Ugness You are such a genius. |
Hi @Sucran, I wrote new logic! |
@Ugness Soooooo happy that it works! I checked the Fold_Unfold branch, and the memory leak problem seems to be gone. VRAM usage is also lower, so the batch size can be increased, though not to 10. I will check the channel settings of each layer against the authors' Caffe version; maybe some misunderstanding still exists. |
@Sucran Thanks a lot for your interest. It led to a big improvement. Training speed also seems to have improved. |
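A toy sketch of the im2col idea mentioned above, written in plain Python so it runs without PyTorch. It is not the repository's code or the authors' Caffe code: the helper names `unfold2d` and `local_attend` are hypothetical, it handles a single channel, and it ignores dilation. Each location's neighbourhood is flattened into a row (the im2col step), then combined with that location's own attention weights (the matrix multiplication step). Because the result is one value per location, no col2im step is needed in this simplified case.

```python
def unfold2d(feature, k, pad):
    """im2col for one channel: for each location, return its k*k
    neighbourhood as a flat list (zero padded at the borders)."""
    h, w = len(feature), len(feature[0])
    columns = []
    for i in range(h):
        for j in range(w):
            patch = []
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    inside = 0 <= y < h and 0 <= x < w
                    patch.append(feature[y][x] if inside else 0.0)
            columns.append(patch)
    return columns  # shape: (h*w, k*k)


def local_attend(feature, attention, k, pad):
    """Weighted sum of each neighbourhood with that location's own weights.
    `attention` has shape (h*w, k*k)."""
    h, w = len(feature), len(feature[0])
    columns = unfold2d(feature, k, pad)
    flat = [sum(a * v for a, v in zip(att, col))
            for att, col in zip(attention, columns)]
    return [flat[r * w:(r + 1) * w] for r in range(h)]


# Tiny example: 3x3 map, 3x3 neighbourhood, uniform attention weights.
feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]
uniform = [[1.0 / 9.0] * 9 for _ in range(9)]
out = local_attend(feature, uniform, k=3, pad=1)
```

With uniform weights the centre output is simply the mean of the whole map, which makes the result easy to check by hand.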
@Ugness Ok. Thanks for your work again. It is my pleasure. |
@Ugness Anything new? |
One of my models got about 88 on the F-measure with 200 samples of DUTS-TE, where the model in the paper scored 87, so I am now measuring the score on all of DUTS-TE for every checkpoint. That takes a while. I am sure that the new model (with a bigger batch_size) performs much better. |
I updated and merged the branch. |
@Ugness So the result is from the original branch (3*3 conv), not the adjusted one (1*1 conv)? It seems to outperform the authors' version? Does the curve you plotted correspond to training or validation? |
No, it's the adjusted one. I used 1*1 conv. I think I need to check all of the code carefully. Maybe there is something wrong. |
@Ugness Hi, I am trying to reproduce your result, but I am confused about how to compute the metrics you reported. I have trained model weights, but which file contains the test code? |
You can check the measuring code in pytorch/measure_test.py. It will report the result on tensorboard, and you can download csv from tensorboard. |
Hi @Ugness, did you check your test code for computing Max F_b and MAE? I think there are problems here.
|
For example, if threshold=0.7 and the predicted value is 0.8, I set 0.8 to 1, the same way as when making a PR curve. |
@Ugness I do not think the scikit-learn API provides a correct way to compute max F-beta, but you can refer to Chapter 3.2 of the paper "Salient Object Detection: A Survey". Usually, a fixed threshold is swept from 0 to 255 to binarize the saliency map and compute precision and recall. F-beta is computed from the average precision and average recall over all images. Then we pick the maximum as max F-beta. |
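The procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in measure_test.py: images are flattened lists of values in 0..255, masks are 0/1 lists, a pixel counts as positive when its value is at or above the threshold, precision and recall are taken as 0 when undefined, and beta squared defaults to 0.3 (a value commonly used in the saliency literature, assumed here).

```python
def precision_recall(pred, mask, threshold):
    """Binarize one saliency map (values 0..255) and compare with a 0/1 mask."""
    tp = fp = fn = 0
    for p, m in zip(pred, mask):
        positive = p >= threshold
        if positive and m:
            tp += 1
        elif positive:
            fp += 1
        elif m:
            fn += 1
    # Convention assumed here: undefined precision/recall counts as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def max_f_beta(preds, masks, beta_sq=0.3):
    """Average precision and recall over all images for each threshold,
    compute F-beta from the averages, and return the maximum."""
    best = 0.0
    for threshold in range(256):
        pairs = [precision_recall(p, m, threshold) for p, m in zip(preds, masks)]
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        avg_r = sum(r for _, r in pairs) / len(pairs)
        denom = beta_sq * avg_p + avg_r
        if denom > 0:
            best = max(best, (1 + beta_sq) * avg_p * avg_r / denom)
    return best


# Two tiny made-up "images" of four pixels each, for illustration only.
preds = [[200, 180, 30, 10], [250, 40, 20, 5]]
masks = [[1, 1, 0, 0], [1, 0, 0, 0]]
score = max_f_beta(preds, masks)
```

Note that the averaging happens before F-beta is computed, so the result generally differs from averaging per-image F-beta scores.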
And about your memory problem, how much VRAM and RAM do you have? |
@Ugness I think the F-score code you showed is correct. It is almost the same as what I reported 7 days ago, right? I just set the number of thresholds to 100 and you set it to 256, which should not cause much difference. But was the result still 0.854 when you tested? |
I also think it is strange. And I have a few questions to compare our results.
|
|
Thank you for answering. |
@Ugness Sorry, the threshold I tested is 0.8. I have not tested your option yet; I need to wait for an available GPU in my lab. |
What do you mean by without modifying the dataset? |
@Ugness I mean there should be 5019 images in DUTS-TE without deleting the mismatching files, so you should test on all 5019 images. |
Even though DUTS-TE-Mask has 2 more images than DUTS-TE-Image? |
Hi @Ugness, I integrated your measure.py and train.py files, but I did not change network.py. I set batch_size to 2. At the first learning rate drop my training loss decreases, but after that, although the learning rate keeps dropping, the training loss never decreases. I also tested my model on PASCAL-S, and the best MAE is 0.1243. Could you help me solve this problem? |
@RaoHaobo Can you give me some screenshots of your loss graph? You can find it in TensorBoard. |
I think that graph looks fine. But if you think the loss should be lower, I recommend increasing the lr decay rate and the lr decay step. For the hyperparameters in my code, I just followed the implementation in the PiCANet paper with the DUTS dataset. I'll let you know the specific scores when I find the past results. |
@dylanqyuan Your tensorboardX version is too high. |
It works! Thank you buddy! |
@RaoHaobo #16 (comment) My graph also fluctuates like yours and does not look like it is decreasing. If you want to check your model's performance, I suggest following the steps in the link.
p.s. Please comment at #17 if you want to discuss this issue further, to make it easy to find! |
@Ugness I tested your '36epo_383000step.ckpt' on PASCAL-S, and the result is |
@Ugness The second problem has been solved, but the first has not. |
Sorry. I forgot to mention that all of my experiment results are on DUTS dataset only. I updated my readme file. |
@Ugness Ok, |
@Ugness This code is in your measure_test.py. |
I used .sum(dim=-1) because my code evaluates several images in parallel. |
@Ugness I mean the tp + 1e-10: maybe the 1e-10 should be removed. I tried removing it, but max_F dropped a lot. |
How much difference results from the error? |
When the threshold equals 1, prec must be 0, but your result equals 1. |
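The code being discussed is not quoted in this thread, so the following sketch assumes it has the form `(tp + 1e-10) / (tp + fp + 1e-10)`. Under that assumption it shows why precision comes out as 1 when no pixel passes the threshold, and how placing the epsilon only in the denominator gives 0 instead. The function names are hypothetical.

```python
EPS = 1e-10


def precision_eps_both(tp, fp):
    """Epsilon in numerator and denominator: 0/0 becomes EPS/EPS = 1."""
    return (tp + EPS) / (tp + fp + EPS)


def precision_eps_denominator(tp, fp):
    """Epsilon only in the denominator: 0/0 becomes 0/EPS = 0."""
    return tp / (tp + fp + EPS)


# No pixel passes the threshold, so there are no true or false positives.
empty_both = precision_eps_both(0, 0)
empty_denominator = precision_eps_denominator(0, 0)

# With real counts the two variants are nearly identical.
typical_gap = abs(precision_eps_both(80, 20) - precision_eps_denominator(80, 20))
```

The two variants only disagree in the degenerate case with no positive predictions, so the choice mainly affects the scores at the highest thresholds.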
@Ugness |
https://github.com/tensorflow/tensorboard/releases |
Hi, @Ugness
I hit a RAM memory leak when running network.py and train.py, and this issue has confused me for a few days. I have run other PyTorch repos without problems.
I run the code on Ubuntu 14.04, PyTorch 0.4.1, CUDA 8.0, cuDNN 6.0.