
[Issue #1359] hipModuleLoad/hipModuleUnload is leaking file descriptor #2223

Closed · junliume opened this issue Jun 24, 2023 · 6 comments · Fixed by #2225

Comments

@junliume (Contributor) commented Jun 24, 2023

(Thanks to @JehandadKhan for finding the issue and creating the minimal reproducible example; I am creating this issue on GitHub on his behalf.)

  • [Symptom]:

During MIOpen exhaustive tuning (e.g. MIOPEN_FIND_ENFORCE=3), MIOpen needs to open a large number of modules. Even though we proactively close the ones that are no longer in use (e.g. #1221), the program still fails with:

terminate called after throwing an instance of 'miopen::Exception'
  what():  /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/MLOpen/src/hipoc/hipoc_program.cpp:162: Failed creating module from file

This issue was previously discussed in #1359 and #1360.

  • [Steps to reproduce and Findings]:

Using the minimal reproducible example below:
module_api.zip

(requires a HIP-enabled Docker image; in this case, any public ROCm image should be able to reproduce the issue)

ROCm version where this issue can be reproduced:
  • ROCm 5.5: reproducible
  • ROCm 5.4.3: not reproducible

cd module_api
make clean
make
./hip_module_api

and we will see

Iteration: 65399
An error encountered: "shared object initialization failed" at main.hip:57
  • [Hypothesis]:

The test code is very simple; essentially:

while(true) {
    std::cout << "Iteration: " << ++idx << std::endl;
    // std::this_thread::sleep_for(std::chrono::milliseconds(50));
    hipModule_t module;
    HIP_CHECK(hipModuleLoad(&module, module_path.u8string().c_str()));
    HIP_CHECK(hipModuleUnload(module));
}
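For reference, the HIP_CHECK macro used above is the usual error-checking idiom; the actual definition lives in module_api.zip and is not shown here, so the following is an assumed sketch that matches the error format in the output above:

#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>

// Assumed sketch of HIP_CHECK; the real macro is in module_api.zip.
#define HIP_CHECK(expr)                                                   \
    do {                                                                  \
        const hipError_t status_ = (expr);                                \
        if(status_ != hipSuccess) {                                       \
            std::cerr << "An error encountered: \""                       \
                      << hipGetErrorString(status_) << "\" at "           \
                      << __FILE__ << ":" << __LINE__ << std::endl;        \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while(0)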

However, using the following monitoring command:

watch -n1 'cat /proc/sys/fs/file-nr | tee -a log.log'

we can observe that the number of file descriptors left open (leaked) keeps increasing until the HIP program crashes.
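Since file-nr reports system-wide counts, it can also help to confirm that the descriptors belong to the test process itself by counting /proc/self/fd entries inside the loop. A minimal sketch (Linux, C++17; open_fd_count is a hypothetical helper, not part of the original reproducer):

#include <cstddef>
#include <filesystem>

// Count this process's currently open file descriptors (Linux only).
static std::size_t open_fd_count()
{
    std::size_t count = 0;
    for(const auto& entry : std::filesystem::directory_iterator("/proc/self/fd"))
    {
        (void)entry; // one /proc/self/fd entry per open descriptor
        ++count;
    }
    return count;
}

// Usage inside the reproducer loop:
//   std::cout << "open FDs: " << open_fd_count() << std::endl;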

  • [Discussions]:
  1. Is hipModuleUnload leaking file descriptors?
  2. The number of file descriptors left open (leaked) causes crashes well before it reaches ulimit, so is there any HIP-internal limit on the number of open file descriptors?

CC: @atamazov @averinevg @DrizztDoUrden @dmikushin @CAHEK7

@dmikushin (Contributor)

One quick workaround that comes to mind is to replace hipModuleLoad with hipModuleLoadData on a memory buffer instead of a file, and to load the module file's contents into that buffer ourselves, with a proper open and close.
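A minimal sketch of that workaround (untested; loadModuleFromBuffer is a hypothetical helper name), assuming the code object fits in memory:

#include <hip/hip_runtime.h>
#include <fstream>
#include <vector>

// Read the code object ourselves, close the file, then hand the bytes to
// hipModuleLoadData so the runtime never owns a descriptor for the file.
hipError_t loadModuleFromBuffer(hipModule_t* module, const char* path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if(!file)
        return hipErrorFileNotFound;
    const std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> image(static_cast<std::size_t>(size));
    if(!file.read(image.data(), size))
        return hipErrorFileNotFound;
    file.close(); // descriptor released here, before the runtime is involved
    return hipModuleLoadData(module, image.data());
}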

@dmikushin (Contributor)

I think the "leak" can be legitimate due to the lazy binding of executable code. Suppose you have a DLL and you have loaded one symbol from it for use. Could the DLL then ever be fully unloaded? No, because the lazy loader cannot guarantee that the loaded code will never jump to some other address inside that DLL. Therefore, a single lazily-loaded symbol can hold the DLL's file descriptor forever.
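One way to probe this hypothesis on the host side is to check whether a shared object stays mapped after dlclose. This is only an analogy (the HIP runtime's loader may behave quite differently from the system dynamic loader, and libz is just an example library):

#include <dlfcn.h>
#include <fstream>
#include <iostream>
#include <string>
// build: g++ probe.cpp -ldl

// Returns true if any mapping in /proc/self/maps mentions `name` (Linux only).
static bool is_mapped(const std::string& name)
{
    std::ifstream maps("/proc/self/maps");
    std::string line;
    while(std::getline(maps, line))
        if(line.find(name) != std::string::npos)
            return true;
    return false;
}

int main()
{
    void* handle = dlopen("libz.so.1", RTLD_LAZY);
    if(!handle)
        return 1;
    (void)dlsym(handle, "zlibVersion"); // bind a single symbol lazily
    std::cout << "mapped before dlclose: " << is_mapped("libz") << '\n';
    dlclose(handle);
    std::cout << "mapped after dlclose:  " << is_mapped("libz") << '\n';
    return 0;
}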

@junliume (Contributor, Author) commented Jun 24, 2023

Thanks @dmikushin for the quick reply!

This is a regression from ROCm 5.4.3 that began to manifest in ROCm 5.5; I guess something has indeed changed in hipModuleUnload.

@junliume (Contributor, Author)

A fix in the HIP runtime/ROCr has been proposed and should be merged soon.

junliume closed this as completed Jul 6, 2023
@atamazov (Contributor) commented Jul 6, 2023

@junliume Which ROCm versions have this bug? #2225 can be narrowed to provide a workaround for them.

@junliume (Contributor, Author) commented Jul 7, 2023

ROCm 5.6.1 should have it fixed, as should all releases thereafter (ROCm 5.7, etc.).
