[Bug] Segmentation Fault in libMKLDNNPlugin.so when using blobs with pre-allocated buffers #4742

Closed
tanmayv25 opened this issue Mar 11, 2021 · 5 comments · Fixed by #5000
Labels: bug (Something isn't working), ONNX (Related to support for ONNX standard), PSE, support_request

Comments

tanmayv25 commented Mar 11, 2021

System information (version)
  • OpenVINO => 2021.2.185
  • Operating System / Platform => 18.04.1-Ubuntu
  • Compiler => gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • Problem classification => Inference
Detailed description

For my use case, the input data is already in memory, so I create InferenceEngine::Blob objects from a pointer to the pre-allocated buffer and its size. I free the buffer once the inference results are retrieved, then create and set another blob for the next request. The first several runs succeed with the expected results; however, I see a segmentation fault after some iterations.

Under valgrind, it looks like libMKLDNNPlugin.so, during the Infer() call, accesses the memory address of one of the previous requests, which was already freed when that request completed. I expect that once SetBlob is called on the InferRequest it should completely override the previous SetBlob call, and the Infer call should use the latest blob for inference.
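In sketch form, the lifecycle I expect to be safe looks like this (desc, byte_size, and num_elems are placeholders here; the full reproducer appears below):

float* buf = static_cast<float*>(malloc(byte_size));  // caller-owned buffer
auto blob = InferenceEngine::make_shared_blob<float>(desc, buf, num_elems);
infer_request.SetBlob(input_name, blob);  // should fully replace the prior blob
infer_request.Infer();                    // should touch only the latest buffer
free(buf);                                // expected to be safe once Infer() returns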

See the valgrind log here:

==1842== Command: ./repeated_wraps /tmp/host/model.xml 10000000 CPU
==1842== 
==1842== Thread 2:
==1842== Invalid write of size 2
==1842==    at 0x4C38753: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1842==    by 0x7A1E24A: ??? (in /opt/intel/openvino_2021.2.185/deployment_tools/inference_engine/lib/intel64/libMKLDNNPlugin.so)
==1842==    by 0x7A7311C: ??? (in /opt/intel/openvino_2021.2.185/deployment_tools/inference_engine/lib/intel64/libMKLDNNPlugin.so)
==1842==    by 0x7A8D4DA: ??? (in /opt/intel/openvino_2021.2.185/deployment_tools/inference_engine/lib/intel64/libMKLDNNPlugin.so)
==1842==    by 0x7A2C59C: ??? (in /opt/intel/openvino_2021.2.185/deployment_tools/inference_engine/lib/intel64/libMKLDNNPlugin.so)
==1842==    by 0x5ACAAC1: tbb::interface7::internal::task_arena_base::internal_execute(tbb::interface7::internal::delegate_base&) const (in /opt/intel/openvino_2021.2.185/deployment_tools/inference_engine/external/tbb/lib/libtbb.so.2)
==1842==    by 0x4EDC5F0: ??? (in /opt/intel/openvino_2021.2.185/deployment_tools/inference_engine/lib/intel64/libinference_engine.so)
==1842==    by 0x51CE6DE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==1842==    by 0x62946DA: start_thread (pthread_create.c:463)
==1842==    by 0x57D371E: clone (clone.S:95)
==1842==  Address 0x9d810e0 is 0 bytes inside a block of size 4 free'd
==1842==    at 0x4C32D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1842==    by 0x10D4CB: main (in /root/inference_engine_cpp_samples_build/intel64/Release/repeated_wraps)
==1842==  Block was alloc'd at
==1842==    at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1842==    by 0x10CFE9: main (in /root/inference_engine_cpp_samples_build/intel64/Release/repeated_wraps)

I hacked the hello_classification sample code a little to repeatedly issue Infer requests against an IR model converted from an ONNX Identity model. The main section looks like:

for (int i = 0; i < 10000; i++) {
        // --------------------------- 6. Prepare input --------------------------------------------------------
        float* input_ptr = (float*)malloc(4);
        *input_ptr = 2.2f;

        // Blob::Ptr imgBlob = wrapMat2Blob(image);  // just wrap Mat data by Blob::Ptr without allocating of new memory
        InputInfo::Ptr input_info = network.getInputsInfo().begin()->second;
        auto tensor_desc = input_info->getTensorDesc();
        Blob::Ptr data_blob = make_shared_blob<float>(tensor_desc, input_ptr, 4);
        infer_request.SetBlob(input_name, data_blob);  // infer_request accepts input blob of any size
        // -----------------------------------------------------------------------------------------------------

        // --------------------------- 7. Do inference --------------------------------------------------------
        /* Running the request synchronously */
        infer_request.Infer();
        // -----------------------------------------------------------------------------------------------------

        // --------------------------- 8. Process output ------------------------------------------------------
        Blob::Ptr output = infer_request.GetBlob(output_name);
        float* output_ptr = output->buffer().as<float *>();
        if (*output_ptr != 2.2f) {
           std::cout << "mismatch found " << *output_ptr << std::endl;
        }
        // Print classification results
        free(input_ptr);
        // -----------------------------------------------------------------------------------------------------
}
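For reference, a possible caller-side workaround sketch (untested against this exact failure; it reuses infer_request and input_name from above) is to copy the value into the blob the request already owns instead of wrapping an external pointer with SetBlob:

InferenceEngine::Blob::Ptr input = infer_request.GetBlob(input_name);
auto minput = InferenceEngine::as<InferenceEngine::MemoryBlob>(input);
if (minput) {
    auto holder = minput->wmap();   // write-mapped view, valid in this scope
    holder.as<float*>()[0] = 2.2f;  // copy from the caller-owned buffer
}
infer_request.Infer();              // the plugin only ever sees its own memory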

Steps to reproduce

ov_segf_repro.zip
The attached zip file contains:

  1. The hacked sample, in the repeated_wraps directory.
  2. The model.onnx file.
  3. The compiled repeated_wraps binary. [redundant if compiling from the above source]
  4. The IR model directory. [redundant if converting from the above model.onnx]

Follow the below steps to reproduce the segmentation fault:

  1. Extract and copy the attached archive file into openvino/ubuntu18_dev container image.
  2. Convert the model.onnx to IR using model optimizer in the container:

root@1b6004fab03d:/opt/intel/openvino_2021.2.185/deployment_tools/model_optimizer# python3 mo.py --input_model model.onnx --input_shape [1]

  3. Copy the repeated_wraps folder to /opt/intel/openvino_2021.2.185/inference_engine/samples/cpp.
  4. Install the build tools (apt install build-essential), then change to the above directory.
  5. Run the ./build_samples.sh script.
  6. Move to ~/inference_engine_cpp_samples_build/intel64/Release.
  7. Run the compiled binary as below, pointing to the IR model generated in step 2.

./repeated_wraps <path_to_model.xml>/model.xml 10000000 CPU

A segmentation fault will occur.

Some example runs:

root@5309ddf0990e:/tmp/host/ov_segf_repro# ./repeated_wraps_bin IRModels/model.xml 1 CPU
This sample is an API example, for any performance measurements please use the dedicated benchmark_app tool
root@5309ddf0990e:/tmp/host/ov_segf_repro# ./repeated_wraps_bin IRModels/model.xml 2 CPU
This sample is an API example, for any performance measurements please use the dedicated benchmark_app tool
root@5309ddf0990e:/tmp/host/ov_segf_repro# ./repeated_wraps_bin IRModels/model.xml 100 CPU
Segmentation fault (core dumped)
Issue submission checklist
  • I report the issue; it's not a question
  • I checked the problem with documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution
  • There is reproducer code and related data files: images, videos, models, etc.
tanmayv25 added the bug and support_request labels Mar 11, 2021
Iffa-Intel self-assigned this Mar 12, 2021
Iffa-Intel added the ONNX label and removed the bug label Mar 15, 2021
jgespino (Contributor) commented:

Hi @tanmayv25

Thanks for reporting and for providing the necessary files and steps to reproduce! I was able to reproduce the segmentation fault with niter set to 3 and above. Interestingly, your application works fine when using a GPU device. We have opened a bug under the CPU plugin for the development team to investigate.

Regards,
Jesus

Ref. 51087

jgespino added the bug and PSE labels Mar 15, 2021
jgespino self-assigned this Mar 15, 2021
maxnick (Contributor) commented Mar 17, 2021

Hi @tanmayv25

Thanks for the reproducer. The attached model is a degenerate case that triggers an unusual call sequence with little relation to real-world use cases; it should work anyway, though, and will be fixed. However, I suspect this reproducer does not reveal the real problem. I suppose your original model is somewhat more complex, and the problem with it may have a different source. Am I right? If so, could you please provide the original model with which the problem initially occurred?

Regards,
Maksim

tanmayv25 (Author) commented Mar 17, 2021

Hi @maxnick,
Thanks for the quick response.
I am working on the OpenVINO backend for Triton. The model that I shared (the identity model) comes from Triton's CI testing; it is an essential test for productizing the backend. The identity model helps in understanding the overhead Triton adds in the maximum-throughput case. It is also the case I have been able to consistently reproduce outside Triton.

That being said, I have tried running valgrind on Triton+OV with a resnet50_int8 model and saw the following reports:

==5608== Invalid read of size 32
==5608==    at 0x182A64090: ???
==5608==    by 0x2: ???
==5608==    by 0x17AB3F63F: ???
==5608==    by 0x2: ???
==5608==  Address 0x189c5bf90 is 3,984 bytes inside a block of size 4,004 alloc'd
==5608==    at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==5608==    by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==5608==    by 0x17466FB5D: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x17463F05D: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x17463F554: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x17306A3B7: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x1730A5D53: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x174512F48: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x1746BE1CF: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x1746C00C7: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x174689CED: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)
==5608==    by 0x174689EBB: ??? (in /opt/tritonserver/backends/openvino/libMKLDNNPlugin.so)

This doesn't cause segfaults like the model I shared before, and I don't find any inconsistencies in the results. The output is 1001 FP32 values, one probability per class. I don't understand why libMKLDNNPlugin.so would try to read 32 bytes at the margin; it looks like an artifact of AVX instructions. Unfortunately, I don't have a simple reproducer for this latter case, but I can share one if it is of interest.
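For context, a tail read like this is typically a vectorized kernel loading a full 32-byte lane past the last valid element. A speculative caller-side mitigation (the padding is my assumption, not a documented OpenVINO requirement) would be to over-allocate the output buffer to a 32-byte multiple:

#include <cstdlib>  // posix_memalign

// Hypothetical mitigation: pad a 1001-float allocation so a full 32-byte
// AVX load at the tail stays inside the block.
size_t bytes = 1001 * sizeof(float);
size_t padded = (bytes + 31) & ~static_cast<size_t>(31);  // round up to a 32-byte multiple
void* out_buf = nullptr;
if (posix_memalign(&out_buf, 32, padded) == 0) {
    // ... wrap out_buf in a Blob as before, then free(out_buf) when done
}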

tanmayv25 (Author) commented Apr 12, 2021

@maxnick Has there been any progress on resolving this issue? An identity model is the simplest case and serves various utilities for framework use cases.

jgespino (Contributor) commented:

@tanmayv25 Take a look at PR #5000 for updates.

Regards,
Jesus
