Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: multiple undefined behavior discovered with -fsanitize=undefined in DEV builds #2602

Closed
junliume opened this issue Dec 13, 2023 · 11 comments · Fixed by #2609
Closed
Assignees

Comments

@junliume
Copy link
Contributor

junliume commented Dec 13, 2023

Problem Description

[Suspected Issue]:
Uninitialized values are been used.

[Steps to Reproduce]

CXX=/opt/rocm/bin/amdclang++ CXXFLAGS='-Werror'  \
cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=debug \
-DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined \
-fno-sanitize-recover=undefined -Wno-option-ignored ' -DBUILD_DEV=On -DMIOPEN_USE_MLIR=ON \
-DMIOPEN_GPU_SYNC=Off  -DCMAKE_PREFIX_PATH="/opt/rocm" ..

LLVM_PATH=/opt/rocm/llvm CTEST_PARALLEL_LEVEL=4 MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0 \
make -j$(nproc) check MIOpenDriver

[List of Observed Issues]

  • smoke_miopendriver_gemm
    undefined bool here?

    bool deterministic;

    MIOpenDriver gemm -m 256 -n 512 -k 1024 -i 1 -V 1
    /home/user/MIOpen/src/gemm_v2.cpp:455:22: runtime error: load of value 48, which is not a valid value for type 'bool'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/user/MIOpen/src/gemm_v2.cpp:455:22 in 
    
  • test_bn_aux test_tensor_test test_main and many of TensorApiSetTensor/TestSetTensor

    MIOpen(HIP): miopenStatus_t miopenGetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t *, int *, int *){
    MIOpen(HIP): 	tensorDesc = {1, 32, 1, 1}, {32, 1, 1, 1}, packed, 
    /home/user/MIOpen/src/include/miopen/logger.hpp:264:15: runtime error: load of value 32764, which is not a valid value for type 'miopenDataType_t'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/user/MIOpen/src/include/miopen/logger.hpp:264:15
    
  • test_cba_inference test_na_inference test_na_train

    MIOpen(HIP): miopenStatus_t miopenGetActivationDescriptor(miopenActivationDescriptor_t, miopenActivationMode_t *, double *, double *, double *){
    MIOpen(HIP): 	activDesc = miopenActivationRELU, 0.5, 0.5, 0.5, 
    /home/user/MIOpen/src/include/miopen/logger.hpp:264:15: runtime error: load of value 32767, which is not a valid value for type 'miopenActivationMode_t'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/user/MIOpen/src/include/miopen/logger.hpp:264:15
    

Operating System

Ubuntu 22.04.3 LTS

CPU

AMD Ryzen Threadripper 3960X

GPU

AMD Radeon RX 7900 XT

Other

No response

ROCm Version

ROCm 5.7.1

ROCm Component

MIOpen

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@atamazov

This comment was marked as outdated.

@atamazov
Copy link
Contributor

I am working on this

@atamazov
Copy link
Contributor

@junliume

# ll /opt/rocm/llvm/bin/amdclang++
lrwxrwxrwx 1 root root 7 Sep  9 00:01 /opt/rocm/llvm/bin/amdclang++ -> amdllvm*
# ll /opt/rocm/llvm/bin/amdllvm
-rwxr-xr-x 1 root root 548112 Sep  9 00:01 /opt/rocm/llvm/bin/amdllvm*
# ll /opt/rocm/llvm/bin/clang++
lrwxrwxrwx 1 root root 5 Sep  9 00:00 /opt/rocm/llvm/bin/clang++ -> clang*
# ll /opt/rocm/llvm/bin/clang
lrwxrwxrwx 1 root root 8 Sep  9 00:00 /opt/rocm/llvm/bin/clang -> clang-17*
# ll /opt/rocm/llvm/bin/clang-17
-rwxr-xr-x 1 root root 164455928 Sep  9 00:00 /opt/rocm/llvm/bin/clang-17*

What is the difference between /opt/rocm/bin/amdclang++ and /opt/rocm/llvm/bin/clang++? Shall we use the former to build the library (host code) starting from certain ROCm version?

@atamazov
Copy link
Contributor

@junliume @JehandadKhan Please note that Jenkinsfile uses /opt/rocm/llvm/bin/clang++ everywhere.

@atamazov
Copy link
Contributor

atamazov commented Dec 13, 2023

@junliume WRT the 2nd and the 3rd bullets, these are related to the miopenGetTensorDescriptor and miopenGetActivationDescriptor calls. Explanation:

/// In bn_aux.cpp:
        miopenDataType_t dt; // The value passed to miopenGetTensorDescriptor is not initialized.
        miopenGetTensorDescriptor(derivedTensor, &dt, lens.data(), nullptr);
/// In tensor_api.cpp:
extern "C" miopenStatus_t miopenGetTensorDescriptor(miopenTensorDescriptor_t tensorDesc,
                                                    miopenDataType_t* dataType,
                                                    int* dimsA,
                                                    int* stridesA)
{
    MIOPEN_LOG_FUNCTION(tensorDesc, dataType, dimsA, stridesA);
...

Then MIOPEN_LOG_FUNCTION reads uninitialized value of dataType which leads to the sanitizer failure.

(1) We should avoid logging the output values in the miopenGet* functions because these may be uninitialized (and this is legal because getters (miopenGet*) do not read the outputs but only write them).

(2) Alternatively, we can move MIOPEN_LOG_FUNCTION just before return.

  • [benefit] The actual output values would be logged.
  • [drawback] rocTracer annotations would be delayed. This should not be a problem right now because miopenGet* calls do not run kernels, but who knows about the future?
  • [drawback] If an exception fires during the miopenGet* call, the call won't be logged.
  • [drawback] Mess in the detailed log. Normally we expect LOG_FUNCTION record at the entry point, then a series of LOG records with details, but these would be printed in reverse order.

(3) Combine both. Remove outputs from MIOPEN_LOG_FUNCTION as proposed in (1). Create the MIOPEN_LOG_FUNCTION_PARAMS macro and use it with outputs just before return, as proposed in (2). This approach is free of the drawbacks of (2) but keeps its benefits.

I am going to leverage approach (1) for now.

@junliume
Copy link
Contributor Author

@junliume @JehandadKhan Please note that Jenkinsfile uses /opt/rocm/llvm/bin/clang++ everywhere.

The request to use /opt/rocm/bin/amdclang++ is pretty recent, so far we have not observed any issues/differences with /opt/rocm/llvm/bin/clang++

@atamazov
Copy link
Contributor

@junliume not reproducible with rocm/miopen:ci_386a9d that contains 5.7.1.

@atamazov
Copy link
Contributor

atamazov commented Dec 14, 2023

@junliume Please check if https://github.com/atamazov/MIOpen/tree/fix-issue-2602 resolves the issues.

The same as a patch: fix-issue-2602.diff.txt

@atamazov
Copy link
Contributor

@JehandadKhan #2602 (comment) updated with approach (3) that might be useful in the future.

@junliume
Copy link
Contributor Author

@junliume Please check if https://github.com/atamazov/MIOpen/tree/fix-issue-2602 resolves the issues.

The same as a patch: fix-issue-2602.diff.txt

Thanks! I will be checking that branch out to test later today and provide results here.

@atamazov
Copy link
Contributor

@junliume I am glad to see that the fix works as expected ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants