[Fp16] MIOpen integration or layout transpose issue with FP16_INF and FP16_MAX #2496

Open
junliume opened this issue Nov 1, 2023 · 8 comments



junliume commented Nov 1, 2023

[Observations]
On the gfx90a platform, with the attached input file input.txt, run:

# /opt/rocm/bin/MIOpenDriver convfp16 -n 5 -c 1 -H 3 -W 3 -k 3 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0 --dump_output 1 --in_data input.txt.bin --dout_data dout.txt.bin
MIOpenDriver convfp16 -n 5 -c 1 -H 3 -W 3 -k 3 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0 --dump_output 1 --in_data input.txt.bin --dout_data dout.txt.bin
Read data from input file input.txt.bin
Could not open file dout.txt.bin for reading
PRNG seed: 12345678
Wrote output to file dump_in.bin
Wrote output to file dump_dout.bin
MIOpen Backward Weights Conv. Algorithm: 5, Solution: 110/ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC
GPU Kernel Time Backward Weights Conv. Elapsed: 0.024996 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 5, 1, 3, 3, 1, 1, 3,  270, 0, 0, 0, 0, 0.024996
Wrote output to file dump_bwd_dwei_gpu.bin

The iGEMM kernel gave the output:

[65504. 65504. 65504.]

But the expected output (from direct convolution) should be:

[inf, inf, inf]

[Analysis]

  • 65504 is FP16_MAX
  • It has been determined that the iGEMM kernel actually generates INF
  • @JehandadKhan @atamazov It is now suspected that the INF is reinterpreted as MAX during the data-layout transpose or MIOpen integration (see the illustrative sketch below)

@carlushuang could you help comment on this issue in case anything is missing? Thanks!
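For illustration, here is a minimal standalone sketch (not MIOpen code; it assumes a compiler with the C _Float16 extension, e.g. a recent clang) contrasting a plain fp32-to-fp16 conversion with a saturating conversion that clamps at FP16_MAX. The clamped path reproduces the 65504 result, while the plain conversion yields inf:

    /* Illustration only, not MIOpen code. Requires _Float16 support. */
    #include <stdio.h>

    int main(void)
    {
        const float max_val = 65504.0f;  /* FP16_MAX */
        float acc = 3.0f * 65504.0f;     /* fp32 value that overflows fp16 */

        /* Saturating cast: clamp to FP16_MAX before converting. */
        _Float16 clamped = (_Float16)(acc >= max_val ? max_val : acc);
        /* Plain cast: IEEE round-to-nearest overflows to +inf. */
        _Float16 plain = (_Float16)acc;

        printf("clamped = %f\n", (double)clamped); /* 65504.000000 */
        printf("plain   = %f\n", (double)plain);   /* inf */
        return 0;
    }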

@carlushuang

The hotfix is here:
#2497


carlushuang commented Nov 1, 2023

The root cause: in this case the wrw kernel uses atomics, so the solver uses the CastTensor() function to cast from fp32 to fp16.
However, in the source code of the cast kernel, src/kernels/MIOpenSubTensorOpWithCastTensorKernel.cl, the value is clamped to MAX:

            _FLOAT_SRC temp_src = *(src + sindex + srcOffset);
#if MIOPEN_SRC_TYPE == 3 && MIOPEN_DST_TYPE == 4
            /* bfloat16 destination: converted without clamping */
            temp_src *= alpha;
            *(dst + dindex + dstOffset) = float_to_bfloat16(temp_src);
#else
            /* Other destinations: values at or above MAX_VAL are clamped to
               MAX_VAL, so an fp32 result that overflows fp16 is written as
               65504 instead of inf. */
            bool over_flow = (alpha * ((float)temp_src)) >= ((float)MAX_VAL);
            *(dst + dindex + dstOffset) =
                (_FLOAT_DST)(over_flow ? MAX_VAL : alpha * ((float)temp_src));
#endif
        }

@junliume I think someone needs to evaluate why the clamp to MAX is needed here.


atamazov commented Nov 1, 2023

@junliume @JehandadKhan @carlushuang First of all, it should be noted that the convolution code we generate is focused on performance, not IEEE-754 correctness. Convolutions should provide maximum performance, provided that the accuracy is sufficient for the neural networks to function correctly.

That is why the OpenCL kernels (for example) are compiled with the -cl-fast-relaxed-math option, which neither guarantees high precision for many operations nor provides correct handling of special numbers. In particular, this option enables -cl-finite-math-only, whose meaning is obvious. See https://man.opencl.org/clBuildProgram.html for more info. HIP and ASM kernels are expected to follow the same policy wrt precision vs performance.
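As a generic illustration of how such options are passed (not MIOpen's actual build path), an OpenCL host program supplies them as the options string of clBuildProgram:

    /* Host-side OpenCL sketch, not MIOpen's build code: `program` and `device`
       are assumed to be a valid cl_program and cl_device_id. */
    const char *options = "-cl-fast-relaxed-math"; /* implies -cl-finite-math-only */
    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);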

Please also note that the result of MAX instead of INF can be interpreted as a difference of 1ULP, which is sufficient accuracy for convolutions ;)

So the test that checks for INF on output is questionable.

On the other hand, if we want to make the test pass, fixing the cast kernel (as triaged by @carlushuang) is the way to go. But please note that while this resolves the immediate problem, we cannot always guarantee the expected special-number result (INF) due to the above.

Regarding the implementation of the full-blown fix: git blame shows that the over_flow flag has been used in the kernels from the beginning, and that is all I found in the public repo. Therefore I highly recommend:

  • (1) Investigate the code (git blame, review comments, etc.) in the private repo, starting from https://github.com/AMDComputeLibraries/MLOpen/pull/1327
  • (2) If the investigation gives clear results (we want to know why this conversion of INF to MAX is performed), then develop and apply the correct fix.
  • (3) Otherwise (if we do not clearly know the reasons for the questionable conversion), I recommend skipping the clamp, but only when the kernel is used in BwdData and BwdWeights convolutions. That would require adding one more build-time or run-time parameter to the kernel (see the sketch after this list).
  • (4) A similar fix should be applied to the other casting kernels and the other convolution solvers that leverage them.
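For bullet (3), a rough sketch of how the build-time variant might look inside src/kernels/MIOpenSubTensorOpWithCastTensorKernel.cl. MIOPEN_CAST_CLAMP is a hypothetical macro (it does not exist in MIOpen today), and the host-side logic that would define it only for forward convolutions is omitted:

            _FLOAT_SRC temp_src = *(src + sindex + srcOffset);
#if MIOPEN_SRC_TYPE == 3 && MIOPEN_DST_TYPE == 4
            temp_src *= alpha;
            *(dst + dindex + dstOffset) = float_to_bfloat16(temp_src);
#elif defined(MIOPEN_CAST_CLAMP) && MIOPEN_CAST_CLAMP == 0
            /* Hypothetical BwdData/BwdWeights path: no clamp, so out-of-range
               fp32 values convert to +/-inf as the frameworks expect. */
            *(dst + dindex + dstOffset) = (_FLOAT_DST)(alpha * ((float)temp_src));
#else
            bool over_flow = (alpha * ((float)temp_src)) >= ((float)MAX_VAL);
            *(dst + dindex + dstOffset) =
                (_FLOAT_DST)(over_flow ? MAX_VAL : alpha * ((float)temp_src));
#endif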

@carlushuang Thanks for triaging the issue!


ekuznetsov139 commented Nov 1, 2023

I want to make two points.
First, frameworks rely on backward convolution kernels for fp16 (and likewise fp8) producing infs when computation results overflow. It is not an IEEE-754 compliance issue; it is a fundamental requirement.

Second, whatever the reasoning was for that clamp into finite range, we can probably agree that applying it only on the positive side can't possibly be correct.

From the frameworks perspective, it would be best if the cast kernel clamped into finite range for forward but not backward convolutions. If that is too hard, then it should not clamp at all.

@junliume

@JehandadKhan and @atamazov (CC: @carlushuang) we could change the logic and only "clamp" in the forward direction? Then we could let it be tested through a whole round of staging and try to identify any issues long before the next release.

@junliume junliume added this to the ROCm 6.1 milestone Nov 10, 2023

atamazov commented Nov 11, 2023

@ekuznetsov139 Thanks for the explanation of the frameworks' requirement.

@JehandadKhan @carlushuang I can work on this if no one else is doing it yet. The plan is to implement the 3rd and 4th bullets from #2496 (comment). It seems like we have little time. Q: shall we add a run-time or build-time parameter?

@junliume

we could change the logic and only "clamp" in the forward direction?

Sure, and this seems the best solution from the frameworks' POV.


atamazov commented Nov 17, 2023

@junliume @JehandadKhan The fix is implemented in #2538.

The fix is very conservative (minimal functional changes, because the release branching date is near). We should consider another option: remove the clamping controls (introduced in this PR) and make clamping permanent, but replace clamping to MAX with clamping to INF. This seems mathematically correct (and can't break #2496 again) and should slightly improve performance, but let's do that after release branching. A rough sketch of the idea is below.
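For reference, one way the permanent clamping-to-INF could look in the cast kernel. This is only a sketch of the idea, not code from #2538 or any existing change; it reuses the macro names from the kernel excerpt quoted earlier and saturates both signs:

            /* Sketch: saturate to +/-inf instead of +/-MAX_VAL, so overflowing
               fp32 values become fp16 infinities; in-range values are unchanged. */
            float scaled = alpha * ((float)temp_src);
            if(scaled >= (float)MAX_VAL)
                *(dst + dindex + dstOffset) = (_FLOAT_DST)INFINITY;
            else if(scaled <= -((float)MAX_VAL))
                *(dst + dindex + dstOffset) = (_FLOAT_DST)(-INFINITY);
            else
                *(dst + dindex + dstOffset) = (_FLOAT_DST)scaled;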

Therefore I recommend keeping this ticket open (after merging #2538) but lowering its urgency, and then closing it after implementing the clamping to INF.

@atamazov

@junliume @ekuznetsov139 The fix is merged into develop. If the problem is resolved, then I suggest lowering the urgency of this issue to "normal" and removing the milestone (see previous comment).

@junliume junliume removed this from the MIOpen v3.1.0 milestone Nov 30, 2023