[Fp16] MIOpen integration or layout transpose issue with FP16_INF and FP16_MAX #2496

Open
junliume opened this issue Nov 1, 2023 · 8 comments



junliume commented Nov 1, 2023

[Observations]
On the gfx90a platform, with the attached input file input.txt, run:

# /opt/rocm/bin/MIOpenDriver convfp16 -n 5 -c 1 -H 3 -W 3 -k 3 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0 --dump_output 1 --in_data input.txt.bin --dout_data dout.txt.bin
MIOpenDriver convfp16 -n 5 -c 1 -H 3 -W 3 -k 3 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0 --dump_output 1 --in_data input.txt.bin --dout_data dout.txt.bin
Read data from input file input.txt.bin
Could not open file dout.txt.bin for reading
PRNG seed: 12345678
Wrote output to file dump_in.bin
Wrote output to file dump_dout.bin
MIOpen Backward Weights Conv. Algorithm: 5, Solution: 110/ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC
GPU Kernel Time Backward Weights Conv. Elapsed: 0.024996 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 5, 1, 3, 3, 1, 1, 3,  270, 0, 0, 0, 0, 0.024996
Wrote output to file dump_bwd_dwei_gpu.bin

The iGEMM kernel gave the output:

[65504. 65504. 65504.]

But the expected output (from direct convolution) should be:

[inf, inf, inf]

[Analysis]

  • 65504 is FP16_MAX
  • It has been determined that the iGEMM kernel actually generates INF
  • @JehandadKhan @atamazov It is now suspected that the INF is reinterpreted as MAX during the data-layout transpose or MIOpen integration (see the illustrative sketch below)

@carlushuang could you help comment on this issue in case anything is missing? Thanks!
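For illustration, here is a minimal standalone sketch (not MIOpen code; it assumes a compiler with the C _Float16 extension, e.g. a recent clang) contrasting a plain fp32-to-fp16 conversion with a saturating conversion that clamps at FP16_MAX. The clamped path reproduces the 65504 result, while the plain conversion yields inf:

    /* Illustration only, not MIOpen code. Requires _Float16 support. */
    #include <stdio.h>

    int main(void)
    {
        const float max_val = 65504.0f;  /* FP16_MAX */
        float acc = 3.0f * 65504.0f;     /* fp32 value that overflows fp16 */

        /* Saturating cast: clamp to FP16_MAX before converting. */
        _Float16 clamped = (_Float16)(acc >= max_val ? max_val : acc);
        /* Plain cast: IEEE round-to-nearest overflows to +inf. */
        _Float16 plain = (_Float16)acc;

        printf("clamped = %f\n", (double)clamped); /* 65504.000000 */
        printf("plain   = %f\n", (double)plain);   /* inf */
        return 0;
    }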

@carlushuang

The hotfix is here:
#2497


carlushuang commented Nov 1, 2023

The root cause: in this case the wrw kernel uses atomics, so the solver uses the CastTensor() function to cast from fp32 to fp16.
However, in the source code of the cast kernel, src/kernels/MIOpenSubTensorOpWithCastTensorKernel.cl, the value is clamped to MAX:

            _FLOAT_SRC temp_src = *(src + sindex + srcOffset);
#if MIOPEN_SRC_TYPE == 3 && MIOPEN_DST_TYPE == 4
            /* bfloat16 destination: converted without clamping */
            temp_src *= alpha;
            *(dst + dindex + dstOffset) = float_to_bfloat16(temp_src);
#else
            /* Other destinations: values at or above MAX_VAL are clamped to
               MAX_VAL, so an fp32 result that overflows fp16 is written as
               65504 instead of inf. */
            bool over_flow = (alpha * ((float)temp_src)) >= ((float)MAX_VAL);
            *(dst + dindex + dstOffset) =
                (_FLOAT_DST)(over_flow ? MAX_VAL : alpha * ((float)temp_src));
#endif
        }

@junliume I think someone needs to evaluate why the clamp to MAX is needed here.


atamazov commented Nov 1, 2023

@junliume @JehandadKhan @carlushuang First of all, it should be noted that the convolution code we generate is focused on performance, not IEEE-754 correctness. Convolutions should provide maximum performance, provided that the accuracy is sufficient for the neural networks to function correctly.

That is why the OpenCL kernels (for example) are compiled with the -cl-fast-relaxed-math option, which neither guarantees high precision for many operations nor provides correct handling of special numbers. In particular, this option enables -cl-finite-math-only, whose meaning is obvious. See https://man.opencl.org/clBuildProgram.html for more info. HIP and ASM kernels are expected to follow the same policy wrt precision vs performance.
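As a generic illustration of how such options are passed (not MIOpen's actual build path), an OpenCL host program supplies them as the options string of clBuildProgram:

    /* Host-side OpenCL sketch, not MIOpen's build code: `program` and `device`
       are assumed to be a valid cl_program and cl_device_id. */
    const char *options = "-cl-fast-relaxed-math"; /* implies -cl-finite-math-only */
    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);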

Please also note that the result of MAX instead of INF can be interpreted as a difference of 1ULP, which is sufficient accuracy for convolutions ;)

So the test that checks for INF on output is questionable.

On the other hand, if we want to make the test pass, fixing the cast kernel (as triaged by @carlushuang) is the way to go. But please note that while this resolves the immediate problem, we cannot always guarantee the expected special-number result (INF) due to the above.

Regarding the implementation of the full-blown fix: git blame shows that the over_flow flag has been used in the kernels from the beginning, and that is all I found in the public repo. Therefore I highly recommend:

  • (1) Investigate the code (git blame, review comments, etc.) in the private repo, starting from https://github.com/AMDComputeLibraries/MLOpen/pull/1327
  • (2) If the investigation gives clear results (we want to know why this conversion of INF to MAX is performed), then develop and apply the correct fix.
  • (3) Otherwise (if we do not clearly know the reasons for the questionable conversion), I recommend skipping the clamp, but only when the kernel is used in BwdData and BwdWeights convolutions. That would require adding one more build-time or run-time parameter to the kernel (see the sketch after this list).
  • (4) A similar fix should be applied to the other casting kernels and the other convolution solvers that leverage them.
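For bullet (3), a rough sketch of how the build-time variant might look inside src/kernels/MIOpenSubTensorOpWithCastTensorKernel.cl. MIOPEN_CAST_CLAMP is a hypothetical macro (it does not exist in MIOpen today), and the host-side logic that would define it only for forward convolutions is omitted:

            _FLOAT_SRC temp_src = *(src + sindex + srcOffset);
#if MIOPEN_SRC_TYPE == 3 && MIOPEN_DST_TYPE == 4
            temp_src *= alpha;
            *(dst + dindex + dstOffset) = float_to_bfloat16(temp_src);
#elif defined(MIOPEN_CAST_CLAMP) && MIOPEN_CAST_CLAMP == 0
            /* Hypothetical BwdData/BwdWeights path: no clamp, so out-of-range
               fp32 values convert to +/-inf as the frameworks expect. */
            *(dst + dindex + dstOffset) = (_FLOAT_DST)(alpha * ((float)temp_src));
#else
            bool over_flow = (alpha * ((float)temp_src)) >= ((float)MAX_VAL);
            *(dst + dindex + dstOffset) =
                (_FLOAT_DST)(over_flow ? MAX_VAL : alpha * ((float)temp_src));
#endif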

@carlushuang Thanks for triaging the issue!


ekuznetsov139 commented Nov 1, 2023

I want to make two points.
First, frameworks rely on backward convolution kernels for fp16 (and likewise fp8) producing infs when computation results overflow. It is not an IEEE-754 compliance issue; it is a fundamental requirement.

Second, whatever the reasoning was for that clamp into finite range, we can probably agree that applying it only on the positive side can't possibly be correct.

From the frameworks perspective, it would be best if the cast kernel clamped into finite range for forward but not backward convolutions. If that is too hard, then it should not clamp at all.

@junliume

@JehandadKhan and @atamazov (CC: @carlushuang) we could change the logic and only "clamp" in the forward direction? Then we could let it be tested through a whole round of staging and try to identify any issues long before the next release.

@junliume junliume added this to the ROCm 6.1 milestone Nov 10, 2023

atamazov commented Nov 11, 2023

@ekuznetsov139 Thanks for the explanation of the frameworks' requirement.

@JehandadKhan @carlushuang I can work on this if no one else is doing it yet. The plan is to implement the 3rd and 4th bullets from #2496 (comment). It seems like we have little time. Q: shall we add a run-time or build-time parameter?

@junliume

we could change the logic and only "clamp" in the forward direction?

Sure, and this seems the best solution from the frameworks' POV.


atamazov commented Nov 17, 2023

@junliume @JehandadKhan The fix is implemented in #2538.

The fix is very conservative (minimal functional changes, because the release branching date is near). We should consider another option: remove the clamping controls (introduced in this PR) and make clamping permanent, but replace clamping to MAX with clamping to INF. This seems mathematically correct (and can't break #2496 again) and should slightly improve performance, but let's do that after release branching. A rough sketch of the idea is below.
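For reference, one way the permanent clamping-to-INF could look in the cast kernel. This is only a sketch of the idea, not code from #2538 or any existing change; it reuses the macro names from the kernel excerpt quoted earlier and saturates both signs:

            /* Sketch: saturate to +/-inf instead of +/-MAX_VAL, so overflowing
               fp32 values become fp16 infinities; in-range values are unchanged. */
            float scaled = alpha * ((float)temp_src);
            if(scaled >= (float)MAX_VAL)
                *(dst + dindex + dstOffset) = (_FLOAT_DST)INFINITY;
            else if(scaled <= -((float)MAX_VAL))
                *(dst + dindex + dstOffset) = (_FLOAT_DST)(-INFINITY);
            else
                *(dst + dindex + dstOffset) = (_FLOAT_DST)scaled;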

Therefore I recommend keeping this ticket open (after merging #2538) but lowering its urgency, and then closing it after implementing the clamping to INF.

@atamazov

@junliume @ekuznetsov139 The fix is merged into develop. If the problem is resolved, then I suggest lowering the urgency of this issue to "normal" and removing the milestone (see previous comment).

@junliume junliume removed this from the MIOpen v3.1.0 milestone Nov 30, 2023