Verification of the resnet50 model fails with the fp16 quantization option #2156

Closed
stefankoncarevic opened this issue Sep 6, 2023 · 21 comments · Fixed by #2182

stefankoncarevic commented Sep 6, 2023

Description

After fp16 quantization of the ResNet50.onnx model, I get a mismatch error:

./bin/migraphx-driver verify --gpu --onnx /MIGraphXDeps/resnet50-v1-7.onnx --fp16

...
FAILED: /MIGraphXDeps/resnet50-v1-7.onnx
error: 0.00123406
Max diff: 0.0360341
Mismatch at 0: -0.624751 != -0.62793

fp16-ResNet50-migraphx.txt

MLIR was disabled before this run.
This blocks adding CI tracking for fp16 and int8 quantized ONNX models.
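
For illustration, a minimal numpy sketch (not MIGraphX's verifier code) of why a threshold derived from fp32 machine epsilon flags outputs that only differ by fp16 rounding; the tolerance factor of 80 comes from the discussion later in this thread:

```python
# Illustrative only: compare an fp16 round-trip of an fp32 reference against
# thresholds derived from each type's machine epsilon.
import numpy as np

ref = np.random.randn(1000).astype(np.float32)    # stand-in fp32 reference output
out = ref.astype(np.float16).astype(np.float32)   # same values after fp16 rounding

rel_err = np.abs(out - ref) / np.maximum(np.abs(ref), 1e-12)

tolerance = 80                                       # default factor discussed below
fp32_thresh = np.finfo(np.float32).eps * tolerance   # ~9.5e-6
fp16_thresh = np.finfo(np.float16).eps * tolerance   # ~7.8e-2

print(rel_err.max())                  # typically a few 1e-4, on the order of fp16 epsilon
print((rel_err > fp32_thresh).any())  # True: flagged by an fp32-scale threshold
print((rel_err > fp16_thresh).any())  # False: passes a type-aware fp16 threshold
```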

stefankoncarevic (Author)

@jerryyin

jerryyin (Member) commented Sep 6, 2023

Did you run into this issue when using ONNX Runtime to verify as well?

In today's meeting discussion, we concluded that this issue is specifically related to MIGraphX's CPU verifier. It would be desirable if you could get the additional data point from ONNX Runtime verification.

umangyadav (Member)

It would be desirable if you could get the additional data point from ONNX Runtime verification.

Yes, I've seen such failures on ResNet50 fp16 verification. @stefankoncarevic, if you run with real data it should meet the required accuracy.

jerryyin (Member)

@umangyadav Regardless of what data @stefankoncarevic ran with, the driver verify shouldn't fail. If the run passes with real data, then it is still a bug in the migraphx driver because it didn't report the correct result.

The point is that a patch is needed so that migraphx driver verification uses an fp16 threshold even when the model produces an fp32 result.

umangyadav (Member)

I agree that migraphx-driver requires a change to its threshold logic. My point was that even though migraphx-driver reports an accuracy failure, it shouldn't affect accuracy in real use cases and therefore shouldn't be a blocker.

jerryyin (Member)

it shouldn't affect accuracy in real use cases and therefore shouldn't be a blocker.

Can you give us that guarantee? That we should ignore the migraphx driver verify result regardless of what it reports?

umangyadav (Member) commented Sep 13, 2023

Can you give us that guarantee?

I am pretty sure for resnet50 (based on my experience with UIF/MLPerf development), but I can't say for sure for other models. This issue specifically mentioned resnet50.

If the driver verification is relied upon for other models or further development, then yes, it would be better to solve it. (I've also seen migraphx-driver failing with BERT/GPT but passing when using real data, but I am less confident about those.)

jerryyin (Member)

This issue specifically mentioned resnet50.

Resnet50 is just one example. We desperately need a mechanism that gives us a deterministic result reflecting the correctness of a run. Whether this ticket or #2181 is the priority is up to your and @causten's judgement. Pick either one to start with, so we have a way to routinely track this as a regression test.

If you already knew resnet50 was returning an incorrect result, why wasn't this fixed before, or at least tracked with a ticket? It concerns me that these fixes require the MLIR team to drive them.

If the driver verification is relied upon for other models or further development

Nope — I believe we have run this on multiple models and gotten mostly failing results, with or without MLIR enabled. You should fix this issue so we can do a second round of regression checks.

pfultz2 (Collaborator) commented Sep 13, 2023

We desperately need a mechanism that gives us a deterministic result reflecting the correctness of a run.

You should use the model's test data with tools/test_runner.py to verify the results. It's much faster and more accurate than using driver verify.

jerryyin (Member) commented Sep 13, 2023

@pfultz2 Thanks for the pointer. How hard would it be to automate this in our CI so it runs on a nightly basis? We can transition to this if it is the only way to verify correctness.

@stefankoncarevic Could you file a ticket on the MLIR side to track this? Also, could you follow up with @pfultz2 (Paul Fultz) on how to use test_runner.py instead?

pfultz2 (Collaborator) commented Sep 13, 2023

How hard would it be to automate this in our CI so it runs on a nightly basis?

You just need the models and test data on the machine (usually these are packaged together), and then run python3 tools/test_runner.py <path-to-model-and-data-directory>.
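
As an illustration of how this could be wired into a nightly job, here is a hypothetical wrapper. The /data/onnx-models path, the directory-layout check, and the assumption that test_runner.py exits non-zero on a verification failure are all assumptions, not part of MIGraphX:

```python
#!/usr/bin/env python3
# Hypothetical nightly wrapper around tools/test_runner.py.
# Assumes the ONNX model zoo packaging: each model directory holds a *.onnx
# file plus one or more test_data_set_*/ folders, and that test_runner.py
# exits non-zero when verification fails.
import subprocess
import sys
from pathlib import Path

MODELS_ROOT = Path("/data/onnx-models")  # assumed local store of models + test data

failures = []
for model_dir in sorted(p for p in MODELS_ROOT.iterdir() if p.is_dir()):
    if not (any(model_dir.glob("*.onnx")) and any(model_dir.glob("test_data_set_*"))):
        continue  # skip directories not packaged with test data
    result = subprocess.run(["python3", "tools/test_runner.py", str(model_dir)])
    if result.returncode != 0:
        failures.append(model_dir.name)

print("failed models:", ", ".join(failures) if failures else "none")
sys.exit(1 if failures else 0)
```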

jerryyin (Member)

You just need the models and test data on the machine (usually these are packaged together).

Are the models and test data available on a mount point anywhere? Downloading them on the fly for CI machines would be inconvenient.

pfultz2 (Collaborator) commented Sep 13, 2023

I've also seen migraphx-driver failing with BERT/GPT but passing when using real data, but I am less confident about those.

So BERT-SQuAD fails when using --fp16 (-8.23 vs -7.7425637). Setting the tolerance to 1 (instead of 1e-3) makes it pass.
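
Quick arithmetic on those two values, assuming a simple relative-error comparison:

```python
# -8.23 (fp16 run) vs -7.7425637 (reference): about 6% relative error,
# which fails a 1e-3 tolerance but passes a tolerance of 1.
a, b = -8.23, -7.7425637
print(abs(a - b) / abs(b))  # ~0.063
```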

pfultz2 (Collaborator) commented Sep 13, 2023

Are the models and test data available on a mount point anywhere?

There is no mount point. There is a zip file somewhere (it was on rome6 and mi250, but those machines are not accessible or were wiped clean) with a bunch of models, and you can download the test data from the ONNX model zoo.

Downloading them on the fly for CI machines would be inconvenient.

It can get pretty big, so you definitely want to store it on the node you are using (or mount a drive somehow).

jerryyin (Member)

@pfultz2 I'm going to ask @stefankoncarevic to follow up with you on this. I think this will be a slightly different conversation from this ticket, because whether this can be used in the MLIR nightly CI isn't clear to me yet, and whether the test_runner mechanism will work is an orthogonal topic to this ticket and #2181.

@causten Is there any way we can have the unzipped contents of that zip file placed on the NAS mount point?

If not, I'd like either this or #2181 to be worked on as a priority, since they are straightforward to use in the MLIR CI. Not having such a regression test poses a significant risk to the ROCm 6.0 release deliverables.

stefankoncarevic (Author) commented Sep 14, 2023

@pfultz2 Where can I get the real data for the models?
This is important to me because I want to run test_runner.py and see the output for the different models.

I changed the tolerance in accuracy_checker to 1, and after that a lot of models now pass for fp16 and int8, but I'm not sure it's the right value because it's a very large number compared with 1e-3.
For migraphx-driver the threshold is much smaller, around 4e-6 for fp32: double threshold = epsilon() * tolerance, where tolerance is 80 by default and epsilon, I think, depends on the type (fp32, fp16, or int8)?
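
As a back-of-the-envelope check of that formula (assuming epsilon() is the machine epsilon of the compute type), the thresholds work out roughly as below; note that the fp16 value lands near the --rms-tol 8e-2 suggested later in this thread:

```python
# threshold = epsilon() * tolerance, with the default tolerance of 80
# (assuming epsilon() is the machine epsilon of the compute type).
import numpy as np

tolerance = 80
for name, dtype in [("fp32", np.float32), ("fp16", np.float16)]:
    eps = float(np.finfo(dtype).eps)
    print(f"{name}: epsilon = {eps:.3g}, threshold = {eps * tolerance:.3g}")

# fp32: epsilon ~ 1.19e-07, threshold ~ 9.5e-06
# fp16: epsilon ~ 9.77e-04, threshold ~ 7.8e-02 (close to --rms-tol 8e-2)
```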

jerryyin (Member)

@stefankoncarevic I don't think it is realistic for us to build infrastructure for MIGraphX's model verification. That is out of scope for this release. Next release we can consider building something that better isolates strictly the MLIR components.

For the purpose of completing #2182, feel free to bump up the accuracy tolerance to get a full pass. Since verification doesn't work even without MLIR, this is out of our hands. We can reduce the tolerance once it is fixed in MIGraphX.

For this ticket, skip the models that have a verification failure until MIGraphX fixes it.

jerryyin (Member)

@TedThemistokleous @pfultz2 Correct me if I'm wrong, but I don't think the test runner PR should be linked to this issue. This issue is about the MIGraphX verifier and is orthogonal to that PR.

Based on this understanding I'm re-opening it. Feel free to close again if I misunderstood.

CharlieL7 (Collaborator)

We need to update the tolerances for fp16 and other data type calculations. Currently we can get around this by using #2213 and adding the driver verify option --rms-tol 8e-2.

umangyadav (Member)

@CharlieL7 does #2334 fix this issue?

CharlieL7 (Collaborator)

Yes, should be fixed with #2334.
