When running verification of the resnet50 model with the fp16 quantization option, the verification fails #2156
Comments
Did you run into this issue when using ONNX Runtime to verify as well? In today's meeting we discussed that we believe this issue is specifically related to migraphx's CPU verifier. It would be helpful if you could get the additional data point from ONNX Runtime verification.
Yes, I've seen such failures on resnet50 FP16 verification. @stefankoncarevic, if you can run with real data it should match up with the required accuracy.
@umangyadav Regardless of what data @stefankoncarevic used for the run, the driver verify shouldn't fail. If the run passes with real data, then it is still a bug in the migraphx driver because it didn't report the correct result. The point is that a patch is needed so that migraphx driver verification can use an fp16 threshold even if the model produced an fp32 result.
Agree that
Can you give us that guarantee? That we should ignore the migraphx driver verify result regardless of what it reports?
I am pretty sure for resnet50 (based on my experience with UIF/MLPerf developments), but can't say for sure for other models. This issue specifically mentioned resnet50. If the driver verification is relied upon for other models or further development, then yes, it would be better to solve it. (I've also seen
Resnet50 is just one example. We desperately need a mechanism that gives us a deterministic result reflecting the correctness of a run. Whether this ticket or #2181 is the priority will be based on your and @causten's judgement. Pick either one to start with, so we have a way to routinely track it as a regression test. If you already knew resnet50 was returning incorrect results, why wasn't this fixed before, or at least a ticket filed to track it? It makes me concerned that these fixes require the mlir team to drive them.
Nope, I believe we have run this on multiple models and get mostly failing results with or without mlir enabled. Now you should fix this issue so we can do a second round of regression checks.
You should use the test data for the model with tools/test_runner.py to verify the results. It is much faster and more accurate than using driver verify.
@pfultz2 Thanks for the pointer. How hard would it be to automate this in our CI so it checks on a nightly basis? We can transition to this if it is the only way to verify the correctness of the results. @stefankoncarevic Could you file a ticket on the mlir side to track this? Also, could you follow up with @pfultz2 (Paul Fultz) on how to use
You just need the models and test data on the machine (usually these are packaged together). And then run
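For readers without tools/test_runner.py handy: the packaged test data follows the ONNX model zoo layout, i.e. a model.onnx file next to test_data_set_*/ directories of serialized TensorProto inputs and expected outputs. Below is a minimal sketch of that kind of check; it uses ONNX Runtime as a stand-in inference backend, and the directory path and tolerance values are assumptions for illustration, not what test_runner.py actually does.

```python
import glob

import numpy as np
import onnx
import onnxruntime as ort
from onnx import numpy_helper

# Assumed ONNX model zoo layout: model.onnx plus test_data_set_*/ directories.
model_dir = "/MIGraphXDeps/resnet50-v1-7"
sess = ort.InferenceSession(model_dir + "/model.onnx")
input_name = sess.get_inputs()[0].name

for case in sorted(glob.glob(model_dir + "/test_data_set_*")):
    # Each test case ships serialized TensorProto inputs and expected outputs.
    inputs = [numpy_helper.to_array(onnx.load_tensor(f))
              for f in sorted(glob.glob(case + "/input_*.pb"))]
    expected = [numpy_helper.to_array(onnx.load_tensor(f))
                for f in sorted(glob.glob(case + "/output_*.pb"))]
    actual = sess.run(None, {input_name: inputs[0]})
    for exp, act in zip(expected, actual):
        # Placeholder tolerances; fp16 runs need looser ones than fp32.
        ok = np.allclose(exp, act, rtol=1e-2, atol=1e-2)
        print(case, "PASS" if ok else "FAIL")
```

The real script runs the model through migraphx rather than ONNX Runtime, but the idea of comparing against the shipped test data instead of a driver-generated reference is the same.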
Do the models and test data exist on a mount point anywhere? Downloading them on the fly for CI machines would be inconvenient.
So bert squad fails when using
There is no mount point. There is a zip file somewhere (it was on rome6 and mi250, but those machines are not accessible or were wiped clean) with a bunch of models, and you can download the test data from the ONNX model zoo.
It can get pretty big, so you definitely want to store it on the node you are using (or mount a drive somehow).
@pfultz2 I'm going to ask @stefankoncarevic to follow up with you on this. I think this is going to be a slightly different conversation from this ticket, because whether or not this can be used in the mlir nightly CI isn't clear to me yet. And whether or not the
@causten Is there any way we can have the unzipped (zip) file placed on the NAS mount point? If not, I'd like either this or #2181 to be worked on as a priority, as it is straightforward to use them in the MLIR CI. Not having such a regression test will pose a significant risk to the ROCm 6.0 release deliverable.
@pfultz2 Where can I get the real data for the model? I changed the tolerance in accuracy_checker to 1, and after that a lot of models now pass for fp16 and int8, but I'm not sure it's the right value because it's a very large number compared with 1e-3.
@stefankoncarevic I don't think it is realistic for us to build an infrastructure for migraphx's model verification. That is out of scope for this release. Next release we can consider building something else that better isolates strictly the mlir components. For #2182 completion purposes, please feel free to bump up the accuracy tolerance to get a full pass. Since verification doesn't work even without MLIR, this is out of our hands. We can reduce the tolerance once it gets fixed in migraphx. For this ticket, skip the models that have a verification failure until migraphx fixes it.
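For context on the tolerance values being discussed, a scaled-RMS style check is sketched below. The exact formula in accuracy_checker.py may differ, so treat the metric and the within_tolerance helper as illustrative assumptions rather than the real implementation.

```python
import numpy as np

def within_tolerance(expected, actual, tol=1e-3):
    # Root-mean-square of the difference, scaled by the magnitude of the
    # expected output. tol=1e-3 is a typical fp32-style default, while tol=1
    # accepts almost any output, which is why it feels far too loose.
    expected = np.asarray(expected, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    rms = np.sqrt(np.mean((expected - actual) ** 2))
    scale = max(np.sqrt(np.mean(expected ** 2)), np.finfo(np.float64).tiny)
    return rms / scale < tol
```

Under a metric like this, the fp16 resnet50 output fails at 1e-3 while a tolerance of 1 passes essentially anything, which is why an fp16-aware value somewhere in between is being asked for.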
@TedThemistokleous @pfultz2 Correct me if I'm wrong, but I don't think the test runner PR should be linked to this one. This one is about the migraphx verifier, and it is orthogonal to the PR? Based on this understanding I'm re-opening it. Feel free to close again if I misunderstood.
We need to update the tolerances for fp16 and other data type calculations. Currently we can get around this by using #2213 and adding the driver verify option
@CharlieL7 does #2334 fix this issue?
Yes, should be fixed with #2334.
Description
After fp16 quantization of the ResNet50.onnx model, I got an error about a mismatch:
./bin/migraphx-driver verify --gpu --onnx /MIGraphXDeps/resnet50-v1-7.onnx --fp16
...
FAILED: /MIGraphXDeps/resnet50-v1-7.onnx
error: 0.00123406
Max diff: 0.0360341
Mismatch at 0: -0.624751 != -0.62793
fp16-ResNet50-migraphx.txt
I disabled mlir before that.
This blocks adding our CI tracking of fp16 and int8 language ONNX (quantization) models.
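For illustration, the mismatch reported above is roughly the size of error that fp16 rounding alone can introduce. The numpy sketch below uses the two values from the output; treating the failure as a thresholding problem (rather than reproducing the driver's exact error metric) is an assumption consistent with the discussion above.

```python
import numpy as np

expected = np.float32(-0.62793)    # reference value from the report above
actual = np.float32(-0.624751)     # fp16-quantized GPU value from the report above

abs_diff = abs(expected - actual)            # ~3.2e-3
rel_diff = abs_diff / abs(expected)          # ~5.1e-3
fp16_spacing = np.spacing(np.float16(0.62))  # ulp size near 0.62 in fp16, ~4.9e-4

# The mismatch is only a handful of fp16 ulps, yet it exceeds an fp32-style
# 1e-3 threshold, which is why an fp16-aware tolerance is being requested.
print(abs_diff, rel_diff, fp16_spacing, abs_diff < 1e-3)
```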