Bugfix for SimplifiedLayerNormalization #12975
Conversation
// If it's not a GPU EP: the CPU implementation of SimplifiedLayerNormalization doesn't support
// the input and scale having different types for now, and it may also conflict with
// InsertCastTransformer, so the sub-graph will not be fused if it contains a Cast op.
bool is_gpu_ep = pow_node.GetExecutionProviderType() == kCudaExecutionProvider ||
Can you explain this change a bit more? Based on the comment I would have expected the code to look for a Cast and exit if the CPU EP was involved.
I don't quite understand why we change the first branch, which seems to be about setting `has_leading_cast`, and also a second location which has nothing to do with `has_leading_cast`.
There are only 4 possible cases (where x = Pow->ReduceMean->Add->Sqrt->Div and y = Mul):
(1) cast(to:float)->x->cast(to:fp16)->y : SimplifiedLayerNorm(T:fp16, V:fp16)
(2) cast(to:float)->x->y : SimplifiedLayerNorm(T:fp16, V:float)
(3) x->cast(to:fp16)->y : SimplifiedLayerNorm(T:float, V:fp16)
(4) x->y : SimplifiedLayerNorm(T:float, V:float)
All four work for the CUDA EP.
For the CPU EP we have only SimplifiedLayerNorm(T:float, V:float), so only (4) works directly. But for (1) and (2), if we treat the entry Cast as a normal node (i.e. `has_leading_cast` is always false), then (2) can still be fused into "cast(to:float)->SimplifiedLayerNorm(T:float, V:float)", just like applying (4) to the x->y part after the cast. So the rule for the CPU EP is: always set `has_leading_cast` to false and check whether there is a Cast between x and y; if there is, do not fuse.
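Below is a minimal standalone sketch of that decision rule. It is not the actual transformer code; `CanFuse` and its boolean parameters are hypothetical names used only to illustrate the four cases:

```cpp
#include <iostream>

// Hypothetical model of the fusion decision described above.
// is_gpu_ep:         the matched sub-graph is assigned to the CUDA/ROCm EP.
// has_leading_cast:  a cast(to:float) precedes x, as in patterns (1) and (2).
// has_internal_cast: a cast(to:fp16) sits between x and y, as in patterns (1) and (3).
bool CanFuse(bool is_gpu_ep, bool has_leading_cast, bool has_internal_cast) {
  if (is_gpu_ep) {
    // The CUDA EP implements all four (T, V) combinations, so every pattern can be fused,
    // and the leading Cast (if any) can be absorbed into the fused node.
    return true;
  }
  // The CPU EP only implements SimplifiedLayerNorm(T:float, V:float): treat the leading
  // Cast as an ordinary node and refuse to fuse when a Cast sits between x and y.
  (void)has_leading_cast;
  return !has_internal_cast;
}

int main() {
  // The four patterns from the discussion, evaluated for the CPU EP.
  std::cout << "CPU (1): " << CanFuse(false, true, true) << "\n";    // 0: not fused
  std::cout << "CPU (2): " << CanFuse(false, true, false) << "\n";   // 1: fused after the entry cast
  std::cout << "CPU (3): " << CanFuse(false, false, true) << "\n";   // 0: not fused
  std::cout << "CPU (4): " << CanFuse(false, false, false) << "\n";  // 1: fused
  return 0;
}
```

In this sketch the CPU path both ignores the leading Cast and checks for an internal Cast, which mirrors why the actual change touches two separate locations.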
Would be great to put this excellent explanation in the comment so it's captured for the next person who works on the code.
This PR is to fix #12930 and #12579.
In detail:
- For the CPU EP, since the current implementation of SimplifiedLayerNormalization doesn't support the input and scale having different data types, the sub-graph will not be fused if it contains a Cast op; this guarantees that the input and output data types are the same.
- For the CUDA EP, add (fp16, float) support to the (T, V) type constraints so that all combinations of fp16 and float are supported in the implementation.

With the fix, the original model can be run with SimplifiedLayerNormalization, which also helps to improve the performance.
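For reference, here is a minimal sketch of how one might verify the fusion, assuming a placeholder model path `model.onnx` and that this fusion runs at the extended graph optimization level; it uses the ONNX Runtime C++ API only to dump the optimized graph so the presence of a SimplifiedLayerNormalization node can be checked (e.g. in Netron):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "fusion_check");
  Ort::SessionOptions so;
  // Assumption: the SimplifiedLayerNormalization fusion is part of the extended optimizations.
  so.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
  // Serialize the optimized graph to disk so the fused node can be inspected.
  so.SetOptimizedModelFilePath(ORT_TSTR("model_optimized.onnx"));
  // To exercise the CUDA EP cases, the CUDA execution provider would also be appended here.
  Ort::Session session(env, ORT_TSTR("model.onnx"), so);  // "model.onnx" is a placeholder path.
  return 0;
}
```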