
Add onnx mod operator #1302

Merged: 25 commits merged into develop from add_onnx_mod_operator on Jul 25, 2022
Conversation

@TedThemistokleous (Collaborator) commented Jul 12, 2022

Adds the ONNX Mod operator and handles both cases of the fmod attribute (0 and 1). Solves part of the issue found in #1283.

Initial operator for the mod implementation and test cases for integer and floating-point types.

Need to use fmod from the standard library for floating-point types. Thankfully, half_float::half is specced to use the existing std::fmod() call, judging by the half.hpp implementation.

fmod_flag should mirror the ONNX fmod attribute. Right now, using a floating-point type without setting it to true on the user side will result in an exception.

Ref ticket #1283

Double, float and half use their own typename specification to achieve this; otherwise we rely on the % operator to get the integer remainder while preserving the sign of the dividend in the result.
Since ONNX's Mod operation changes behavior based on whether the fmod flag is set, functionality is now split to mirror Python's fmod() functionality.

For the integer mod case, I had to use a composition of std::fmod() so that floating and integral types are all handled while also preserving sign, identical to Python's numpy.mod() case.
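
For intuition, here is a minimal Python sketch of that composition (an illustration only, not the MIGraphX C++ code): a numpy-style mod can be built from fmod by folding the remainder back toward the divisor's sign whenever the two signs disagree.

```python
import math

def mod_like_numpy(x, y):
    """Sketch: remainder that takes the sign of the divisor, as numpy.mod() does."""
    r = math.fmod(x, y)                # C-style remainder, sign follows the dividend
    if r != 0 and (r < 0) != (y < 0):  # signs disagree: fold back into the divisor's range
        r += y
    return r

# Matches the np.mod values used in the tests quoted later in this thread
assert mod_like_numpy(-4, 2) == 0
assert mod_like_numpy(7, -3) == -2
assert mod_like_numpy(-7, 3) == 2
```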
@TedThemistokleous marked this pull request as ready for review July 12, 2022 17:13
@codecov (bot) commented Jul 12, 2022

Codecov Report

Merging #1302 (6b2710a) into develop (8a30d69) will increase coverage by 0.00%.
The diff coverage is 96.15%.

❗ Current head 6b2710a differs from pull request most recent head 9d3c817. Consider uploading reports for the commit 9d3c817 to get more accurate results

@@           Coverage Diff            @@
##           develop    #1302   +/-   ##
========================================
  Coverage    93.08%   93.09%           
========================================
  Files          439      442    +3     
  Lines        14486    14512   +26     
========================================
+ Hits         13485    13510   +25     
- Misses        1001     1002    +1     
| Impacted Files | Coverage Δ |
|---|---|
| src/onnx/include/migraphx/onnx/onnx_parser.hpp | 100.00% <ø> (ø) |
| src/include/migraphx/op/fmod.hpp | 85.71% <85.71%> (ø) |
| src/include/migraphx/op/mod.hpp | 100.00% <100.00%> (ø) |
| src/onnx/onnx_parser.cpp | 94.52% <100.00%> (+0.07%) ⬆️ |
| src/onnx/parse_mod.cpp | 100.00% <100.00%> (ø) |


@kahmed10 (Collaborator) commented:

Make sure to add an onnx parser test as well. That will require modifying test/onnx/gen_onnx.py, test/onnx/onnx_test.cpp, and test/py/onnx_backend_test.py. You can see an example of how to do this in #1177.
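
For context, a gen_onnx.py test case for a node like this generally looks something like the sketch below (shapes and test names are illustrative, not necessarily the ones added in this PR); onnx_test.cpp then checks the parsed program and onnx_backend_test.py enables the backend tests:

```python
import onnx
from onnx import helper, TensorProto

def mod_test():
    # Illustrative only: a single Mod node with the fmod attribute set to 0
    x = helper.make_tensor_value_info("0", TensorProto.INT32, [3, 3, 3])
    y = helper.make_tensor_value_info("1", TensorProto.INT32, [3, 3, 3])
    z = helper.make_tensor_value_info("2", TensorProto.INT32, [3, 3, 3])
    node = helper.make_node("Mod", inputs=["0", "1"], outputs=["2"], fmod=0)
    graph = helper.make_graph([node], "mod_test", [x, y], [z])
    return helper.make_model(graph)

onnx.save(mod_test(), "mod_test.onnx")
```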

@TedThemistokleous (Collaborator, Author) replied:

> Make sure to add an onnx parser test as well. That will require modifying test/onnx/gen_onnx.py, test/onnx/onnx_test.cpp, and test/py/onnx_backend_test.py. You can see an example of how to do this in #1177.

Oh awesome. Thanks!

Verifies that we get fmod or mod based on the fmod flag found when parsing the ONNX operator. Added additional tests for data types and the half type.

The half datatype currently uses std::fmod (defined in half.hpp), and in case things change down the road we'll have some test coverage should std::fmod no longer be used.
Put this in so that it's clear in MIGraphX that this binary operation is not commutative. Looking at other binary ops like min/max, mul, etc., this is defined as true. My intention here is to be explicit if this is used for any checks and passes.
Needed to recalculate this and make sure what I'm getting is valid on output.
Fixed the mod operator in MIGraphX to handle negative values correctly. Since ONNX produces mod similar to the Python equivalent of mod(), this test was used to verify that negative inputs produce positive results rather than a plain remainder.
…ed in mod protobuf

Verify the result from the protobuf is correct when using mixed datatypes. The goal here is to verify we cast up correctly from int32 and our results are sane.
@TedThemistokleous (Collaborator, Author) commented:

Reused test values found in the onnx library for the simple mod operator that was brought up in issue #1283 by Paul. Looks like my previous changes broke functionality of the operator when using abs() instead of remainder() for this one.

Comes down to how this handles negatives. Let me know if you need me to calculate some other values than the repeated example.

Primarily I used the known set repeated in the larger 3x3x3 shape in test_verify_onnx.

x = np.array([-4, 7, 5, 4, -7, 8]).astype(np.int64)
y = np.array([2, -3, 8, -2, 3, 5]).astype(np.int64)
z = np.mod(x, y)  # expected output [ 0, -2,  5,  0,  2,  3]
expect(node, inputs=[x, y], outputs=[z],
       name='test_mod_mixed_sign_int64')

and a smaller subset when testing the ref operator

x = [3, 8.5, -7]
np.mod(x, 3)
# array([0. , 2.5, 2. ]) 

Found here: https://www.skytowner.com/explore/difference_between_the_methods_mod_and_fmod_in_numpy#:~:text=Both%20mod(~)%20and%20fmod,capable%20of%20parsing%20floating%20numbers.

This should be the expected behavior here, and both sets of test cases should now work.
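
To make the mod/fmod distinction concrete, a small numpy illustration using the same values as above (np.mod follows the divisor's sign, np.fmod the dividend's, matching fmod=0 vs fmod=1 on the ONNX attribute):

```python
import numpy as np

x = np.array([3.0, 8.5, -7.0])

print(np.mod(x, 3))   # [0.  2.5 2. ]   -> sign follows the divisor
print(np.fmod(x, 3))  # [ 0.   2.5 -1. ] -> sign follows the dividend, like C fmod()
```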

@TedThemistokleous marked this pull request as draft July 15, 2022 16:43
@TedThemistokleous (Collaborator, Author) commented:

Converting to draft. Need to sort out CI issues and the GPU-related implementation that's breaking our builds.

@TedThemistokleous (Collaborator, Author) commented:

@kahmed10 even after pulling out the GPU-related test_mod.cpp and test_fmod.cpp, builds were breaking.

Pulled out the lines from onnx_backend_test.py for test_mod and test_fmod, which seems to have fixed the builds. I was still getting issues along the lines of "can't find migraphx::mod". Is there somewhere else we should be defining our MIGraphX operators outside the CMakeFile? Is there somewhere else I should be registering these?

@TedThemistokleous (Collaborator, Author) commented:

Fixed the issue I had with CI regarding the check for inputs. Realized adding a second thrown exception in parse_mod was redundant, since check_shapes(), as part of binary<>, handles the number of allowed input args for both mod operators.

umangyadav and others added 2 commits July 22, 2022 11:25
Need this to be handled, as float input with fmod=0 is actually invalid for the ONNX operator and should be flagged as an error on parse.
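
Sketched with the same onnx.helper pattern as earlier (the exact test name in gen_onnx.py may differ): the invalid combination is simply float inputs with fmod left at its default of 0, which the parser is now expected to reject.

```python
from onnx import helper, TensorProto

# fmod defaults to 0, which is invalid for floating-point inputs per the ONNX spec,
# so parsing a model like this should raise an error in MIGraphX.
x = helper.make_tensor_value_info("0", TensorProto.FLOAT, [3, 3, 3])
y = helper.make_tensor_value_info("1", TensorProto.FLOAT, [3, 3, 3])
z = helper.make_tensor_value_info("2", TensorProto.FLOAT, [3, 3, 3])
node = helper.make_node("Mod", inputs=["0", "1"], outputs=["2"])  # no fmod attribute
graph = helper.make_graph([node], "mod_test_fmod_default_float", [x, y], [z])
model = helper.make_model(graph)
```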
@migraphx-bot (Collaborator) commented:

| Test | Rate new (637d1a) | Rate old (637d1a) | Diff |
|---|---|---|---|
| torchvision-resnet50 | 2,223.64 | 2,224.34 | ✅ -0.03% |
| torchvision-resnet50_fp16 | 4,748.16 | 4,742.04 | ✅ 0.13% |
| torchvision-alexnet | 4,972.46 | 4,970.56 | ✅ 0.04% |
| torchvision-alexnet_fp16 | 26,205.88 | 26,188.29 | ✅ 0.07% |
| torchvision-densenet121 | 1,630.21 | 1,636.88 | ✅ -0.41% |
| torchvision-densenet121_fp16 | 2,527.37 | 2,521.69 | ✅ 0.23% |
| torchvision-inceptionv3 | 1,092.56 | 1,096.17 | ✅ -0.33% |
| torchvision-inceptionv3_fp16 | 1,993.25 | 1,975.74 | ✅ 0.89% |
| torchvision-vgg16 | 895.70 | 894.21 | ✅ 0.17% |
| torchvision-vgg16_fp16 | 1,727.67 | 1,723.51 | ✅ 0.24% |
| cadene-inceptionv4 | 528.29 | 527.51 | ✅ 0.15% |
| cadene-resnext64x4 | 577.71 | 577.09 | ✅ 0.11% |
| slim-mobilenet | 6,401.86 | 6,391.56 | ✅ 0.16% |
| slim-nasnetalarge | 203.36 | 203.19 | ✅ 0.08% |
| slim-resnet50v2 | 2,429.55 | 2,430.20 | ✅ -0.03% |
| bert-mrpc-onnx | 639.85 | 639.40 | ✅ 0.07% |
| bert-mrpc-tf | 296.78 | 296.21 | ✅ 0.19% |
| pytorch-examples-wlang-gru | 229.23 | 229.94 | ✅ -0.31% |
| pytorch-examples-wlang-lstm | 306.28 | 306.36 | ✅ -0.03% |
| torchvision-resnet50_1 | 517.19 | 514.95 | ✅ 0.44% |
| torchvision-inceptionv3_1 | 303.57 | 302.50 | ✅ 0.35% |
| torchvision-vgg16_1 | 464.15 | 463.04 | ✅ 0.24% |
| cadene-dpn92_1 | 297.75 | 306.11 | ✅ -2.73% |
| cadene-resnext101_1 | 236.29 | 229.11 | ✅ 3.14% |
| slim-vgg16_1 | 64.02 | 64.01 | ✅ 0.01% |
| slim-mobilenet_1 | 1,986.93 | 1,974.47 | 🔆 0.63% |
| slim-inceptionv4_1 | 199.92 | 195.16 | ✅ 2.44% |
| onnx-taau-downsample | 259.20 | 258.79 | ✅ 0.16% |

Check results before merge 🔆

@causten causten merged commit 77e80b8 into develop Jul 25, 2022
@causten causten deleted the add_onnx_mod_operator branch July 25, 2022 18:56
TedThemistokleous added a commit to TedThemistokleous/onnxruntime that referenced this pull request Feb 9, 2023
This has been available since July 25th 2022 in MIGraphX. Appeared to be missing
from the support list of ops.

ROCm/AMDMIGraphX#1302
PeixuanZuo pushed a commit to microsoft/onnxruntime that referenced this pull request Feb 14, 2023
This has been available since July 25th 2022 in MIGraphX. Appeared to be
missing from the support list of ops.

ROCm/AMDMIGraphX#1302

### Description

Add in node name for Mod Operator to be supported by MIGraphX

### Motivation and Context
Expand available functionality to Onnxruntime for the MIGraphX EP

Co-authored-by: Ted Themistokleous <[email protected]>
preetha-intel pushed a commit to intel/onnxruntime that referenced this pull request Feb 15, 2023
This has been available since July 25th 2022 in MIGraphX. Appeared to be
missing from the support list of ops.

ROCm/AMDMIGraphX#1302

### Description

Add in node name for Mod Operator to be supported by MIGraphX

### Motivation and Context
Expand available functionality to Onnxruntime for the MIGraphX EP

Co-authored-by: Ted Themistokleous <[email protected]>
natke added a commit to natke/onnxruntime that referenced this pull request Feb 16, 2023
* fix build error when building with minimal_build in conjunction with disable_exceptions flags (microsoft#14524)

### Description
If we set the flag 'disable_exceptions' to build ORT:


`onnxruntime/contrib_ops/cpu/quantization/qlinear_global_average_pool.cc.o`
wouldn't generate the symbols used by qlinear_pool.c:
```
0000000000000000 W _ZN11onnxruntime7contrib27ComputeQLinearGlobalAvgPoolIaEENS_6common6StatusEPKT_fS4_PS4_fS4_lllbPNS_11concurrency10ThreadPoolE
0000000000000000 W _ZN11onnxruntime7contrib27ComputeQLinearGlobalAvgPoolIhEENS_6common6StatusEPKT_fS4_PS4_fS4_lllbPNS_11concurrency10ThreadPoolE
```
so we get an error of undefined symbols for
ComputeQLinearGlobalAvgPool<uint8_t> and
ComputeQLinearGlobalAvgPool<int8_t>......



* Bump http-cache-semantics from 4.1.0 to 4.1.1 in /js/web (microsoft#14535)

* [ROCm] Fix ROCm build issue caused by REMOVE_ITEM  incorrect path (microsoft#14534)

### Description
Fix REMOVE_ITEM not working.

`onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc` is hipified from
`onnxruntime/contrib_ops/cuda/aten_ops/aten_op.cc`.
The file's correct path is
`${CMAKE_CURRENT_BINARY_DIR}/amdgpu/onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc`
and it exists in the hipified source files list
`onnxruntime_rocm_generated_contrib_ops_cc_srcs`.

A better way to fix it: if we don't want to build a file, add it to the
hipify excluded files so it will not be hipified.

* Stable Diffusion CUDA Optimizations (microsoft#14428)

### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(4) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(5) Optimization and benchmark script

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to any input size or dimensions. For example, BiasSplitGelu
requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D
input/weight.

There is a minor increase in binary size. For SM=75 only, the python
package wheel size adds (33757K - 33640K) = 117 KB. It is possible to
move NHWC from template parameter to constructor to reduce binary size
(with slight cost of performance).

Note: for RTX 4090/4080/4070 Ti, need build with CUDA 11.8 and latest
cuDNN to get best performance.

* Fix sharing scalar bug (microsoft#14544)

If an initializer is used as a graph output, we should keep its name
instead of renaming it as the constant sharing transformer currently does.

To fix microsoft#14488

* link mpi when either use_mpi or use_nccl enabled (microsoft#14467)

### Only link mpi when either use_mpi or use_nccl enabled

To fix the issue microsoft#14278. 

Talked with @askhade; we think that if users want to enable NCCL/MPI but MPI
is not found, it should be a failure instead of a warning.
So this PR made the change. As a result, to make CIs pass, we would need to
disable NCCL/MPI explicitly in the build command. This PR takes an
alternative approach: since NCCL and MPI are not used by customers, NCCL is
disabled by default if "--disable_nccl" is not specified, and MPI is disabled
by default if "--use_mpi" is not specified.


* Enable ability to control whether or not to quantize the bias (microsoft#14549)

* Upgrade doxygen to fix C API docs build issue (microsoft#13950)

* Add SLN support for t5 model with beam search (microsoft#14429)


---------

Co-authored-by: Ubuntu <[email protected]>

* [ROCm][MIGraphX EP]Add back in support for gfx1030 (microsoft#14565)

Adds back in proper build support for the Navi gen cards (gfx1030) 

Co-authored-by: Ted Themistokleous <[email protected]>

* [ORTModule] ATen Support for upsample_bilinear (microsoft#14519)

It's required by model MobileViT.

* Change the return type of softmax function to Status (microsoft#14559)

### Description
Change the return type of the Softmax
functions (`dispatch_warpwise_softmax_forward` and
`dispatch_blockwise_softmax_forward`) from `void` to `Status`.

### Motivation and Context
The Softmax function will call TunableOp, which returns Status. It's necessary
to pass the `Status` from the inner function to the outer function.

* do not use raw pointer for CpuBuffersInfo::buffers (microsoft#14574)

### Description
Do not use raw pointer for CpuBuffersInfo::buffers object



### Motivation and Context
This PR is to fix the bug 11159:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11159/

* [DML EP] Fix ScatterElements registration (microsoft#14560)

* IdentityBuilder should add Delimit for each input (microsoft#14592)

identity.append("####") should be appended for each input_def, not only on the last one,
and the else branch of this if should return ignore_identity:

https://github.com/microsoft/onnxruntime/blob/3d7518762ace6929be98e1203174c2dbf1ac094e/onnxruntime/core/optimizer/identical_children_consolidation.cc#L66

* Bump jszip from 3.7.1 to 3.8.0 in /js/web (microsoft#14536)

* [ROCm] Enable Sampling Op UT on AMD (microsoft#14581)

Making basic porting effort to run Sampling UT on ROCm ep, based on the
commits:

microsoft#13426
microsoft#14218

1. enabling EmbedLayerNorm op
2. enabling Sampling op
3. enabling helpers to copy data from CPU->GPU for subgraph

This task is the first checkpoint. There could be other missing ops when
testing a real model.
We will migrate more code onto ROCm as needed.

Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>

* Fix CI failure: temporarily disable real model tests from onnx repo (microsoft#14606)

### Description
To unblock the pipeline failure globally faster, disable these real-model
tests from the onnx repo for now. Meanwhile, we are trying to move these
models to Azure.


### Motivation and Context
onnx/onnx#4857: these models in the onnx repo are
broken. They were set up 4 years ago and the owner of these AWS instances
cannot be found.

* try VS 2022 in windowsAI pipeline (microsoft#14608)

### Description
update VS2019 to VS 2022 in
onnxruntime-Nuget-WindowsAI-Pipeline-Official



* Stable Diffusion CUDA optimizations Part 2 (microsoft#14597)

### Description
This is a follow-up of
microsoft#14428 for Stable Diffusion
CUDA optimizations:
(1) use NchwConv to replace Conv in the onnx graph and add Transpose nodes
accordingly
(2) reduce sequential Transpose nodes to at most one.
(3) symbolic shape infer of NchwConv
(4) fix add bias transpose which causes CUDA error (launching more than
1024 threads per block) in inferencing fp32 model.
(5) add models (bert, bart, stable_diffusion subdirectories) to package;
(6) remove option --disable_channels_last

Note that
(1) We can add a few graph transformations to reduce Transpose nodes
further. This is not done in this PR due to time limits.
(2) The Stable Diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue. However, it is much
slower to use float32 Attention.


* reduce cuda library binary size (microsoft#14555)

### Description
Reduce the cuda library size by:
1. refactoring beam_search_top_k to reduce template instantiation. It
saves ~56MB
2. opt out TopK for type uint*, int8_t and int16_t. It saves ~50MB.



* Remove Identical Children Consolidation from default transformer util. (microsoft#14602)


Co-authored-by: Scott McKay <[email protected]>

* Revert mimalloc from v2.0.9 to v2.0.3 (microsoft#14603)

Revert mimalloc from v2.0.9 to v2.0.3 to silence build error in
[post-merge
](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=273075&view=logs&j=f019f681-ae8f-5ee4-d119-02530df66a84&t=6c90c65c-2ab2-56af-633f-b5631256a8e1&l=351)
pipeline.
New dependency version was generated
[here](https://aiinfra.visualstudio.com/Lotus/_artifacts/feed/Lotus/UPack/onnxruntime_build_dependencies/overview/1.0.29).

Co-authored-by: Randy Shuai <[email protected]>
Co-authored-by: rui-ren <[email protected]>

* Some kernel changes for TULR  (microsoft#14517)

### Description
1. fix a bug in relative position bias kernel where seq_len > 32
2. rename extra_add_qk to relative_position_bias
3. support relative_position_bias in multihead attention (B, N, S, S*)
4. gru_gate support by Lei



---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>

* Introduce collective ops to ort inference build (microsoft#14399)

### Description
Introduce collective ops into onnxruntime inference build, including
1) AllReduce and AllGather schema in contrib op, controlled by USE_MPI
flag
2) AllReduce and AllGather kernel in cuda EP, controlled by ORT_USE_NCCL
flag


### Motivation and Context
Enable the collective ops in onnxruntime inference build so we have the
ability to run distributed inference with multiple GPUs.
The original ncclAllReduce ops in the training build require quite complex
configurations, which are not suitable for the inference case, and they are
already broken, so we introduce a new implementation.

---------

Co-authored-by: Cheng Tang <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>

* fix snpe build (microsoft#14616)

### Description
Fix SNPE build issue caused by cmake dependency refactor

### Motivation and Context
fix issue: microsoft#14547

* Adding RunOptions synchronization behaviour to C/C++ API (microsoft#14088)

### Description
This is exposing the already existing interface for asynchronous work of
all CUDA-based EPs (CUDA + TensorRT).


### Motivation and Context
This is something requested in microsoft#12216. It will enable users to build an
efficient data pipeline with ONNX Runtime and CUDA pre-/post-processing.
PCI traffic to the CUDA device can be run during inference as soon as
the postprocessing has consumed the input buffer and it can be overwritten.
To do this, work has to be submitted asynchronously to the device. Please see the
screenshots below illustrating this using Nsight Systems.

Async: 
<img width="1401" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png">

Synchronous:
<img width="1302" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png">

Note the gap in between the 2 inference runs due to issuing PCI traffic
in between and to the CPU overhead the active synchronization has.

---------

Co-authored-by: Chi Lo <[email protected]>

* Revert "try VS 2022 in windowsAI pipeline (microsoft#14608)" (microsoft#14619)

This reverts commit f88a464.

### Motivation and Context
For the release, the WinAI packaging pipeline's container image was reverted to the old
image, so we should revert VS to 2019.

* [Readme] Update table for build pipelines (microsoft#14618)

### Description
Update the list of pipelines to remove obsolete pipelines and reformat.
Optional pipelines are not included except for Android and iOS.


![image](https://user-images.githubusercontent.com/20780999/217395702-f08f1252-e1aa-4fec-ac34-1c0b9859ec20.png)

* [TVM EP] Support zero copying TVM EP output tensor to ONNX Runtime output tensor (microsoft#12593)

**Description**:
Support the new feature of the TVM Virtual Machine (method `set_outputs`) on the TVM
Execution Provider side. It allows avoiding excess copying from the TVM EP
output tensor to the ONNX Runtime one.

**Motivation and Context**
Tests with multiple-output topologies and big output tensors show that
there is overhead spent on copying from TVM EP to ONNX Runtime.
Returning output(s) on preallocated memory for VirtualMachine was
implemented on the TVM side.

**Details**
`set_output_zero_copy` provider option for TVM EP switches on/off this
feature. It is true by default.
The feature works for both GraphExecutor and VirtualMachine from TVM.

---------

Co-authored-by: Valery Chernov <[email protected]>

* Enable parallel output reordering in MlasReorderOutputNchw() (microsoft#13643)

### Description
This PR speeds up the output reordering operation (as implemented in
[MlasReorderOutputNchw](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/core/mlas/lib/reorder.cpp#L400))
by replacing the sequential implementation with a parallelized one. The
parallelization is achieved through the use of the existing
[TryBatchParallelFor](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/include/onnxruntime/core/platform/threadpool.h#L284)
construct.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The output reordering operation is frequently executed in image
processing models.
Its implementation can be easily parallelized and therefore sped up when
executed on a multi-core machine.
The amount of speedup achieved by this PR varies and depends on the
actual input.

The table below summarizes the results of some of the experiments I have
conducted on a 16-core VM running on an AMD EPYC 7742 64-core processor.
The experiment is based on the existing [unit
test](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/mlas/unittest/test_reorder_output.cpp)
for the output reordering operation. The first column represents the
shape of the output as BatchCount:Channels:Height:Width, and the numbers
in other columns represent the latency (in us, on average out of 100
runs) for the tested variants. Specifically, I compare the (sequential)
baseline (in second column) with the (parallelized) variants, each using
a number of worker threads equal to 1, 2, 4, 8 or 16 (as specified in
[the constructor to the threadpool
object](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/test/mlas/unittest/test_main.cpp#L12)).
The numbers in () represent the speedup over the baseline.

| Input | baseline | 1 Thread | 2 Threads | 4 Threads | 8 Threads | 16 Threads |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 1:1:112:112 | 20.8 | 21.5 (x0.97) | 21.9 (x0.95) | 22.2 (x0.94) | 22.5 (x0.92) | 23.0 (x0.90) |
| 1:128:160:84 | 540.4 | 712.5 (x0.76) | 404.0 (x1.34) | 327.8 (x1.65) | 377.9 (x1.43) | 371.8 (x1.45) |
| 13:240:4:314 | 1484.0 | 1851.1 (x0.80) | 1080.9 (x1.37) | 570.2 (x2.60) | 531.8 (x2.79) | 511.2 (x2.90) |
| 13:96:4:314 | 471.0 | 679.9 (x0.69) | 427.2 (x1.10) | 372.1 (x1.27) | 445.5 (x1.06) | 428.5 (x1.10) |
| 1:64:320:168 | 1215.1 | 1497.8 (x0.81) | 863.8 (x1.41) | 456.7 (x2.66) | 435.7 (x2.79) | 462.5 (x2.63) |
| 30:240:4:140 | 1711.5 | 2181.4 (x0.78) | 1182.6 (x1.45) | 657.4 (x2.60) | 592.5 (x2.89) | 578.0 (x2.96) |
| 30:336:4:140 | 2432.5 | 3039.2 (x0.80) | 1695.6 (x1.43) | 920.7 (x2.64) | 817.1 (x2.98) | 819.2 (x2.97) |

The initial drop between the baseline and the variant using just one
worker thread can be attributed to the overhead of invoking the
reordering loop as a functor in TryBatchParallelFor. This overhead is
compensated by the speedup of parallel processing when the number of
worker threads is increased.

* Rework C API to remove new/delete warnings (microsoft#14572)

### Description
Re-work code so it does not require GSL_SUPPRESS

### Motivation and Context
Do things right.

* Move TRT include_directories to outside scope (microsoft#14622)

Signed-off-by: Kevin Chen <[email protected]>

### Description
Previously `include_directories(${TENSORRT_INCLUDE_DIR})` was only done
if `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` was false. This would cause
a build failure when the switch was true as the include directory was
not added.

### Motivation and Context
Fixes TRT build when `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` is true.

---------

Signed-off-by: Kevin Chen <[email protected]>

* Remove torch package from requirements.txt of stable diffusion models (microsoft#14630)

### Description
Remove torch package from requirements to unblock nuget windowsai
pipeline which does not allow --extra-index-url


* Test and fix optimizers LayerNormFusion, BiasSoftmaxFusion, Transpose for opset 18 (microsoft#14542)

### Description

Due to the changes introduced in opset 18 on Reduce operators (axes is
an input and not an attribute), the following optimizers are not
catching the pattern they are supposed to optimize. This PR addresses
that.

* layer_norm_fusion.cc: the optimizer was not detecting the pattern it
was supposed to optimize
* bias_softmax_fusion.cc: the optimizer was not detecting the pattern it
was supposed to optimize
* transpose_optimizer.cc: the optimizer was not optimizing Reduce
operators other than ReduceSum

### Motivation and Context
Better performance.

---------

Signed-off-by: xadupre <[email protected]>

* Add rust bindings (microsoft#12606)

This adds updated Rust bindings that have been located at
[nbigaouette/onnxruntime-rs](https://github.com/nbigaouette/onnxruntime-rs).

check out the build instructions included in this PR at /rust/BUILD.md.

Changes to the bindings included in this PR:
- The bindings are generated with the build script on each build
- The onnxruntime shared library is built with ORT_RUST_STRATEGY=compile
which is now the default.
- A memory leak was fixed where a call to free wasn't called
- Several small memory errors were fixed
- Session is Send but not Sync, Environment is Send + Sync
- Inputs and Outputs can be ndarray::Arrays of many different types.

Some commits can be squashed, if wanted, but were left unsquashed to
show differences between old bindings and new bindings.

This PR does not cover packaging nor does it include the Rust bindings
within the build system.

For those of you who have previous Rust code based on the bindings,
these new bindings
can be used as a `path` dependency or a `git` dependency (though I have
not tested this out).

The work addressed in this PR was discussed in microsoft#11992

* [DORT] Update import path (microsoft#14605)

Follow up changes from
https://github.com/pytorch/pytorch/pull/93409/files for fixing DORT CI
failures.

* Fix softmax block forward with small element size (microsoft#14475)

### Description
1. ALIGN_BYTES was set to 16 before because float4 is used for
vectorization by default. This PR computes ALIGN_BYTES from the vectorize
size.
2. Fix wrong data access when using small element sizes (e.g., 1, 33).
Small cases may be used for SoftmaxTunableOp.
3. Fix the bug that data may be written first and then read in the
BlockReduce function on ROCm EP. There is a slight performance
improvement because all threads in warp-0 work.

BlockReduce method before this PR:
One block has N(warps_per_block) warps, one warp has M(WARP_SIZE)
threads.
step1. All the threads in one block read data into shared memory.
step2. Reduce all data to the first warp. Only the first N threads of
warp-0 are used. thread-0 computes data in warp-0 and writes the result
into the location of data0, thread-1 computes data in warp-1 and writes
the result into the location of data1.
__syncwarp(mask) is necessary here to make sure thread-1,...N will delay
writing data into warp-0 until thread-0 has finished reading data from
warp-0.
step3. Thread-0 reduces all valid data (only the first N data) in warp-0
and writes the results into the location of data0, then returns data0.

Issue: ROCm doesn't support __syncwarp() now, we need another
implementation to make sure read before write in warp-0.

BlockReduce function in this PR.
step2. Reduce all data to the first warp. Only the threads of warp-0 are
used. Each thread in warp-0 read data from the same location of every
warp and computes result. For example, thread-0 computes the first data
of every warp and writes the result into the location of data0.
step3. Thread-0 reduces all data in warp-0 and writes the results into
the location of data0, then return data0.

Shared memory

![image](https://user-images.githubusercontent.com/94887879/216281207-8b332af5-bb9f-443a-8e2d-5d40c2231629.png)

Test: kernel explorer will use small element to test.
(microsoft#14541)

* [prefast:Warning]: C26451 (microsoft#14628)


* Fix SAL annotation in private DML EP interface (microsoft#14639)

In microsoft#14461 I added a private interface to MLOperatorAuthorPrivate.h to
pipe ORT node names through to the debug name of DML operators/graphs.
The wrong SAL annotation was used on the `Get*Name` methods, which
confused static analysis tools into thinking there is a potential buffer
overrun.

* Switch to a static local variable to avoid global constexpr warning (microsoft#14638)

### Description
Switch to a static local variable to fix the warning

Comments in the code so it's clear that it's intentional.

### Motivation and Context
Prefast warning: [prefast:Warning]: C26426 (in
'onnxruntime::cuda::`dynamic initializer for 'castOpTypeConstraints''')
Global initializer calls a non-constexpr function
'onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>'
(i.22).

* Skip all training opset model tests (microsoft#14636)

* Add instructions for previewing docs changes (microsoft#12528)

* Add TuningContext for TunableOp (microsoft#14557)

This makes the TunableOp tuning results state free and will allow us to
dump and load offline tuning results.

* add symmetric quant in softmax (microsoft#14640)

### Description

microsoft#14626


### Motivation and Context

microsoft#14626

* fix problem of reduplicate input names (microsoft#14163)

Contributor: @guyang3532

* Add extra include to fix build w/ CUDA 12 (microsoft#14659)

Signed-off-by: Cliff Woolley <[email protected]>

### Description
Including file to fix build w/CUDA 12



### Motivation and Context
It should allow users to compile against CUDA 12

Signed-off-by: Cliff Woolley <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>

* [ROCm] add Softmax Tunable Op (microsoft#14541)

### Description
Add Softmax Tunable Op, only include blockwise vec implementation and
composable kernel.
Related PR: microsoft#14475,
microsoft#14612

---------

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>

* Update typing hints to support python 3.8 for training apis (microsoft#14649)

* remove device_id parameter out of ExecutionProvider::GetAllocator() (microsoft#14580)

### Description
Remove the parameter device_id from the ExecutionProvider::GetAllocator()
function.



### Motivation and Context
The parameter device_id is not necessary. We can fully rely on the
second parameter, OrtMemType mem_type, to determine the device_id when
getting an allocator from the execution provider.

* Update OrtEnv class documentation (microsoft#14650)

### Description
Tell more about `OrtEnv` class.

### Motivation and Context
Need to mention the importance of creating `OrtEnv` first.

* Fix DML release build (microsoft#14661)

### Description
Fixes the DML release build for 1.14.1. This was initially fixed by
microsoft#13417 for 1.13.1, but the
changes didn't make their way back to the main branch.

* Use miopenGetConvolutionSpatialDim if ROCm5.5 (microsoft#14483)

MIOpen created a new API to get the spatial dimensions.

* [MIGraphX EP] Add support for Mod OP (microsoft#14647)

This has been available since July 25th 2022 in MIGraphX. Appeared to be
missing from the support list of ops.

ROCm/AMDMIGraphX#1302

### Description

Add in node name for Mod Operator to be supported by MIGraphX

### Motivation and Context
Expand available functionality to Onnxruntime for the MIGraphX EP

Co-authored-by: Ted Themistokleous <[email protected]>

* [T5 optimization] fuse rel_pos_bias and remove extended mask (microsoft#14645)

### Description

1. fuse rel_pos_bias in T5.
2. remove extended masks in T5 decoder and decoder_init since they
generate all zeros
3. fix a bug in onnx_model.py



---------

Co-authored-by: Ubuntu <[email protected]>

* Remove erroneous function cast (microsoft#14673)

### Description
The custom thread entry point was declared `__stdcall` even though the
API dictated a different type. Casting caused improper cleanup of the
stack and crash manifested only in 32-bit Debug builds.

### Motivation and Context
This addresses microsoft#14613

* Stable Diffusion CUDA Optimizations Part 3 (microsoft#14646)

The third part for stable diffusion CUDA optimizations
(1) Add BiasAdd operator to replace two Add (bias and residual); Add
fusion for BiasAdd
(2) Add Attention fusion for VAE decoder.
(3) Update float16 conversion to handle Resize and GroupNorm. This could
reduce two Cast nodes for each Resize op in fp16 model.
(4) Force inputs and outputs to be float16 to avoid data casts in the
pipeline.
(5) Add options --force_fp32_ops, --inspect etc. in the optimize script so that
the user could force some operators to run in float32 to potentially get
better image quality (at the cost of performance).

Performance tests show slight improvement in T4. Average latency reduced
0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.

* Offline tuning (microsoft#14558)

Add the ability to get and set tuning results of an inference session.
Also add tool to manipulate onnx file to embed the results into the
model file and automatically load it on session initialization.

* [ROCm] Support for gpt2-based model inferencing (microsoft#14675)

When inferencing a real gpt2-based model, we found some gaps between the CUDA and
ROCm codebases.

The fixes include:

1. minimum code change to fix tensor shape on Attention Op
2. Support optional output tensor with SkipLayerNorm
3. fix a build error found on MI200

---------

Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>

* skip col2im_pads test (microsoft#14685)

### Description
skip col2im_pads test in model test.

### Motivation and Context
The failed test blocks updating the new image.

* Cfu fp16 (microsoft#14538)

### Description
FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel,
and ARM64 NEON kernel.


### Motivation and Context
First step in creating native support of fp16 model inferencing on ARM64
and AMD64 platforms.

---------

Co-authored-by: Chen Fu <[email protected]>

* Make some variables constexpr in orttraining/orttraining/training_ops/cuda/optimizer/lamb.cc. (microsoft#14698)

* Stable Diffusion CUDA Optimizations Part 4 (microsoft#14680)

(1) Support packed QKV format in MultiHeadAttention. This format could
avoid add bias transpose when TRT fused kernel is used.
(2) Add cache for cumulated sequence length computation. For SD, it only
needs to be computed once since the sequence length is fixed.
(3) Do not allocate qkv workspace to save memory for packed KV or QKV.
(4) Add unit tests for packed kv and packed qkv format in
MultiHeadAttention
(5) Mark some fusion options for SD only

Performance tests show slight improvement in T4. Average latency reduced
0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5
models. Memory usage drops from 5.1GB to 4.8GB.

* add noexcept to `InitApi()` and `GetApi()` (microsoft#13869)

### Description

* add noexcept to `InitApi()` and `GetApi()`

### Motivation and Context

* fixes microsoft#12581

* [Testing] Arrange parity utilities for onnxruntime parity tests to set order pr… (microsoft#14700)

Current configuration has CPU as the highest priority as per the specification found at :
https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession

providers – Optional sequence of providers in order of decreasing precedence.
Values can either be provider names or tuples of (provider name, options dict). If not provided,
then all available providers are used with the default precedence.

Sets correct operator precedence for the EPs in parity utilities for test runs

Ruling out any odd out of order issues when setting up tests for multiple EPs
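
For illustration (not the exact parity-utility code), an ordered provider list is passed to InferenceSession like this:

```python
import onnxruntime as ort

# Highest-priority EP first; ORT falls back down the list for unsupported nodes.
# "model.onnx" is a placeholder path.
session = ort.InferenceSession(
    "model.onnx",
    providers=["MIGraphXExecutionProvider", "ROCMExecutionProvider", "CPUExecutionProvider"],
)
```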

Co-authored-by: Ted Themistokleous <[email protected]>

* [ROCm] Add WarpWise Softmax into SoftmaxTunableOp (microsoft#14612)

1. Add Softmax warpwise_forward into SoftmaxTunableOp.
2. Make the Softmax op's use of TunableOp optional and use the original
implementation by default.
3. There are some other operators that use `dispatch_warpwise_softmax_forward
/ dispatch_warpwise_softmax_forward / SoftMaxComputeHelper` directly. But
they only have files under the cuda directory, and adding `RocmTuningContext`
to these files requires copying and modifying hipified files. For now, only
set RocmTuningContext as nullptr by default and do not hipify other
operators.
Related PR: microsoft#14541

---------

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>

* Stable Diffusion CUDA Optimizations Part 5 (microsoft#14706)

Add a fusion to remove transpose in subgraph like  
```
--> Gemm --> Unsqueeze(axes=[2]) --> Unsqueeze(axes=[3]) --> Add --> Transpose([0,2,3,1]) --> GroupNorm
```
With this fusion, we can remove 22 Transpose nodes in UNet, and reduce
latency by 0.1 second per image in T4.

* Add Rust docs generation

---------

Signed-off-by: Kevin Chen <[email protected]>
Signed-off-by: xadupre <[email protected]>
Signed-off-by: Cliff Woolley <[email protected]>
Co-authored-by: JiCheng <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: PeixuanZuo <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: pengwa <[email protected]>
Co-authored-by: Baiju Meswani <[email protected]>
Co-authored-by: Ye Wang <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ted Themistokleous <[email protected]>
Co-authored-by: Ted Themistokleous <[email protected]>
Co-authored-by: Vincent Wang <[email protected]>
Co-authored-by: cao lei <[email protected]>
Co-authored-by: Patrice Vignola <[email protected]>
Co-authored-by: Jian Chen <[email protected]>
Co-authored-by: ytaous <[email protected]>
Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
Co-authored-by: Chun-Wei Chen <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: RandySheriffH <[email protected]>
Co-authored-by: Randy Shuai <[email protected]>
Co-authored-by: rui-ren <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Tang, Cheng <[email protected]>
Co-authored-by: Cheng Tang <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Hector Li <[email protected]>
Co-authored-by: Maximilian Müller <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Co-authored-by: Faith Xu <[email protected]>
Co-authored-by: Valery Chernov <[email protected]>
Co-authored-by: Valery Chernov <[email protected]>
Co-authored-by: Alex Kogan <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Kevin Chen <[email protected]>
Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Boyd Johnson <[email protected]>
Co-authored-by: Wei-Sheng Chin <[email protected]>
Co-authored-by: Justin Stoecker <[email protected]>
Co-authored-by: Ryan Hill <[email protected]>
Co-authored-by: cloudhan <[email protected]>
Co-authored-by: Chen Fu <[email protected]>
Co-authored-by: guyang3532 <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
Co-authored-by: Zachary Streeter <[email protected]>
Co-authored-by: Chen Fu <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Dale Phurrough <[email protected]>