Create dedicated build for training api #14136
Conversation
…p to true when enable_training is ON
@@ -429,7 +429,7 @@ if(onnxruntime_ENABLE_ATEN)
   FetchContent_Populate(dlpack)
 endif()

-if(onnxruntime_ENABLE_TRAINING)
+if(onnxruntime_ENABLE_TRAINING OR (onnxruntime_ENABLE_TRAINING_APIS AND onnxruntime_BUILD_UNIT_TESTS))
So if we have ENABLE_TRAINING but not onnxruntime_BUILD_UNIT_TESTS, we still enable the following flag?
Yes. It is used by the onnxruntime_training_runner executable as well as onnxruntime_training_mnist and onnxruntime_training_gpt2... All of this is deprecated code, and we can simply remove onnxruntime_ENABLE_TRAINING from there once that code is removed. I will add this as a comment in the cmake.
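For reference, a minimal sketch of what that cmake comment could look like; the condition comes from the diff above, while the comment wording and the placeholder body are mine:

```cmake
# dlpack is needed both by the deprecated full-training targets
# (onnxruntime_training_runner, onnxruntime_training_mnist, onnxruntime_training_gpt2)
# and by the training-API unit tests.
# TODO: drop the onnxruntime_ENABLE_TRAINING branch once the deprecated code is removed.
if(onnxruntime_ENABLE_TRAINING OR (onnxruntime_ENABLE_TRAINING_APIS AND onnxruntime_BUILD_UNIT_TESTS))
  # ... training-related external dependencies are declared/populated here ...
endif()
```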
cmake/onnxruntime_graph.cmake
Outdated
@@ -72,6 +72,27 @@ if (onnxruntime_ENABLE_TRAINING_OPS AND NOT onnxruntime_ENABLE_TRAINING)
     "${ORTTRAINING_SOURCE_DIR}/core/graph/training_op_defs.h"
   )
 endif()

+if (onnxruntime_ENABLE_TRAINING_APIS AND NOT onnxruntime_ENABLE_TRAINING)
It is a little confusing, as the definitions in the build-flag section indicate that ENABLE_TRAINING is a superset of ENABLE_TRAINING_APIS...
I guess the idea here is that the following file list is only used for the ORT training C++ API, but not for the ORT training Python API (ORTModule), right? If that is the case, could we give the Python API build and the C++ training API build their own explicit flags, instead of mixing them with ENABLE_TRAINING?
I believe onnxruntime_ENABLE_TRAINING_APIS is for on-device training, so it should not pull in the old ORT trainer C++ code. But yes, by its name onnxruntime_ENABLE_TRAINING_APIS reads as a subset of ENABLE_TRAINING; I recall it was previously called TRAINING_ON_DEVICE.
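To make that concrete, here is a rough sketch of how the flags appear to stack up and one way the superset relationship could be stated explicitly; whether the build scripts already do something equivalent is an assumption on my part:

```cmake
# Rough mental model from this thread:
#   onnxruntime_ENABLE_TRAINING      - full training build (ORTModule / Python API plus the deprecated C++ trainer)
#   onnxruntime_ENABLE_TRAINING_APIS - lean on-device training build (previously called TRAINING_ON_DEVICE)
# Making the superset explicit instead of scattering "X OR Y" / "X AND NOT Y" checks:
if(onnxruntime_ENABLE_TRAINING AND NOT onnxruntime_ENABLE_TRAINING_APIS)
  message(STATUS "Full training build also enables the on-device training APIs")
  set(onnxruntime_ENABLE_TRAINING_APIS ON)
endif()
```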
@@ -1,7 +1,7 @@
 // Copyright (c) Microsoft Corporation. All rights reserved.
 // Licensed under the MIT License.

-#ifdef ENABLE_TRAINING
+#ifdef ENABLE_TRAINING_CORE
 #include <onnx/defs/attr_proto_util.h>
If the whole file is only for the training build, could we just put it under the training folder and include it only in the training build in the cmake, instead of having the ifdef in the code?
This was added by me; I put it here so it can easily be enabled for inferencing later.
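For reference, the alternative suggested above would look roughly like this in the cmake; the source-list variable and the file name are illustrative, not the real ones:

```cmake
# Instead of guarding the shared source file with #ifdef ENABLE_TRAINING_CORE,
# compile a training-only source file just for training builds:
if(onnxruntime_ENABLE_TRAINING OR onnxruntime_ENABLE_TRAINING_APIS)
  list(APPEND onnxruntime_graph_src                                      # hypothetical list name
    "${ORTTRAINING_SOURCE_DIR}/core/graph/attr_proto_util_training.cc")  # hypothetical file name
endif()
```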
@@ -34,13 +34,14 @@
 #if defined(ENABLE_TRAINING_OPS)
 #include "orttraining/core/graph/training_op_defs.h"
 #endif
+#ifdef ENABLE_TRAINING_CORE
 #include "orttraining/core/graph/loss_function_registry.h"
Do you need these for the ORTModule build? I thought this was only useful for the training C++ API, but ENABLE_TRAINING_CORE seems to be shared by both sides.
Originally this was part of ENABLE_TRAINING, so I think they are needed (will check). BTW, we don't do a dedicated build for ORTModule; we simply do a full training build, so we will need these.
I believe "orttraining/core/graph/loss_function_registry.h" is not needed for training_api (on-device training)
@@ -471,7 +471,9 @@ if (onnxruntime_USE_CUDA)
   onnxruntime_add_include_to_target(onnxruntime_providers_cuda onnxruntime_common onnxruntime_framework onnx onnx_proto ${PROTOBUF_LIB} flatbuffers::flatbuffers)
   if (onnxruntime_ENABLE_TRAINING_OPS)
     onnxruntime_add_include_to_target(onnxruntime_providers_cuda onnxruntime_training)
-    target_link_libraries(onnxruntime_providers_cuda PRIVATE onnxruntime_training)
+    if (onnxruntime_ENABLE_TRAINING)
if "if (onnxruntime_ENABLE_TRAINING_OPS)" is true, then "if (onnxruntime_ENABLE_TRAINING)" is ture, right?
Not always...
- The on-device training build uses the enable_training_apis flag, and in that case enable_training will not be true.
- There is one more scenario where training ops are included in the inference build... I am not 100% sure how it is used, but the enable_training_ops macro was first added for that scenario.
@@ -3650,7 +3650,7 @@ TEST(AttentionTest, DISABLED_Attention_Mask1D_Fp16_B2_FusedNoPadding) {
   }
 }

-#ifndef ENABLE_TRAINING // Prepacking is enabled only on non-training builds
+#ifndef ENABLE_TRAINING_CORE // Prepacking is enabled only on non-training builds
I think we need to make the relationship between these training macros clear somewhere. The inferencing folks at least need to know which macro to use to explicitly turn off a code snippet.
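For what it's worth, here is one reading of how the macros are layered after this PR; the mapping is inferred from the diffs above rather than from documentation:

```cmake
# C++ macros as they appear to be layered after this PR:
#   ENABLE_TRAINING      - only the full training build
#   ENABLE_TRAINING_CORE - full training build AND the on-device training-API build;
#                          this is the macro inference-only code should key off
#                          (e.g. the prepacking guard in the test above)
#   ENABLE_TRAINING_OPS  - any build that compiles the training kernels
# Illustrative cmake side (exact placement and calls are assumptions):
if(onnxruntime_ENABLE_TRAINING OR onnxruntime_ENABLE_TRAINING_APIS)
  add_compile_definitions(ENABLE_TRAINING_CORE)
endif()
```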
@@ -471,7 +471,9 @@ if (onnxruntime_USE_CUDA)
   onnxruntime_add_include_to_target(onnxruntime_providers_cuda onnxruntime_common onnxruntime_framework onnx onnx_proto ${PROTOBUF_LIB} flatbuffers::flatbuffers)
   if (onnxruntime_ENABLE_TRAINING_OPS)
     onnxruntime_add_include_to_target(onnxruntime_providers_cuda onnxruntime_training)
-    target_link_libraries(onnxruntime_providers_cuda PRIVATE onnxruntime_training)
+    if (onnxruntime_ENABLE_TRAINING)
+      target_link_libraries(onnxruntime_providers_cuda PRIVATE onnxruntime_training)
Why does the CUDA EP need to be linked against the gradient builder / training agent? I thought that in the training build the only impact on the CUDA EP is the inclusion of the additional training kernels.
#ifdef ENABLE_TRAINING_CORE
  // <training schemas>
  // This can also be moved inside enable_training. Needs more investigation
  training::GraphTransformerRegistry::GetInstance().RegisterExternalGraphTransformers();
If I remember correctly, this API was added for Apollo usage, which we don't need anymore. We can double-check whether we can remove it, but I believe you don't need it for on-device training.
Makes sense; I will cover this in a separate PR which I am working on right now... Will merge this PR.
Description
Enable creating a dedicated build for on-device training. With this PR we can build a lean binary for on-device training using the flag --enable_training_apis. This binary includes only the essentials, such as training ops and optimizers, and NOT features like ATen fallback, strided tensors, and gradient builders. It also removes deprecated components such as training::TrainingSession and OrtTrainer.
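For example, a lean on-device training build could be produced with an invocation along these lines; the --enable_training_apis flag is the one introduced here, while the build script name and the other options are just an illustrative sketch:

```
# illustrative only; exact script name and extra options may differ per platform
./build.sh --config MinSizeRel --parallel --enable_training_apis
```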
Motivation and Context
This enables our partners to create a lean binary for on-device training.