Capacity aware partitioning #22766
base: main
Conversation
How about the intermediate memory usage (workspace) for each node? That is usually unknown during partitioning, and even unknown during inference, since an op has no interface to report its workspace size right now. For example, the MultiHeadAttention op might call different CUDA kernels (flash attention, cutlass fmha, TensorRT fmha, or an unfused kernel), each with different memory consumption.
This is true. The function currently accounts for initializers and inputs. It cannot account for temporary allocations because those are made at inference time, and partitioning takes place well before kernels are instantiated. The approach of computing memory patterns cannot be taken here, since that relies on the presence of a runnable model, which we do not have today in a constrained environment.
This PR is still at the experimental stage. I envision that most of the burden would be placed on the individual EPs. The simplest way is to add an additional if/else to enumerate the kernels and attempt to infer the amount of temporary space. However, that creates an additional maintenance burden, since we already have plenty of such places in optimizers and elsewhere where we need to make sure that changes to individual kernels are reflected.
That said, it would still work in its current form: one can try one setting and then lower it if the consumption is too high. Another idea would be to run the model beforehand and record the consumption, then use that trace to set the limit in the constrained environment.
If so, I think the feature is not very helpful for vision or LLM models due to those limitations.
That's a good idea, and it would be great if we could support that use case. BTW, a general way to help with capacity constraints is to provide a way to manually configure the location of initializers and inputs. This could be extended to support offloading initializers to CPU and only loading them onto the GPU when needed.
void NodeStatsRecorder::ReportNodeStats(const std::string& node_name, const NodeAllocationStats& stats) {
  std::lock_guard lock(impl_->mut_);
  auto result = impl_->node_stats_.emplace(node_name, stats);
Check warning (Code scanning / PREfast): The pointer is dangling because it points at a temporary instance which was destroyed.
Implement GetSizeFromTensorTypeProto
Wire in accounting
Make CUDA EP resource aware and account on assignment
Fix missing accountant for Ort format
Remove redundant functions
Remove unnecessary interface
Fix DML issue, minor fixes
Fix alert
DEMO changes
Implement node memory stats collection
Place container in the session
Support nested graphs
Add synchronization
Update stats for the max consumption
Introduce input sizes computation
TEST(CApiTest, GenerateNodeStatsFile) {
  Ort::Env env(ORT_LOGGING_LEVEL_INFO);
  constexpr const ORTCHAR_T* model_path = TSTR("testdata/transformers/tiny_gpt2_beamsearch.onnx");
// if present and adds it to the consumed amount
void AccountForNode(size_t cost_index) const {
  assert(cost_index < nodes_costs.size());
  if (nodes_costs[cost_index].has_value()) {
Do we need to handle the case where nodes_costs[cost_index].has_value() is false here? Adding some comments might help.
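For illustration, the missing-cost branch could be made explicit along these lines. This is a sketch only; the running-total member (accumulated_cost_) is a hypothetical name, not the PR's actual field:

// Sketch: adds the pre-computed cost of a node to the running total.
// An empty optional means no cost was recorded for this node, so its
// contribution is deliberately treated as zero.
void AccountForNode(size_t cost_index) {
  assert(cost_index < nodes_costs.size());
  const auto& cost = nodes_costs[cost_index];
  if (cost.has_value()) {
    accumulated_cost_ += *cost;  // hypothetical accumulator member
  }
  // else: nothing to account for; skipping is intentional.
}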
@@ -0,0 +1,56 @@
GptAttention_1_add,18432,0,0,0 |
Add a header line?
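For illustration, a header row could look like the line below. Only the first column (the node name) is certain from the snippet; the remaining column names are assumptions made here for readability, and the real field names in the PR may differ:

node_name,input_sizes,initializers_sizes,dynamic_output_sizes,temp_allocations
GptAttention_1_add,18432,0,0,0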
MatMul_1165,146432,0,576000,0
GptAttention_2,30720,0,36864,165888
LayerNorm_6,18432,0,0,0
BeamSearch_gpt2,24,0,256,1823244
For a node with a subgraph, does the resource accounting include or exclude the subgraph?
ONNX allows duplicated node names in different subgraphs. Is there a way to distinguish them?
GetConstantInitializer(input->Name(), check_outer_scope_true);
if (initializer != nullptr) {
  size_t out;
  if (utils::GetSizeInBytesFromTensorProto<0>(*initializer, &out).IsOK()) {
If an initializer is used by two nodes, will the size of the initializer be counted twice?
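One way to avoid double counting, sketched here as an assumption rather than what the PR currently does, is to remember which initializer names have already been accounted for. The set, the helper function, and the graph reference are hypothetical additions:

// Sketch: count each shared initializer at most once across all nodes.
std::unordered_set<std::string> counted_initializers;

size_t AccountInitializerOnce(const Graph& graph, const NodeArg* input) {
  constexpr bool check_outer_scope_true = true;
  const auto* initializer = graph.GetConstantInitializer(input->Name(), check_outer_scope_true);
  if (initializer == nullptr) {
    return 0;  // not a constant initializer
  }
  if (!counted_initializers.insert(input->Name()).second) {
    return 0;  // already counted for another node that shares it
  }
  size_t out = 0;
  if (utils::GetSizeInBytesFromTensorProto<0>(*initializer, &out).IsOK()) {
    return out;
  }
  return 0;
}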
const auto* proto = input->TypeAsProto();
if (proto != nullptr && utils::HasTensorType(*proto)) {
  const auto& tensor_type = proto->tensor_type();
  if (utils::HasElemType(tensor_type) && utils::HasShape(tensor_type)) {
What happens when an input does not have a shape? Shall we add a log here? It seems the result is not accurate in that case.
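A sketch of the suggested logging, assuming the usual onnxruntime logging macros; without a static shape the input's size cannot be estimated, so the accounted total becomes a lower bound:

if (utils::HasElemType(tensor_type) && utils::HasShape(tensor_type)) {
  // ... size computation as in the PR ...
} else {
  // Sketch: make the inaccuracy visible instead of silently skipping the input.
  LOGS_DEFAULT(WARNING) << "Capacity-aware partitioning: input '" << input->Name()
                        << "' has no static shape; its size is not accounted for.";
}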
for (int i = 0, lim = kernel_ctx.InputCount(); i < lim; ++i) {
  // Need to get ort_value_index for each input.
  const OrtValue* p_input = kernel_ctx.GetInputMLValue(i);
  if (p_input != nullptr && p_input->IsAllocated() && p_input->IsTensor()) {
If some inputs or outputs are externally allocated memory via IO binding, should we include that in the accounting?
if (!resource_partitioning_settings.empty()) {
  auto splits = utils::SplitString(resource_partitioning_settings, ",", true);
  if (splits.size() == 2) {
Else, raise an error for an invalid format?
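A sketch of that suggestion, rejecting anything that is not exactly two comma-separated fields instead of silently ignoring it (the error text is illustrative only):

if (!resource_partitioning_settings.empty()) {
  auto splits = utils::SplitString(resource_partitioning_settings, ",", true);
  if (splits.size() == 2) {
    // ... parse the two fields as in the PR ...
  } else {
    // Sketch: fail fast on a malformed settings string.
    ORT_THROW("Invalid resource partitioning settings, expected two comma-separated values: ",
              resource_partitioning_settings);
  }
}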
auto& map = result.emplace();

if (!splits[0].empty()) {
  SafeInt<size_t> cuda_memory_limit = std::stoul(std::string{splits[0]});
use ParseStringWithClassicLocale?
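A sketch of what that could look like, assuming the parse helpers from core/common/parse_string.h; the exact signature used here is an assumption, and std::stoul's exception-on-failure behavior is replaced with an explicit status check:

if (!splits[0].empty()) {
  size_t limit_bytes = 0;
  // Sketch: locale-independent parsing with error propagation instead of std::stoul.
  ORT_RETURN_IF_ERROR(ParseStringWithClassicLocale(splits[0], limit_bytes));
  SafeInt<size_t> cuda_memory_limit = limit_bytes;
  // ...
}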
void NodeStatsRecorder::ReportNodeStats(const std::string& node_name, const NodeAllocationStats& stats) {
  auto result = impl_->node_stats.emplace(node_name, stats);
  if (!result.second) {
    // Node already exists, update the stats
I think each node is visited only once (only assigned to one EP). In what situation will the code hit this?
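Judging from the PR description ("results of one or more runs") and the commit message "Update stats for the max consumption", a node can apparently be reported more than once, for example across repeated Run() calls or for subgraph nodes executed on each iteration, and the recorder then keeps the largest observed values. A sketch of that update, with field names assumed for illustration:

if (!result.second) {
  // Sketch: keep the maximum observed consumption across repeated reports.
  NodeAllocationStats& existing = result.first->second;
  existing.input_sizes = std::max(existing.input_sizes, stats.input_sizes);
  existing.total_temp_allocations = std::max(existing.total_temp_allocations, stats.total_temp_allocations);
}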
Description
Allow users to specify per-EP resource constraints.
Currently, models that do not fit into device memory error out.
This PR lays the groundwork for EP-specific, resource-constrained graph
partitioning, subject to incremental feature additions.
Partitioning in this context means assigning graph nodes to a specific device (Execution Provider)
up to a certain limit that is either automatically inferred or provided by configuration.
In this implementation, we stop assigning nodes to CUDA once we reach the specified memory limit.
This allows users to run models on devices with limited memory or other pre-defined resources and
offload parts of the graph to the CPU or other EPs as configured.
The PR also introduces the ability to profile and save resource consumption on a per-node basis.
The results of one or more runs are saved into a CSV file, which can then be loaded to assist
partitioning.
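For illustration, end-to-end usage might look roughly like the sketch below. Every configuration key and the settings format shown here are placeholders invented for this example, not the actual entries introduced by the PR:

#include "onnxruntime_cxx_api.h"

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING);
  Ort::SessionOptions so;

  // Step 1 (unconstrained machine): collect per-node resource stats into a CSV file.
  so.AddConfigEntry("session.collect_node_stats", "1");            // placeholder key
  so.AddConfigEntry("session.node_stats_file", "node_stats.csv");  // placeholder key

  // Step 2 (constrained machine): cap CUDA assignment using a memory limit,
  // optionally assisted by the previously recorded stats.
  so.AddConfigEntry("ep.cuda.resource_partitioning_settings",      // placeholder key
                    "2147483648,node_stats.csv");                  // placeholder format: limit in bytes, stats file

  Ort::Session session(env, ORT_TSTR("model.onnx"), so);
  return 0;
}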
Model architecture-based partitioning (e.g., placing N transformer blocks on the GPU and the embeddings on the CPU) is not implemented in this PR but will come in the future.
Motivation and Context
We want to allow models to run in constrained environments.
Pending
Annotation-assisted partitioning