[ENV] update runtime setting default values #18987
Conversation
Hey @szha , Thanks for submitting the PR
CI supported jobs: [centos-cpu, windows-gpu, centos-gpu, unix-cpu, sanity, windows-cpu, miscellaneous, clang, website, edge, unix-gpu]
@sxjscience @zhreshold it would be great to have some data on how such defaults work for the cv and nlp models.
src/storage/storage.cc
Outdated
-    if (type == nullptr)
-      type = "Naive";  // default pool
+    if (type == nullptr) {
+      type = "Round";  // default pool
It's obvious that round can help speed up certain dynamic-input workloads, but it tends to OOM more frequently. I suggest we be very cautious about changing the default to round unless there's a good fallback for OOM handling.
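As a rough illustration of the trade-off being discussed (this is a sketch, not MXNet's actual allocator): a round-style pool buckets requests by rounding them up, e.g. to the next power of two, so slightly different dynamic-shape requests hit the same bucket and reuse a cached buffer, at the cost of holding more bytes than were asked for.

```python
# Hedged sketch of round-style pooling; the real MXNet pool differs in detail.
def round_up_pow2(nbytes):
    """Round a request up to the next power of two, as a round pool might."""
    size = 1
    while size < nbytes:
        size *= 2
    return size

# Two dynamic-shape requests of slightly different sizes land in the same
# bucket, so the second can reuse the first buffer instead of allocating again.
a = round_up_pow2(5_000_000)  # 8388608 (2**23)
b = round_up_pow2(6_000_000)  # 8388608 -> same bucket, buffer reused
assert a == b
# The cost: up to ~2x the requested bytes may be held, which is what can push
# models already near the GPU memory limit into OOM.
```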
I share the concern on this. This change will impact static-size, static-graph models that were at the boundary of the GPU memory limit. Which of the current GluonCV model training scripts fall in this category?
My hope is of course to provide a good out-of-the-box experience to mxnet users. From what I observed, there seem to be more models with dynamic-shape inputs than static ones, and many of the static-shape models can still run in this setting, hence the proposal.
I think CNNs are generally static shape while models in NLP are generally dynamic shape. Do we have any plan for improving the memory usage?
I don't think we can generalize like this. For example, object detection and segmentation are based on CNNs and are usually not static-shaped.
Of course we do. I think @ArmageddonKnight is currently fixing some missed allocation entries in the memory profiler, and plans to develop a memory usage visualization tool later this week to help narrow down the focus for memory optimization. We also intend to add a mirror option to cached op to allow training larger models.
I can approve from the NLP side: because NLP workloads are usually dynamic, it's beneficial to have round memory management.
thanks for the reviews. @zhreshold how about setting the pool strategy to be naive when shapes are static?
This is actually fine when shapes are static. My major concern is that with round enabled by default, mxnet can be faster in most use cases but consumes more memory than expected.
@zhreshold we could consider adding an interface to allocate the exact size and use it in cached op for static shape only.
I reverted the round pool change first to merge the rest of the changes. I will work on a cached op path to enable exact size allocation to avoid the memory waste in the static graph case.
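Since the default stays naive after this revert, users with dynamic-shape workloads can still opt into the round pool per process via the `MXNET_GPU_MEM_POOL_TYPE` environment variable that the `CreateStorageManager` path above reads. A minimal sketch (assuming the variable is set before mxnet is imported, so the storage manager picks it up at initialization):

```python
import os

# Select the round pool strategy for this process; must happen before
# importing mxnet, since the storage manager reads it via getenv at startup.
os.environ["MXNET_GPU_MEM_POOL_TYPE"] = "Round"

# import mxnet as mx  # import only after setting the variable
```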
Force-pushed from ff5a940 to 9267d11.
Description
Update runtime setting default values for resource copies and memory pool type.