
Utilities to set ML models and scripts from driver scripts #185

Merged: 18 commits merged into CrayLabs:develop from al-rigazzi:model-setter on May 11, 2022

Conversation

@al-rigazzi (Collaborator) commented Apr 5, 2022

This PR adds the functionality needed to set ML models and TorchScript functions on the orchestrators (converged and not, but see the caveat below) from the driver script that launches SmartSim.

The full set of new functionality (a usage sketch follows this list):

  • set an ML model from a file on converged and non-converged orchestrators
  • set an ML model from memory on non-converged orchestrators
  • set a script from a file, or from a string representation in memory, on converged and non-converged orchestrators
  • set a function on non-converged orchestrators
  • do any of the above for an ensemble
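
To illustrate, here is a minimal driver-script sketch of how these setters might be used. The method names (add_ml_model, add_script, add_function) and their arguments are assumptions about the API added by this PR, not verbatim excerpts from the diff:

    # Hypothetical driver-script usage; method names and arguments are assumptions.
    from smartsim import Experiment

    exp = Experiment("inference-demo", launcher="local")
    db = exp.create_database(port=6780)  # standard (non-converged) orchestrator
    rs = exp.create_run_settings(exe="python", exe_args="sim.py")
    model = exp.create_model("sim", rs)

    # ML model from a file (works on converged and non-converged orchestrators)
    model.add_ml_model("resnet", backend="TORCH", model_path="resnet.pt", device="CPU")

    # TorchScript script from a file or an in-memory string
    model.add_script("preproc", script_path="preproc.py", device="CPU")

    # Python function (non-converged orchestrators only)
    def normalize(tensor):
        return tensor / tensor.max()

    model.add_function("normalize", function=normalize, device="CPU")

    exp.start(db, model)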

In short, what cannot be done is setting a model or a function from memory on converged orchestrators. The reasons for these API gaps are:

  • An in-memory representation of a model is binary. To pass it to a converged orchestrator we would have to embed it in the launcher script, and embedding a binary string in a text file causes many problems. Moreover, setting from memory is meant to avoid writing to a file, but since we would be dumping the model into the script, we would be storing it anyway. It is better to leave this to the user, so that models can be stored where needed.
  • Setting a function on the orchestrator requires passing the address of the function itself, which would not work once the context changes (i.e. when the converged setting is deployed).

I decided to use the SmartRedis client to connect to the orchestrators and set models and scripts, since spawning redis-cli processes would, in my opinion, have been more convoluted.
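
As a rough sketch of what that looks like under the hood, the SmartRedis Python client can push a model or script to a running orchestrator roughly like this (the address below is hypothetical, and the exact arguments may differ from what the controller passes):

    from smartredis import Client

    # Hypothetical address; the controller would use the orchestrator's real address
    client = Client(address="127.0.0.1:6780", cluster=False)

    # Upload a serialized TorchScript model stored on disk
    client.set_model_from_file("resnet", "resnet.pt", "TORCH", device="CPU")

    # Upload a TorchScript script from a file
    client.set_script_from_file("preproc", "preproc.py", device="CPU")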

@al-rigazzi requested a review from @Spartee on Apr 5, 2022 17:13
@codecov-commenter commented Apr 5, 2022

Codecov Report

Merging #185 (9b1b681) into develop (3f9b583) will decrease coverage by 3.23%.
The diff coverage is 92.16%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #185      +/-   ##
===========================================
- Coverage    81.80%   78.57%   -3.24%     
===========================================
  Files           57       59       +2     
  Lines         2974     3743     +769     
===========================================
+ Hits          2433     2941     +508     
- Misses         541      802     +261     
Impacted Files | Coverage | Δ
smartsim/ml/tf/utils.py 95.83% <ø> (ø)
smartsim/_core/utils/redis.py 51.13% <83.78%> (-48.87%) ⬇️
smartsim/entity/dbobject.py 86.23% <86.23%> (ø)
smartsim/_core/launcher/colocated.py 90.12% <93.47%> (+4.40%) ⬆️
smartsim/_core/control/manifest.py 92.24% <96.00%> (-2.65%) ⬇️
smartsim/_core/control/controller.py 80.47% <97.05%> (-2.25%) ⬇️
smartsim/_core/utils/__init__.py 100.00% <100.00%> (ø)
smartsim/database/orchestrator.py 84.90% <100.00%> (+0.96%) ⬆️
smartsim/entity/__init__.py 100.00% <100.00%> (ø)
smartsim/entity/ensemble.py 98.40% <100.00%> (-0.47%) ⬇️
... and 26 more

@Spartee (Contributor) left a comment:

Great stuff! A couple of comments and one bigger design thought.

Comments:

  • What's the workflow for Ensemble? Is the user required to write that loop themselves to call the Model functions?
  • Are there tests for Ensembles?

Should we allow the user to simply pass a DBModel to Experiment.start() in addition to the methods in the Model and Ensemble classes? DBObjects would be more user-facing, but

@al-rigazzi requested a review from @Spartee on Apr 19, 2022 22:20
@al-rigazzi (Collaborator, Author):

@Spartee re: passing the model to Experiment.start(): that is an interesting idea, but for now I'd prefer to keep the DBObjects a Model-related entity. If multiple models need the same DBObject, they either belong to an ensemble (which is now addressed), or the user can set the DBObject on all of them, or on just one of the models.
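
To make the ensemble workflow concrete, a sketch assuming the Ensemble entity exposes the same setters as Model (the method name and arguments here are assumptions):

    # Hypothetical ensemble usage; assumes Ensemble exposes the same setters as Model.
    from smartsim import Experiment

    exp = Experiment("ensemble-demo", launcher="local")
    rs = exp.create_run_settings(exe="python", exe_args="sim.py")
    ensemble = exp.create_ensemble("members", run_settings=rs, replicas=4)

    # Attaching the script once at the ensemble level applies it to every member,
    # so the user does not have to write the loop over the members manually.
    ensemble.add_script("preproc", script_path="preproc.py", device="CPU")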

@Spartee (Contributor) left a comment:

Very close; a couple of comments we should chat about today.

@Spartee (Contributor) left a comment:

Approving with minor comments. Looks great. Works on my Mac.

if not ensembles:
    return False

# Fold ensemble-level DB objects into the result: the bitwise OR on booleans
# acts as a logical OR, so this is True if any ensemble has models or scripts.
has_db_objects |= any(
    has_db_models(ensemble) | has_db_scripts(ensemble) for ensemble in ensembles
)
Contributor:

Add comments here to explain what's being done. Love the operator usage, but it's not readable.

Collaborator (Author):

Yep, done.

            args.device + f":{device_num}")
elif args.file:
    if args.devices_per_node == 1:
        client.set_script_from_file(args.name,
Contributor:

What happens if these fail? Have we tried setting a bad model?

Collaborator (Author):

The functions are launched inside a try/except block, so I think it should catch such an exception. Do you think we should add a test for that case?
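
For illustration, a minimal sketch of that error handling around a SmartRedis setter call (the address is hypothetical, and the exact exception raised for a bad model depends on the SmartRedis version):

    # Hypothetical error handling; the address and exception type are illustrative.
    from smartredis import Client

    client = Client(address="127.0.0.1:6780", cluster=False)

    try:
        # A malformed or unreadable model should surface here as an exception
        client.set_model_from_file("resnet", "resnet.pt", "TORCH", device="CPU")
    except Exception as exc:
        # Report the failure instead of letting the launcher helper die silently
        print(f"Failed to set model: {exc}")
        raise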

@al-rigazzi merged commit 8798c28 into CrayLabs:develop on May 11, 2022
al-rigazzi added a commit to al-rigazzi/SmartSim that referenced this pull request May 16, 2022
Adds utilities to set ML models and ML scripts directly from driver scripts, as opposed to only from application code.

[ committed by @al-rigazzi ]
[ reviewed by @Spartee ]
@al-rigazzi deleted the model-setter branch on March 20, 2023 19:52