
Utilities to set ML models and scripts from driver scripts #185

Merged: 18 commits merged into CrayLabs:develop from al-rigazzi:model-setter on May 11, 2022

Conversation

@al-rigazzi (Collaborator) commented Apr 5, 2022

This PR adds the functionality needed to set ML models and TorchScript functions on the orchestrators (converged and not, but see the caveat below) from the driver script that launches SmartSim.

The full set of new functionality (a usage sketch follows this list):

  • set an ML model from a file on converged and non-converged orchestrators
  • set an ML model from memory on non-converged orchestrators
  • set a script from a file, or from a string representation in memory, on converged and non-converged orchestrators
  • set a function on non-converged orchestrators
  • do any of the above for an ensemble
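
To illustrate, here is a minimal driver-script sketch of how these setters might be used. The method names (add_ml_model, add_script, add_function) and their arguments are assumptions about the API added by this PR, not verbatim excerpts from the diff:

    # Hypothetical driver-script usage; method names and arguments are assumptions.
    from smartsim import Experiment

    exp = Experiment("inference-demo", launcher="local")
    db = exp.create_database(port=6780)  # standard (non-converged) orchestrator
    rs = exp.create_run_settings(exe="python", exe_args="sim.py")
    model = exp.create_model("sim", rs)

    # ML model from a file (works on converged and non-converged orchestrators)
    model.add_ml_model("resnet", backend="TORCH", model_path="resnet.pt", device="CPU")

    # TorchScript script from a file or an in-memory string
    model.add_script("preproc", script_path="preproc.py", device="CPU")

    # Python function (non-converged orchestrators only)
    def normalize(tensor):
        return tensor / tensor.max()

    model.add_function("normalize", function=normalize, device="CPU")

    exp.start(db, model)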

In short, what cannot be done is setting a model or a function from memory on converged orchestrators. The reasons for these API gaps are:

  • An in-memory representation of a model is binary. To pass it to a converged orchestrator we would have to embed it in the launcher script, and embedding a binary string in a text file causes many problems. Moreover, setting from memory is meant to avoid writing to a file, but since we would be dumping the model into the script, we would be storing it anyway. It is better to leave this to the user, so that models can be stored where needed.
  • Setting a function on the orchestrator requires passing the address of the function itself, which would not work once the context changes (i.e. when the converged setting is deployed).

I decided to use the SmartRedis client to connect to the orchestrators and set models and scripts, since spawning redis-cli processes would, in my opinion, have been more convoluted.
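
As a rough sketch of what that looks like under the hood, the SmartRedis Python client can push a model or script to a running orchestrator roughly like this (the address below is hypothetical, and the exact arguments may differ from what the controller passes):

    from smartredis import Client

    # Hypothetical address; the controller would use the orchestrator's real address
    client = Client(address="127.0.0.1:6780", cluster=False)

    # Upload a serialized TorchScript model stored on disk
    client.set_model_from_file("resnet", "resnet.pt", "TORCH", device="CPU")

    # Upload a TorchScript script from a file
    client.set_script_from_file("preproc", "preproc.py", device="CPU")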

@al-rigazzi requested a review from @Spartee on Apr 5, 2022 17:13
@codecov-commenter commented Apr 5, 2022

Codecov Report

Merging #185 (9b1b681) into develop (3f9b583) will decrease coverage by 3.23%.
The diff coverage is 92.16%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #185      +/-   ##
===========================================
- Coverage    81.80%   78.57%   -3.24%     
===========================================
  Files           57       59       +2     
  Lines         2974     3743     +769     
===========================================
+ Hits          2433     2941     +508     
- Misses         541      802     +261     
Impacted Files | Coverage | Δ
smartsim/ml/tf/utils.py 95.83% <ø> (ø)
smartsim/_core/utils/redis.py 51.13% <83.78%> (-48.87%) ⬇️
smartsim/entity/dbobject.py 86.23% <86.23%> (ø)
smartsim/_core/launcher/colocated.py 90.12% <93.47%> (+4.40%) ⬆️
smartsim/_core/control/manifest.py 92.24% <96.00%> (-2.65%) ⬇️
smartsim/_core/control/controller.py 80.47% <97.05%> (-2.25%) ⬇️
smartsim/_core/utils/__init__.py 100.00% <100.00%> (ø)
smartsim/database/orchestrator.py 84.90% <100.00%> (+0.96%) ⬆️
smartsim/entity/__init__.py 100.00% <100.00%> (ø)
smartsim/entity/ensemble.py 98.40% <100.00%> (-0.47%) ⬇️
... and 26 more

@Spartee (Contributor) left a comment:

Great stuff! A couple of comments and one bigger design thought.

Comments:

  • What's the workflow for Ensemble? Is the user required to write that loop themselves to call the Model functions?
  • Are there tests for Ensembles?

Should we allow the user to simply pass a DBModel to Experiment.start() in addition to the methods in the Model and Ensemble classes? DBObjects would be more user-facing, but

@al-rigazzi requested a review from @Spartee on Apr 19, 2022 22:20
@al-rigazzi (Collaborator, Author):

@Spartee re: passing the model to Experiment.start(): that is an interesting idea, but for now I'd prefer to keep the DBObjects a Model-related entity. If multiple models need the same DBObject, they either belong to an ensemble (which is now addressed), or the user can set the DBObject on all of them, or on just one of the models.
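
To make the ensemble workflow concrete, a sketch assuming the Ensemble entity exposes the same setters as Model (the method name and arguments here are assumptions):

    # Hypothetical ensemble usage; assumes Ensemble exposes the same setters as Model.
    from smartsim import Experiment

    exp = Experiment("ensemble-demo", launcher="local")
    rs = exp.create_run_settings(exe="python", exe_args="sim.py")
    ensemble = exp.create_ensemble("members", run_settings=rs, replicas=4)

    # Attaching the script once at the ensemble level applies it to every member,
    # so the user does not have to write the loop over the members manually.
    ensemble.add_script("preproc", script_path="preproc.py", device="CPU")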

@Spartee (Contributor) left a comment:

Very close; a couple of comments we should chat about today.

@Spartee (Contributor) left a comment:

Approving with minor comments. Looks great. Works on my Mac.

if not ensembles:
    return False

# Fold ensemble-level DB objects into the result: the bitwise OR on booleans
# acts as a logical OR, so this is True if any ensemble has models or scripts.
has_db_objects |= any(
    has_db_models(ensemble) | has_db_scripts(ensemble) for ensemble in ensembles
)
Contributor:

Add comments here to explain what's being done. Love the operator usage, but it's not readable.

Collaborator (Author):

Yep, done.

            args.device + f":{device_num}")
elif args.file:
    if args.devices_per_node == 1:
        client.set_script_from_file(args.name,
Contributor:

What happens if these fail? Have we tried setting a bad model?

Collaborator (Author):

The functions are launched inside a try/except block, so I think it should catch such an exception. Do you think we should add a test for that case?
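
For illustration, a minimal sketch of that error handling around a SmartRedis setter call (the address is hypothetical, and the exact exception raised for a bad model depends on the SmartRedis version):

    # Hypothetical error handling; the address and exception type are illustrative.
    from smartredis import Client

    client = Client(address="127.0.0.1:6780", cluster=False)

    try:
        # A malformed or unreadable model should surface here as an exception
        client.set_model_from_file("resnet", "resnet.pt", "TORCH", device="CPU")
    except Exception as exc:
        # Report the failure instead of letting the launcher helper die silently
        print(f"Failed to set model: {exc}")
        raise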

@al-rigazzi merged commit 8798c28 into CrayLabs:develop on May 11, 2022
al-rigazzi added a commit to al-rigazzi/SmartSim that referenced this pull request May 16, 2022
Adds utilities to set ML models and ML scripts directly from driver scripts, as opposed to only from application code.

[ committed by @al-rigazzi ]
[ reviewed by @Spartee ]
@al-rigazzi deleted the model-setter branch on March 20, 2023 19:52