v0.7.0 #584

al-rigazzi · 2024-05-14T22:57:15Z

Merge develop to master for release of v0.7.0

@ashao

This PR brings develop up to date with master after releasing v0.6.2 [ committed by @ashao ] [ reviewed by @al-rigazzi ]

@al-rigazzi

This PR bumps up the Python version used when building the tutorial containers (both production and development) to avoid module incompatibility. A minor bug in the tutorial container version number is also addressed as part of this PR. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]

@amandarichardsonn

This PR prevents the launch of duplicate named entities. Completed entities are allowed to rerun. [ committed by @amandarichardsonn ] [ reviewed by @ankona @MattToast ]
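The rule described above (a name cannot be reused while its entity is still running, but a completed entity may rerun) can be sketched roughly as follows. This is a minimal illustration, not SmartSim's implementation; `LaunchRegistry` and `DuplicateNameError` are hypothetical names invented for this example.

```python
from typing import Dict


class DuplicateNameError(Exception):
    """Hypothetical error type used only in this sketch."""


class LaunchRegistry:
    """Hypothetical tracker of launched entity names (not SmartSim's code)."""

    def __init__(self) -> None:
        # Maps entity name -> True while that entity is still running.
        self._active: Dict[str, bool] = {}

    def launch(self, name: str) -> None:
        # Refuse to launch if an entity with the same name is still running.
        if self._active.get(name, False):
            raise DuplicateNameError(f"Entity '{name}' is already running")
        self._active[name] = True

    def mark_completed(self, name: str) -> None:
        # A completed entity frees its name, so it can be launched again.
        self._active[name] = False
```

Keeping the check keyed on the name rather than the object identity is what blocks two distinct entities from sharing a name.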

@mellis13

Update the start(), stop(), generate(), and get_status() Experiment API functions from ``t.Any`` to ``t.Union[SmartSimEntity, EntitySequence[SmartSimEntity]]``. [ committed by @mellis13 ] [ reviewed by @ankona @MattToast ]

@ashao

The deploy_dev_docs action was failing due to running out of disk space in the container. This was alleviated by running the `maximize-build-space` GitHub action. [ committed by @ashao ] [ reviewed by @ankona ]

A fix in the build scripts of Redis 7.2.4 modifies build behavior on macOS on Apple Silicon. The change fixes an issue where incorrect compiler flags are defined, resulting in build failures due to the `redis_fstat` macro.

@MattToast

Promote SmartSim statuses to a dedicated type named SmartSimStatus. [ reviewed by @MattToast @al-rigazzi ] [ committed by @amandarichardsonn ]

@ashao

This PR provides a full refactor to the SmartSim documentation. This branch merges in sections: Experiment, Orchestrator, Model, Ensemble, RunSettings, BatchSettings, and SmartSim Logging. [ reviewed by @ashao @ankona @al-rigazzi @mellis13 @juliaputko ] [ committed by @amandarichardsonn ]

@MattToast

Removed deprecated SmartSim modules (slurm and mpirunSettings). [ reviewed by @MattToast @al-rigazzi ] [ committed by @amandarichardsonn ]

@mellis13

Jupyter notebook math expressions were not rendering locally or in docker container - update made to conf.py to fix. [ reviewed by @mellis13 @al-rigazzi ] [ committed by @amandarichardsonn ]

@mellis13

Implemented new check for edit to changelog.rst using [Changelog Enforcer](https://github.com/marketplace/actions/changelog-enforcer). [ reviewed by @mellis13 ] [ committed by @amandarichardsonn ]

@ashao

Adding readthedocs config file and robots.txt generation. [ reviewed by @ashao @mellis13 ] [ committed by @amandarichardsonn ]

@mellis13

Added `isinstance` check to RunSettings exe_args setter. Added additional tests. [ reviewed by @mellis13 ] [ committed by @amandarichardsonn ]
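A setter guarded by `isinstance` checks might look roughly like the sketch below. `RunSettingsSketch` is a hypothetical stand-in written for this note; the accepted types and the whitespace splitting are assumptions, not a copy of the actual `RunSettings` code.

```python
from typing import List, Optional, Union


class RunSettingsSketch:
    """Hypothetical stand-in for illustration; not SmartSim's RunSettings."""

    def __init__(self, exe_args: Union[str, List[str], None] = None) -> None:
        self._exe_args: List[str] = []
        self.exe_args = exe_args  # routed through the validating setter

    @property
    def exe_args(self) -> List[str]:
        return self._exe_args

    @exe_args.setter
    def exe_args(self, value: Optional[Union[str, List[str]]]) -> None:
        if value is None:
            self._exe_args = []
        elif isinstance(value, str):
            # Assumption: a single string is split on whitespace.
            self._exe_args = value.split()
        elif isinstance(value, list) and all(isinstance(a, str) for a in value):
            self._exe_args = list(value)
        else:
            raise TypeError("exe_args must be a str or a list of str")
```

Validating in the setter means both the constructor and later assignments reject bad input at the same place.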

@MattToast

Removes behavior deprecated in #480 from test suite. [ committed by @MattToast ] [ reviewed by @mellis13 ]

@MattToast

Configure mypy to raise an error when:

- An instance of an object is used for a boolean check when neither `__bool__` nor `__len__` is implemented
- An `Iterator` is used in a boolean check when the author almost certainly wanted a `Collection`

Fix up/refactor areas of the code base where these potential errors linger. [ committed by @MattToast ] [ reviewed by @ankona @al-rigazzi ]
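To illustrate the two pitfalls, here is a small sketch. The names `Step`, `step_name`, `any_pending`, and `lazy_names` are hypothetical; the checks likely correspond to mypy's optional `truthy-bool` and `truthy-iterable` error codes, though the exact configuration used in this PR is not shown here.

```python
from typing import Collection, Iterator, Optional


class Step:
    """Hypothetical class with neither __bool__ nor __len__: always truthy."""

    def __init__(self, name: str) -> None:
        self.name = name


def step_name(step: Optional[Step]) -> str:
    # Compare against None explicitly instead of writing `if step:`.
    if step is not None:
        return step.name
    return "<no step>"


def any_pending(pending: Collection[str]) -> bool:
    # A Collection supports len(), so emptiness is well defined.
    return len(pending) > 0


def lazy_names(names: Collection[str]) -> Iterator[str]:
    # Iterators do not implement __len__, so bool() on them is always True.
    return iter(names)
```

For example, `bool(lazy_names([]))` evaluates to `True` even though nothing will be yielded, which is why a truthiness check on an iterator is almost never what the author intended.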

@MattToast

Colo Orchestrator launch moved to a blocking process. Application executes once Orchestrator is built. [ reviewed by @MattToast @mellis13 @ashao @al-rigazzi ] [ committed by @amandarichardsonn ]

@al-rigazzi

This PR adds the method `set_node_feature` to srunSettings that accepts a str or list of strs. Users may now specify node constraints for slurm jobs. [ reviewed by @al-rigazzi ] [ committed by @amandarichardsonn ]
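As a rough sketch of how a str-or-list argument could be normalised into a Slurm constraint, consider the helper below. `node_feature_args` is a hypothetical function written for this note, and joining features with commas is an assumption; consult Slurm's `--constraint` documentation for the operators it accepts.

```python
from typing import Dict, List, Union


def node_feature_args(feature_list: Union[str, List[str]]) -> Dict[str, str]:
    """Hypothetical helper: build a constraint argument from node features."""
    if isinstance(feature_list, str):
        # A single feature is wrapped so both input shapes share one code path.
        feature_list = [feature_list.strip()]
    elif not all(isinstance(f, str) for f in feature_list):
        raise TypeError("node features must be a str or a list of str")
    # Assumption: features are joined with commas into one constraint value.
    return {"constraint": ",".join(feature_list)}
```

Accepting both shapes keeps the common single-feature case short while still allowing several features to be requested at once.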

@ankona

## New features

- Creates and integrates metric collection into the telemetry monitor using the new `CollectorManager`. Metrics are written using the new `FileSink` class.

## Included collectors:

- `DbMemoryCollector`
- `DbConnectionCollector`

## Updated features

- Switch basic experiment tracking telemetry to default to on.
- Improve telemetry monitor logging.
- Create telemetry subpackage at `smartsim._core.utils.telemetry`.
- Refactor telemetry monitor entrypoint.

[ committed by @ankona ] [ reviewed by @MattToast @ashao @amandarichardsonn @mellis13 ]

@MattToast

Replacing instances of ["CPU","GPU"] with a `Device` Enum. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]
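The general pattern of swapping raw device strings for an enum can be sketched as below. The member values and the `parse_device` helper are assumptions made for illustration; the actual `Device` definition in SmartSim may differ.

```python
from enum import Enum


class Device(Enum):
    """Sketch of a device enum; the member values are assumptions."""

    CPU = "cpu"
    GPU = "gpu"


def parse_device(value: str) -> Device:
    # Normalise user input once, then pass the enum around instead of strings.
    try:
        return Device(value.lower())
    except ValueError:
        raise ValueError(f"Unsupported device: {value!r}") from None
```

With an enum, a misspelled device fails at the boundary instead of silently flowing through string comparisons.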

@MattToast

Configure mypy to error when a potentially uninitialized variable is used. Fix lingering errors found by mypy. [ committed by @MattToast ] [ reviewed by @al-rigazzi @ankona ]
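A typical fix for this class of error is to initialise the variable on every path. The sketch below assumes the check in question is mypy's optional `possibly-undefined` error code; `resolve_port` and its default value are hypothetical.

```python
from typing import Mapping


def resolve_port(settings: Mapping[str, str]) -> int:
    # Problem pattern: assigning `port` only inside the `if` would leave it
    # unbound at the `return` whenever the key is missing.
    port = 6379  # hypothetical default, chosen only for illustration
    if "port" in settings:
        port = int(settings["port"])
    return port
```

Assigning the default before the branch makes the variable defined on every path, which satisfies the checker and removes the runtime `UnboundLocalError` risk.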

@al-rigazzi

Readthedocs fails and is blocking existing PRs. Failure is: `Extension error: Could not import extension sphinx_tabs.tabs (exception: cannot import name 'TypeAliasType' from 'typing_extensions')`. The issue came from pydantic==2.6.4 and typing_extensions==4.5.0. Sphinx uses Open AI which requires "pydantic>=1.9.0, <3", "typing-extensions>=4.7, <5". The versions have been changed in the readthedocs `yaml` file. [ reviewed by @al-rigazzi ] [ committed by @amandarichardsonn ]

@ashao

On systems that have the Intel Compilers and/or the Intel Math Kernel library installed, the Caffe2 package that comes with Torch will unconditionally try to link in the MKL during the Torch backend build. This however can lead to two types of failures:

- Problems when compiling the Torch backend because the linker does not include the MKL library path
- Loading the Torch backend into RedisAI fails because the user does not expect to need to have the MKL library loaded.

To alleviate this, a new option "--no_torch_with_mkl" has been added to the `smart build` command that modifies the mkl.cmake file to prevent the detection of MKL. [ committed by @ashao ] [ reviewed by @MattToast and @al-rigazzi ]

@MattToast

Fixes unfalsifiable test that tests SmartSim's custom SIGINT signal handler. Adds infrastructure to make the test pass again. [ committed by @MattToast ] [ reviewed by @ashao ]

@AlyssaCote

Moves .out and .err files under the `.smartsim` directory and creates a symlink to those files under the experiment directory. [ committed by @AlyssaCote ] [ reviewed by @ashao , @al-rigazzi ]

@ankona

Updated watchdog dependency pin to next major version and removed `type: ignore` where possible due to new type hints added to watchdog [ committed by @ankona ] [ reviewed by @AlyssaCote ]

@AlyssaCote

Python 3.8 is nearing its end of life so we're no longer supporting it. [ committed by @AlyssaCote ] [ reviewed by @MattToast @mellis13 ]

@ashao

This PR makes changes to the default path for SS entities. New default path is `exp_path/entity_name/`. A path argument has also been added to create_ensemble and create_model. [ reviewed by @ashao @mellis13 ] [ committed by @amandarichardsonn ]

@ankona

Ensures that a managed step-mapping doesn't include a `task_id`. The telemetry monitor exhibits a defect where it logs errors from a task manager, even with only managed tasks being monitored for updates: ![image](https://github.com/CrayLabs/SmartSim/assets/3595025/84921e5b-144b-4fcd-8289-48d2504deaac) This fix modifies the telemetry monitor to not set a `task_id` when adding items to the `step_mapping` collection. This avoids triggering lookups for unmanaged processes. ![image](https://github.com/CrayLabs/SmartSim/assets/3595025/dfc7cfad-a875-45b3-91d2-fa19407c9d0c) [ committed by @ankona ] [ approved by @MattToast ]

@MattToast

This PR removes the helper function `init_default` and instead implements traditional type narrowing. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]
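The removed helper is not shown in this description, so the sketch below only illustrates what "traditional type narrowing" usually means for an optional argument. `build_env` is a hypothetical function, not part of SmartSim.

```python
from typing import Dict, Optional


def build_env(env_vars: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # An explicit None check narrows Optional[Dict[str, str]] to Dict[str, str],
    # so type checkers can follow the type without a helper function.
    if env_vars is None:
        env_vars = {}
    # Return a copy so callers cannot mutate the argument through the result.
    return dict(env_vars)
```

Static checkers understand the `is None` branch directly, which a generic defaulting helper tends to obscure.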

@AlyssaCote

Bump ubuntu to version 22.04 [ committed by @AlyssaCote ] [ reviewed by @ashao ]

@AlyssaCote

In this PR I removed the defensive regexp in `.gitignore` and added `test_dir` to the tests that were writing to the `cwd` instead of the `test_output` directory. [ committed by @AlyssaCote ] [ reviewed by @ankona ]

@MattToast

Fixes:

- `tests/backends/test_onnx.py::test_sklearn_onnx`
  - Correctly set number of tasks when not using the local launcher
  - Makes sure the DB is not left running on test failure
- `tests/full_wlm/test_generic_orc_launch_batch.py::test_launch_cluster_orc_reconnect`
  - Look for pickle file under the orchestrator path rather than the test dir
  - Makes sure that DB is cleaned up correctly on test failure

Quiets:

- `tests/on_wlm/test_het_job.py`
  - All experiments under the test module are given an explicit test path

[ committed by @MattToast ] [ reviewed by @ashao ]

@AlyssaCote

After testing a bunch of batch ensembles and batch models, I found that I hadn't actually symlinked the substeps in the controller. This fix should properly symlink the substeps. [ committed by @AlyssaCote ] [ reviewed by @ankona ]

@AlyssaCote

The `manifest.json` version needs to be bumped from `0.0.3` to `0.0.4` to match the version of SmartDashboard. [ committed by @AlyssaCote ] [ reviewed by @MattToast ]

@MattToast

This PR adds to the release.yml GitHub workflow to autogenerate a PR that merges changes from master to develop. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]

@AlyssaCote

This PR removes :type: and :rtype: directives from function docstrings as well as implements the sphinx-autodoc-typehints extension. [ reviewed by @AlyssaCote ] [ committed by @amandarichardsonn ]

@MattToast

This PR updates the authentication used in the release workflow from a developer-created token to the GH_TOKEN environment variable. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]

@AlyssaCote

This PR adds a `release.yml` file to the root of the `.github` folder. Within the file we configure the release notes generated through PR tags. This PR also converts the changelog format from rst to md to match release notes format. [ reviewed by @AlyssaCote ] [ committed by @amandarichardsonn ]

@juliaputko

This PR adds the ``Experiment.preview`` method to display the entity summaries during runtime to offer additional insight into the launch details. The method surfaces entity information such as name, path, run settings, and client configuration of any instance of a ``Model``, ``Ensemble``, or ``Orchestrator`` prior to the start of an Experiment. [ committed by @juliaputko ] [ reviewed by @ankona @mellis13 @AlyssaCote @amandarichardsonn ]

@ashao

TensorFlow requires that typing_extensions<=4.6.0, however this causes the sphinx build process to fail due to an error importing sphinx_tabs. This is potentially a misleading error because sphinx_tabs itself does not use typing extensions, but the problem nevertheless exists even when testing other versions of Sphinx. To allow the deploy_dev_docs action to complete, this modifies the Dockerfile used to build the docs to ensure that typing_extensions==4.6.1 (which is the lowest version that does not throw pip resolution errors) is present in the python environment prior to the build. [ committed by @ashao ] [ reviewed by @amandarichardsonn ]

@al-rigazzi

This PR adds a Dragon-based launcher to SmartSim. The new launcher can be selected by specifying `launcher="dragon"`, e.g. in the `Experiment` constructor. The Dragon launcher can be used on systems where the PBS or Slurm schedulers are available (MPI-based applications currently require Cray PMI or Cray PALS). A new `--dragon` option was added to `smart build` to install the correct Dragon package in the Python environment. More information about the Dragon launcher is available in the dedicated documentation section. Please note that the Dragon launcher is at an early development stage, and we would love to hear feedback about it. [ committed by @al-rigazzi @ankona @MattToast @amandarichardsonn ] [ reviewed by @mellis13 @ankona @MattToast @amandarichardsonn ]

Co-authored-by: Matt Drozt
Co-authored-by: Chris McBride

@ashao

Tests which needed to launch an Orchestrator were spinning up and shutting down their own instances. This led to a number of cases where a single test failing would cascade into failures of other tests. Additionally, this also meant that a significant amount of time in the tests was spent waiting for Orchestrators to launch. This PR adds a session-scoped fixture that returns an Orchestrator. Most tests which use an Orchestrator have been updated to use this fixture; the remaining tests, for various reasons, still need to spin up their own (for example the multiple database tests need to have a named Orchestrator). [ committed by @ashao ] [ reviewed by @ankona @AlyssaCote ]

Co-authored-by: Matt Drozt

@al-rigazzi

The Dragon server could fail, dumping a core file, if it was shut down before all spawned Process Groups completed. This PR fixes such behavior: the immediate flag on the `DragonShutdownRequest` now requests every non-terminated job to be stopped. [ committed by @al-rigazzi ] [ reviewed by @ashao ]

@al-rigazzi

Update version to 0.7.0 and SmartRedis's version to 0.5.3 [ committed by @al-rigazzi ] [ reviewed by @amandarichardsonn ]

juliaputko

lgtm

AlyssaCote

LGTM!

al-rigazzi and others added 30 commits February 16, 2024 19:37

0.6.2 (#495)

b46d84d

This PR brings develop up to date with master after releasing v0.6.2 [ committed by @ashao ] [ reviewed by @al-rigazzi ]

Duplicate entity name prevention (#480)

39354db

This PR prevents the launch of duplicate named entities. Completed entities are allowed to rerun. [ committed by @amandarichardsonn ] [ reviewed by @ankona @MattToast ]

Change generic t.Any in Experiment API (#501)

36e1f44

Update the start(), stop(), generate(), and get_status() Experiment API functions from ``t.Any`` to ``t.Union[SmartSimEntity, EntitySequence[SmartSimEntity]]``. [ committed by @mellis13 ] [ reviewed by @ankona @MattToast ]

Increase disk space in container for doc builder (#504)

b44ef3a

The deploy_dev_docs action was failing due to running out of disk space in the container. This was alleviated by running the `maximize-build-space` GitHub action. [ committed by @ashao ] [ reviewed by @ankona ]

Bump Redis dependency to 7.2.4 (#507)

63836a9

A fix in the build scripts of Redis 7.2.4 modifies build behavior on macOS on Apple Silicon. The change fixes an issue where incorrect compiler flags are defined, resulting in build failures due to the `redis_fstat` macro.

Change Status Module (#509)

33ee012

Promote SmartSim statuses to a dedicated type named SmartSimStatus. [ reviewed by @MattToast @al-rigazzi ] [ committed by @amandarichardsonn ]

Remove Long Deprecated SmartSim Modules (#514)

6660efc

Removed deprecated SmartSim modules (slurm and mpirunSettings). [ reviewed by @MattToast @al-rigazzi ] [ committed by @amandarichardsonn ]

Formatting in Jupyter Notebooks (#516)

4722b4f

Jupyter notebook math expressions were not rendering locally or in docker container - update made to conf.py to fix. [ reviewed by @mellis13 @al-rigazzi ] [ committed by @amandarichardsonn ]

Enforce changelog for SmartSim PRs (#518)

9e74ba9

Implemented new check for edit to changelog.rst using [Changelog Enforcer](https://github.com/marketplace/actions/changelog-enforcer). [ reviewed by @mellis13 ] [ committed by @amandarichardsonn ]

ReadTheDocs Configuration File (#512)

10e084e

Adding readthedocs config file and robots.txt generation. [ reviewed by @ashao @mellis13 ] [ committed by @amandarichardsonn ]

Correct ExecArgs Handling During RunSetting (#517)

e307b72

Added `isinstance` check to RunSettings exe_args setter. Added additional tests. [ reviewed by @mellis13 ] [ committed by @amandarichardsonn ]

Remove duplicate launched model names from full test suite (#520)

6dea582

Removes behavior deprecated in #480 from test suite. [ committed by @MattToast ] [ reviewed by @mellis13 ]

Application executes before colocated Orchestrator is created (#522)

06d6166

Colo Orchestrator launch moved to a blocking process. Application executes once Orchestrator is built. [ reviewed by @MattToast @mellis13 @ashao @al-rigazzi ] [ committed by @amandarichardsonn ]

Specify node feature for slurm job (#529)

4b35cc9

This PR adds the method `set_node_feature` to srunSettings that accepts a str or list of strs. Users may now specify node constraints for slurm jobs. [ reviewed by @al-rigazzi ] [ committed by @amandarichardsonn ]

Promote Build Device Option to Enum (#527)

fa0da2c

Replacing instances of ["CPU","GPU"] with a `Device` Enum. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]

Disallow Uninitialized Variable Use (#521)

6f800b1

Configure mypy to error when a potentially uninitialized variable is used. Fix lingering errors found by mypy. [ committed by @MattToast ] [ reviewed by @al-rigazzi @ankona ]

Enhanced Signal Management (#535)

505de50

Fixes unfalsifiable test that tests SmartSim's custom SIGINT signal handler. Adds infrastructure to make the test pass again. [ committed by @MattToast ] [ reviewed by @ashao ]

Store SmartSim entity logs under the .smartsim directory (#532)

3edd895

Moves .out and .err files under the `.smartsim` directory and creates a symlink to those files under the experiment directory. [ committed by @AlyssaCote ] [ reviewed by @ashao , @al-rigazzi ]

Update watchdog dependency (#540)

1267a9a

Updated watchdog dependency pin to next major version and removed `type: ignore` where possible due to new type hints added to watchdog [ committed by @ankona ] [ reviewed by @AlyssaCote ]

Drop python 3.8 (#544)

4c9643c

Python 3.8 is nearing its end of life so we're no longer supporting it. [ committed by @AlyssaCote ] [ reviewed by @MattToast @mellis13 ]

Change default path for entities (#533)

f5beb41

This PR makes changes to the default path for SS entities. New default path is `exp_path/entity_name/`. A path argument has also been added to create_ensemble and create_model. [ reviewed by @ashao @mellis13 ] [ committed by @amandarichardsonn ]

Remove init_default function (#545)

044c4bd

This PR removes the helper function `init_default` and instead implements traditional type narrowing. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]

Upgrade ubuntu to 22.04 (#558)

04ea493

Bump ubuntu to version 22.04 [ committed by @AlyssaCote ] [ reviewed by @ashao ]

AlyssaCote and others added 14 commits April 22, 2024 07:56

Remove defensive regexp in .gitignore (#560)

75118ba

In this PR I removed the defensive regexp in `.gitignore` and added `test_dir` to the tests that were writing to the `cwd` instead of the `test_output` directory. [ committed by @AlyssaCote ] [ reviewed by @ankona ]

Symlink batch ensembles and batch models (#547)

62f2e8c

After testing a bunch of batch ensembles and batch models, I found that I hadn't actually symlinked the substeps in the controller. This fix should properly symlink the substeps. [ committed by @AlyssaCote ] [ reviewed by @ankona ]

Bump manifest.json version to 0.0.4 (#563)

399886b

The `manifest.json` version needs to be bumped from `0.0.3` to `0.0.4` to match the version of SmartDashboard. [ committed by @AlyssaCote ] [ reviewed by @MattToast ]

Auto-post release PR to develop (#566)

05a1e0a

This PR adds to the release.yml GitHub workflow to autogenerate a PR that merges changes from master to develop. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]

Auto generate typehints into documentation (#561)

f5f2385

This PR removes :type: and :rtype: directives from function docstrings as well as implements the sphinx-autodoc-typehints extension. [ reviewed by @AlyssaCote ] [ committed by @amandarichardsonn ]

Set GH_TOKEN environment variable to use Github CLI in workflow (#570)

7db8490

This PR updates the authentication used in the release workflow from a developer-created token to the GH_TOKEN environment variable. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]

Update version number to 0.7.0 (#583)

ac80685

Update version to 0.7.0 and SmartRedis's version to 0.5.3 [ committed by @al-rigazzi ] [ reviewed by @amandarichardsonn ]

al-rigazzi requested review from ashao, mellis13, amandarichardsonn, AlyssaCote and juliaputko May 14, 2024 23:50

juliaputko approved these changes May 14, 2024


AlyssaCote approved these changes May 14, 2024


al-rigazzi merged commit 5039699 into master May 15, 2024
70 checks passed
