-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v0.7.0 #584
Merged
Merged
v0.7.0 #584
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This PR bumps up the Python version used when building the tutorial containers (both production and development) to avoid module incompatibility. A minor bug in the tutorial container version number is also addressed as part of this PR. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
This PR prevents the launch of duplicate named entities. Completed entities are allowed to rerun. [ committed by @amandarichardsonn ] [ reviewed by @ankona @MattToast ]
Update the start(), stop(), generate(), and get_status() Experiment API functions from ``t.any`` to ``t.Union[SmartSimEntity, EntitySequence[SmartSimEntity]]``. [ committed by @mellis13 ] [ reviewed by @ankona @MattToast ]
A fix in the build scripts of Redis 7.2.4 modifies build behavior on MacOS on Apple Silicon. The change fixes an issue where incorrect compiler flags are defined and result in build failures due to the `redis_fstat` macro
Promote SmartSim statuses to a dedicated type named SmartSimStatus. [ reviewed by @MattToast @al-rigazzi ] [ committed by @amandarichardsonn ]
This PR provides a full refactor to the SmartSim documentation. This branch merges in sections: Experiment, Orchestrator, Model, Ensemble, RunSettings, BatchSettings, and SmartSim Logging. [ reviewed by @ashao @ankona @al-rigazzi @mellis13 @juliaputko ] [ committed by @amandarichardsonn ]
Removed deprecated SmartSim modules (slurm and mpirunSettings). [ reviewed by @MattToast @al-rigazzi ] [ committed by @amandarichardsonn ]
Jupyter notebook math expressions were not rendering locally or in docker container - update made to conf.py to fix. [ reviewed by @mellis13 @al-rigazzi ] [ committed by @amandarichardsonn ]
Implemented new check for edit to changelog.rst using [Changelog Enforcer](https://github.com/marketplace/actions/changelog-enforcer). [ reviewed by @mellis13 ] [ committed by @amandarichardsonn ]
Adding readthedocs config file and robots.txt generation. [ reviewed by @ashao @mellis13 ] [ committed by @amandarichardsonn ]
Added `isinstance` check to RunSettings exe_args setter. Added additional tests. [ reviewed by @mellis13 ] [ committed by @amandarichardsonn ]
Removes behavior deprecated in #480 from test suite. [ committed by @MattToast ] [ reviewed by @mellis13 ]
Configure mypy to raise an error when: - An instance of an object is used for a boolean check when neither `__bool__` or `__len__` are implemented - A `Iterator` is used on a boolean check when the author almost certainly wanted a `Collection` Fix up/refactors areas of the code base where these potential errors linger. [ committed by @MattToast ] [ reviewed by @ankona @al-rigazzi ]
Colo Orchestrator launch moved to a blocking process. Application executes once Orchestrator is built. [ reviewed by @MattToast @mellis13 @ashao @al-rigazzi ] [ committed by @amandarichardsonn ]
This PR adds the method `set_node_feature` to srunSettings that accepts a str or list of strs. Users may now specify node constraints for slurm jobs. [ reviewed by @al-rigazzi ] [ committed by @amandarichardsonn ]
## New features - Creates and integrates metric collection into the telemetry monitor using the new `CollectorManager`. Metrics are written using the new `FileSink` class ## Included collectors: - `DbMemoryCollector` - `DbConnectionCollector` ## Updated features - Switch basic experiment tracking telemetry to default to on. - Improve telemetry monitor logging. - Create telemetry subpackage at `smartsim._core.utils.telemetry`. - Refactor telemetry monitor entrypoint. [committed by @ankona ] [reviewed by @MattToast @ashao @amandarichardsonn @mellis13 ]
Removing instances of ["CPU","GPU"] with a `Device` Enum. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]
Configure mypy to error when a potentially uninitialized variable is used. Fix lingering errors found by mypy. [ committed by @MattToast ] [ reviewed by @al-rigazzi @ankona ]
Readthedocs fails and is blocking existing PRs. Failure is: `Extension error: Could not import extension sphinx_tabs.tabs (exception: cannot import name 'TypeAliasType' from 'typing_extensions')`. The issue came from pydantic==2.6.4 and typing_extensions==4.5.0. Sphinx uses Open AI which requires "pydantic>=1.9.0, <3", "typing-extensions>=4.7, <5". The versions have been changed in the readthedocs `yaml` file. [ reviewed by @al-rigazzi ] [ committed by @amandarichardsonn ]
On systems that have the Intel Compilers and/or the Intel Math Kernel library installed, the Caffe2 package that comes with Torch will unconditionally try to link in the MKL during the Torch backend. This however can lead to two types of failures: - Problems when compiling the Torch backend because the linker does not include the path to the MKL library path - Loading the Torch backend into RedisAI fails because the user does not expect to need to have the MKL library loaded. To alleviate this, a new option "--no_torch_with_mkl" has been added to the `smart build` command that modifies the mkl.cmake file to prevent the detection of MKL. [ committed by @ashao ] [ reviewed by @MattToast and @al-rigazzi ]
Fixes unfalsifiable test that tests SmartSim's custom SIGINT signal handler. Adds infrastructure to make the test pass again. [ committed by @MattToast ] [ reviewed by @ashao ]
Moves .out and .err files under the `.smartsim` directory and creates a symlink to those files under the experiment directory. [ committed by @AlyssaCote ] [ reviewed by @ashao , @al-rigazzi ]
Updated watchdog dependency pin to next major version and removed `type: ignore` where possible due to new type hints added to watchdog [ committed by @ankona ] [ reviewed by @AlyssaCote ]
Python 3.8 is nearing its end of life so we're no longer supporting it. [ committed by @AlyssaCote ] [ reviewed by @MattToast @mellis13 ]
This PR makes changes to the default path for SS entities. New default path is `exp_path/entity_name/`. A path argument has also been added to create_ensemble and create_model. [ reviewed by @ashao @mellis13 ] [ committed by @amandarichardsonn ]
Ensures that a managed step-mapping doesn't include a `task_id`. The telemetry monitor exhibits a defect where it logs errors from a task manager, even with only managed tasks being monitored for updates:  This fix modifies the telemetry monitor to not set a `task_id` when adding items to the `step_mapping` collection. This avoids triggering lookups for unmanaged processes.  [ committed by @ankona ] [ approved by @MattToast ]
This PR removes the helper function `init_default` and instead implements traditional type narrowing. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]
Bump ubuntu to version 22.04 [ committed by @AlyssaCote ] [ reviewed by @ashao ]
In this PR I removed the defensive regexp in `.gitignore` and added `test_dir` to the tests that were writing to the `cwd` instead of the `test_output` directory. [ committed by @AlyssaCote ] [ reviewed by @ankona ]
Fixes: - `tests/backends/test_onnx.py::test_sklearn_onnx` - Correctly set number of tasks when not using the local launcher - Makes sure the DB is not left running on test failure - `tests/full_wlm/test_generic_orc_launch_batch.py::test_launch_cluster_orc_reconnect` - Look for pickle file under the orchestrator path rather than the test dir - Makes sure that DB is cleaned up correctly on test failure Quiets: - `tests/on_wlm/test_het_job.py` - All experiments under the test module are given an explict test path [ committed by @MattToast ] [ reviewed by @ashao ]
After testing a bunch of batch ensembles and batch models, I found that I hadn't actually symlinked the substeps in the controller. This fix should properly symlink the substeps. [ committed by @AlyssaCote ] [ reviewed by @ankona ]
The `manifest.json` version needs to be bumped from `0.0.3` to `0.0.4` to match the version of SmartDashboard. [ committed by @AlyssaCote ] [ reviewed by @MattToast ]
This PR adds to the release.yml github workflow to autogenerate a PR that merge changes from master to develop. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]
This PR removes :type: and :rtype: driectives from function docstrings as well as implements the sphinx-autodoc-typehints extension. [ reviewed by @AlyssaCote ] [ committed by @amandarichardsonn ]
This PR updates the authetication used in the release workflow from a developer created token to the GH_TOKEN environment variable. [ reviewed by @MattToast ] [ committed by @amandarichardsonn ]
This PR adds a `release.yml` file to the root of the `.github` folder. Within the file we configure the release notes generated through PR tags. This PR also converts the changelog format from rst to md to match release notes format. [ reviewed by @AlyssaCote ] [ committed by @amandarichardsonn ]
This PR adds the ``Experiment.preview`` method to display the entity summaries during runtime to offer additional insight into the launch details. The method surfaces entity information such as name, path, run settings, and client configuration of any instance of a ``Model``, ``Ensemble``, or ``Orchestrator`` prior to the start on an Experiment. [ committed by @juliaputko ] [ reviewed by @ankona @mellis13 @AlyssaCote @amandarichardsonn ]
Tensorflow requires that typing_extensions<=4.6.0, however this cuases the sphinx build process to fail due to an error importing sphinx_tabs. This is potentially a misleading error because sphinx_tabs itself does not use typing extensions, but the problem nevertheless exists even when testing other versions of Sphinx. To allow the deploy_dev_docs action to complete, this modifies the Dockerfile used to build the docs to ensure that typing_extensions==4.6.1 (which is the lowest version that does not throw pip resolution errors) is present in the python environment prior to the build. [ committed by @ashao ] [ reviewed by @amandarichardsonn ]
This PR adds a Dragon-based launcher to SmartSim. The new launcher can be selected by specifying `launcher="dragon"`, e.g. in the `Experiment` constructor. The Dragon launcher can be used on systems where the PBS or Slurm schedulers are available (MPI-based applications currently require Cray PMI or Cray PALS). A new `--dragon` option was added to `smart build` to install the correct Dragon package in the Python environment. More information about the Dragon launcher is available in the dedicated documentation section. Please note that the Dragon launcher is at an early development stage, and we would love to hear feedback about it. [ committed by @al-rigazzi @ankona @MattToast @amandarichardsonn ] [ reviewed by @mellis13 @ankona @MattToast @amandarichardsonn ] --------- Co-authored-by: Matt Drozt <[email protected]> Co-authored-by: Chris McBride <[email protected]>
Tests which needed to launch an Orchestrator were spinning up and shutting down their own instances. This led to a number of cases where a single test failing would cascade into failures of other tests. Additionally, this also meant that a significant amount of time in the tests was spent waiting for Orchestrators to launch. This PR adds a session-scoped fixture that returns an Orchestrator. Most tests which use an Orchestrator have been updated to use this fixture; the remaining for various reasons still need to spin up their own (for example the multiple database tests need to have a named Orchestrator). [ committed by @ashao ] [ reviewed by @ankona @AlyssaCote ] Co-authored-by: Matt Drozt <[email protected]>
The Dragon server could fail, dumping a core file, if it was shut down before all spawned Process Groups completed. This PR fixes such behavior: the immediate flag on the `DragonShutdownRequest` now requests every non-terminated job to be stopped. [ committed by @al-rigazzi ] [ reviewed by @ashao ]
Update version to 0.7.0 and SmartRedis's version to 0.5.3 [ committed by @al-rigazzi ] [ reviewed by @amandarichardsonn ]
juliaputko
approved these changes
May 14, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
AlyssaCote
approved these changes
May 14, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Merge develop to master for release of v0.7.0