Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.7.0 #584

Merged
merged 44 commits into from
May 15, 2024
Merged

v0.7.0 #584

merged 44 commits into from
May 15, 2024

Conversation

al-rigazzi
Copy link
Collaborator

Merge develop to master for release of v0.7.0

al-rigazzi and others added 30 commits February 16, 2024 19:37
This PR brings develop up to date with master after releasing v0.6.2

[ committed by @ashao ]
[ reviewed by @al-rigazzi ]
This PR bumps up the Python version used when building the tutorial
containers (both production and development) to avoid module incompatibility.
A minor bug in the tutorial container version number is also addressed
as part of this PR.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast ]
This PR prevents the launch of duplicate named entities. Completed entities are allowed to rerun.

[ committed by @amandarichardsonn ]
[ reviewed by @ankona @MattToast ]
Update the start(), stop(), generate(), and get_status() Experiment
API functions from ``t.any``  to 
``t.Union[SmartSimEntity, EntitySequence[SmartSimEntity]]``.

[ committed by @mellis13 ]
[ reviewed by @ankona @MattToast ]
The deploy_dev_docs action was failing due to running out of disk space
in the container. This was alleviated by running the 
`maximize-build-space` Github action.

[ committed by @ashao ]
[ reviewed by @ankona ]
A fix in the build scripts of Redis 7.2.4 modifies build behavior on
MacOS on Apple Silicon. The change fixes an issue where incorrect
compiler flags are defined and result in build failures due to the
`redis_fstat` macro
Promote SmartSim statuses to a dedicated type named SmartSimStatus.

[ reviewed by @MattToast @al-rigazzi ]
[ committed by @amandarichardsonn ]
This PR provides a full refactor to the SmartSim documentation.
This branch merges in sections: Experiment, Orchestrator,
Model, Ensemble, RunSettings, BatchSettings, and SmartSim Logging.

[ reviewed by @ashao @ankona @al-rigazzi @mellis13 @juliaputko ]
[ committed by @amandarichardsonn ]
Removed deprecated SmartSim modules (slurm and mpirunSettings).

[ reviewed by @MattToast @al-rigazzi ]
[ committed by @amandarichardsonn ]
Jupyter notebook math expressions were not rendering locally or in
docker container - update made to conf.py to fix.

[ reviewed by @mellis13 @al-rigazzi ]
[ committed by @amandarichardsonn ]
Implemented new check for edit to changelog.rst using [Changelog
Enforcer](https://github.com/marketplace/actions/changelog-enforcer).

[ reviewed by @mellis13 ]
[ committed by @amandarichardsonn ]
Adding readthedocs config file and robots.txt generation.

[ reviewed by @ashao @mellis13 ]
[ committed by @amandarichardsonn ]
Added `isinstance` check to RunSettings exe_args setter. Added
additional tests.

[ reviewed by @mellis13 ]
[ committed by @amandarichardsonn ]
Removes behavior deprecated in #480 from test suite.

[ committed by @MattToast ]
[ reviewed by @mellis13 ]
Configure mypy to raise an error when:
- An instance of an object is used for a boolean check when neither
`__bool__` or `__len__` are implemented
- A `Iterator` is used on a boolean check when the author almost
certainly wanted a `Collection`

Fix up/refactors areas of the code base where these potential errors
linger.

[ committed by @MattToast ]
[ reviewed by @ankona @al-rigazzi ]
Colo Orchestrator launch moved to a blocking process. Application executes
once Orchestrator is built.

[ reviewed by @MattToast @mellis13 @ashao @al-rigazzi ]
[ committed by @amandarichardsonn ]
This PR adds the method `set_node_feature` to srunSettings that accepts
a str or list of strs. Users may now specify node constraints for slurm
jobs.

[ reviewed by @al-rigazzi ]
[ committed by @amandarichardsonn ]
## New features

- Creates and integrates metric collection into the telemetry monitor
using the new `CollectorManager`. Metrics are written using the new
`FileSink` class

## Included collectors:

- `DbMemoryCollector`
- `DbConnectionCollector`

## Updated features

- Switch basic experiment tracking telemetry to default to on. 
- Improve telemetry monitor logging. 
- Create telemetry subpackage at `smartsim._core.utils.telemetry`. 
- Refactor telemetry monitor entrypoint.

[committed by @ankona ]
[reviewed by @MattToast @ashao @amandarichardsonn @mellis13 ]
Removing instances of ["CPU","GPU"] with a `Device` Enum.

[ reviewed by @MattToast ]
[ committed by @amandarichardsonn ]
Configure mypy to error when a potentially uninitialized variable is
used. Fix lingering errors found by mypy.

[ committed by @MattToast ]
[ reviewed by @al-rigazzi @ankona ]
Readthedocs fails and is blocking existing PRs. Failure is: `Extension
error: Could not import extension sphinx_tabs.tabs (exception: cannot
import name 'TypeAliasType' from 'typing_extensions')`. The issue came
from pydantic==2.6.4 and typing_extensions==4.5.0. Sphinx uses Open AI
which requires "pydantic>=1.9.0, <3", "typing-extensions>=4.7, <5". The
versions have been changed in the readthedocs `yaml` file.

[ reviewed by @al-rigazzi ]
[ committed by @amandarichardsonn ]
On systems that have the Intel Compilers and/or the Intel Math Kernel
library installed, the Caffe2 package that comes with Torch will
unconditionally try to link in the MKL during the Torch backend. This
however can lead to two types of failures:
- Problems when compiling the Torch backend because the linker does not
include the path to the MKL library path
- Loading the Torch backend into RedisAI fails because the user does not
expect to need to have the MKL library loaded.

To alleviate this, a new option "--no_torch_with_mkl" has been added to
the `smart build` command that modifies the mkl.cmake file to prevent
the detection of MKL.

[ committed by @ashao ]
[ reviewed by @MattToast and @al-rigazzi  ]
Fixes unfalsifiable test that tests SmartSim's custom SIGINT signal
handler. Adds infrastructure to make the test pass again.

[ committed by @MattToast ]
[ reviewed by @ashao ]
Moves .out and .err files under the `.smartsim` directory and creates a symlink to those files under the experiment directory.
[ committed by @AlyssaCote ]
[ reviewed by @ashao , @al-rigazzi ]
Updated watchdog dependency pin to next major version and removed `type:
ignore` where possible due to new type hints added to watchdog


[ committed by @ankona ]
[ reviewed by @AlyssaCote ]
Python 3.8 is nearing its end of life so we're no longer supporting it. 
[ committed by @AlyssaCote ]
[ reviewed by @MattToast @mellis13 ]
This PR makes changes to the default path for SS entities. New default
path is `exp_path/entity_name/`. A path argument has also been added to
create_ensemble and create_model.

[ reviewed by @ashao @mellis13 ]
[ committed by @amandarichardsonn ]
Ensures that a managed step-mapping doesn't include a `task_id`.

The telemetry monitor exhibits a defect where it logs errors from a task
manager, even with only managed tasks being monitored for updates:


![image](https://github.com/CrayLabs/SmartSim/assets/3595025/84921e5b-144b-4fcd-8289-48d2504deaac)

This fix modifies the telemetry monitor to not set a `task_id` when
adding items to the `step_mapping` collection. This avoids triggering
lookups for unmanaged processes.


![image](https://github.com/CrayLabs/SmartSim/assets/3595025/dfc7cfad-a875-45b3-91d2-fa19407c9d0c)

[ committed by @ankona ]
[ approved by @MattToast ]
This PR removes the helper function `init_default` and instead
implements traditional type narrowing.

[ reviewed by @MattToast ]
[ committed by @amandarichardsonn ]
Bump ubuntu to version 22.04

[ committed by @AlyssaCote ]
[ reviewed by @ashao ]
AlyssaCote and others added 14 commits April 22, 2024 07:56
In this PR I removed the defensive regexp in `.gitignore` and added
`test_dir` to the tests that were writing to the `cwd` instead of the
`test_output` directory.

[ committed by @AlyssaCote ]
[ reviewed by @ankona ]
Fixes:
- `tests/backends/test_onnx.py::test_sklearn_onnx`
  - Correctly set number of tasks when not using the local launcher
  - Makes sure the DB is not left running on test failure
-
`tests/full_wlm/test_generic_orc_launch_batch.py::test_launch_cluster_orc_reconnect`
- Look for pickle file under the orchestrator path rather than the test
dir
  - Makes sure that DB is cleaned up correctly on test failure

Quiets:
- `tests/on_wlm/test_het_job.py`
  - All experiments under the test module are given an explict test path

[ committed by @MattToast ]
[ reviewed by @ashao ]
After testing a bunch of batch ensembles and batch models, I found that
I hadn't actually symlinked the substeps in the controller. This fix
should properly symlink the substeps.

[ committed by @AlyssaCote ]
[ reviewed by @ankona ]
The `manifest.json` version needs to be bumped from `0.0.3` to `0.0.4`
to match the version of SmartDashboard.

[ committed by @AlyssaCote ]
[ reviewed by @MattToast ]
This PR adds to the release.yml github workflow to autogenerate a PR
that merge changes from master to develop.

[ reviewed by @MattToast ]
[ committed by @amandarichardsonn ]
This PR removes :type: and :rtype: driectives from function docstrings
as well as implements the sphinx-autodoc-typehints extension.

[ reviewed by @AlyssaCote ]
[ committed by @amandarichardsonn ]
This PR updates the authetication used in the release workflow from a
developer created token to the GH_TOKEN environment variable.

[ reviewed by @MattToast ]
[ committed by @amandarichardsonn ]
This PR adds a `release.yml` file to the root of the `.github` folder.
Within the file we configure the release notes generated through PR
tags. This PR also converts the changelog format from rst to md to match
release notes format.

[ reviewed by @AlyssaCote ]
[ committed by @amandarichardsonn ]
This PR adds the ``Experiment.preview`` method to display the entity summaries 
during runtime to offer additional insight into the launch details.

The method surfaces entity information such as name, path, run settings, and client 
configuration of any instance of a ``Model``, ``Ensemble``, or ``Orchestrator`` prior 
to the start on an Experiment. 

 [ committed by @juliaputko ]
 [ reviewed by @ankona @mellis13 @AlyssaCote @amandarichardsonn ]
Tensorflow requires that typing_extensions<=4.6.0, however this cuases
the sphinx build process to fail due to an error importing sphinx_tabs.
This is potentially a misleading error because sphinx_tabs itself does
not use typing extensions, but the problem nevertheless exists even when
testing other versions of Sphinx. To allow the deploy_dev_docs action to
complete, this modifies the Dockerfile used to build the docs to ensure
that typing_extensions==4.6.1 (which is the lowest version that does not
throw pip resolution errors) is present in the python environment prior
to the build.

[ committed by @ashao ]
[ reviewed by @amandarichardsonn ]
This PR adds a Dragon-based launcher to SmartSim. The new launcher can be selected by specifying `launcher="dragon"`, e.g. in the `Experiment` constructor. The Dragon launcher can be used on systems where the PBS or Slurm schedulers are available (MPI-based applications currently require Cray PMI or Cray PALS). 
A new `--dragon` option was added to `smart build` to install the correct Dragon package in the Python environment. More information about the Dragon launcher is available in the dedicated documentation section. Please note that the Dragon launcher is at an early development stage, and we would love to hear feedback about it.

[ committed by @al-rigazzi @ankona @MattToast @amandarichardsonn ]
[ reviewed by @mellis13 @ankona @MattToast @amandarichardsonn ]

---------

Co-authored-by: Matt Drozt <[email protected]>
Co-authored-by: Chris McBride <[email protected]>
Tests which needed to launch an Orchestrator were spinning up and
shutting down their own instances. This led to a number of cases where a
single test failing would cascade into failures of other tests.
Additionally, this also meant that a significant amount of time in the
tests was spent waiting for Orchestrators to launch.

This PR adds a session-scoped fixture that returns an Orchestrator. Most
tests which use an Orchestrator have been updated to use this fixture;
the remaining for various reasons still need to spin up their own (for
example the multiple database tests need to have a named Orchestrator).

[ committed by @ashao ]
[ reviewed by @ankona @AlyssaCote ]

Co-authored-by: Matt Drozt <[email protected]>
The Dragon server could fail, dumping a core file, if it was shut down
before all spawned Process Groups completed. This PR fixes such
behavior: the immediate flag on the `DragonShutdownRequest` now requests
every non-terminated job to be stopped.

[ committed by @al-rigazzi ]
[ reviewed by @ashao ]
Update version to 0.7.0 and SmartRedis's version to 0.5.3

[ committed by @al-rigazzi ]
[ reviewed by @amandarichardsonn ]
Copy link
Contributor

@juliaputko juliaputko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@AlyssaCote AlyssaCote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@al-rigazzi al-rigazzi merged commit 5039699 into master May 15, 2024
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants