Improve testing robustness on SLURM machines #381

Merged
merged 73 commits into from
Dec 11, 2023

ashao
Collaborator

@ashao ashao commented Oct 3, 2023

Some aspects of the testing were failing on SLURM machines due to (1) inconsistent assumptions about how the tasks should be run and (2) suspected hidden race conditions, potentially related to the filesystem, that caused non-idempotent behavior when running the tests. These defects were ameliorated by ensuring that the failing tests used only a single task and by creating a separate run directory for every variant of the test.
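
For illustration only, here is a minimal sketch of the "separate run directory for every variant" idea. The helper `variant_run_dir` and the directory naming scheme are hypothetical and are not taken from this PR's test suite; the sketch only shows how giving each (test, variant) pair its own directory keeps runs from sharing files.

```python
import re
import tempfile
from pathlib import Path


def variant_run_dir(root: Path, test_name: str, variant: str) -> Path:
    """Create and return a directory unique to one (test, variant) pair.

    Hypothetical helper for illustration; not part of SmartSim.
    """
    # Replace characters such as "[" and "]" so the id is a safe path component
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", f"{test_name}-{variant}").strip("_")
    run_dir = root / safe
    # exist_ok=False makes an accidental directory collision fail loudly
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


root = Path(tempfile.mkdtemp(prefix="test_output_"))
dir_a = variant_run_dir(root, "test_launch", "ensemble[a]")
dir_b = variant_run_dir(root, "test_launch", "ensemble[b]")
```

Because each variant writes only inside its own directory, leftover files from one variant cannot influence another, which is one plausible way to remove the filesystem-related non-idempotency described above.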

@ashao ashao requested a review from MattToast October 3, 2023 15:22
Member

@MattToast MattToast left a comment

Looks great! A couple of nit-picky changes requested, but otherwise looks about ready to go!!

@MattToast MattToast self-requested a review October 4, 2023 20:54
MattToast
MattToast previously approved these changes Oct 4, 2023
Member

@MattToast MattToast left a comment

LGTM, pending tests!!

@codecov

codecov bot commented Oct 16, 2023

Codecov Report

Merging #381 (0663f5d) into develop (d8fba1b) will decrease coverage by 0.10%.
The diff coverage is 98.36%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #381      +/-   ##
===========================================
- Coverage    90.38%   90.29%   -0.10%     
===========================================
  Files           60       60              
  Lines         3839     3864      +25     
===========================================
+ Hits          3470     3489      +19     
- Misses         369      375       +6     
| Files | Coverage Δ |
|---|---|
| smartsim/_core/config/config.py | 98.75% <100.00%> (+0.03%) ⬆️ |
| smartsim/_core/control/controller.py | 87.35% <100.00%> (+0.47%) ⬆️ |
| smartsim/_core/control/jobmanager.py | 94.19% <100.00%> (+0.23%) ⬆️ |
| smartsim/_core/utils/helpers.py | 91.96% <100.00%> (-0.21%) ⬇️ |
| smartsim/_core/utils/redis.py | 82.97% <ø> (-4.53%) ⬇️ |
| smartsim/entity/dbnode.py | 92.10% <100.00%> (+0.14%) ⬆️ |
| smartsim/entity/model.py | 95.67% <100.00%> (+0.08%) ⬆️ |
| smartsim/experiment.py | 81.01% <ø> (-1.27%) ⬇️ |
| smartsim/settings/base.py | 98.31% <ø> (ø) |
| smartsim/status.py | 100.00% <100.00%> (ø) |
... and 1 more

... and 1 file with indirect coverage changes

ashao added 2 commits December 1, 2023 14:49
Much of the need to check the type and value of the nodes
property in QsubBatchSettings is because there are two technically
valid, but not quite equivalent, ways of setting the number of
nodes. Now, we check at various points that 'select' and
'nodes' are not both set. Additionally, both routes can be used to set
the internal _nodes property if it needs to be accessed within
Python.
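
As a rough sketch of the check described in this commit message (not the actual QsubBatchSettings code), the function below rejects a resource dictionary that sets both 'select' and 'nodes'. The name `check_node_spec` is hypothetical, and reading the node count from the leading field of a simple select statement such as "2:ncpus=4" is an assumption made for illustration.

```python
from typing import Dict, Optional, Union

ResourceValue = Union[str, int]


def check_node_spec(resources: Dict[str, ResourceValue]) -> Optional[int]:
    """Return the requested node count, or None if neither key is present.

    Hypothetical helper for illustration; not part of SmartSim.
    """
    has_nodes = "nodes" in resources
    has_select = "select" in resources
    if has_nodes and has_select:
        # Fail early instead of silently preferring one of the two routes
        raise ValueError("Specify either 'select' or 'nodes', not both")
    if has_nodes:
        return int(resources["nodes"])
    if has_select:
        # Assumption: a simple select statement such as "2:ncpus=4",
        # where the leading field is the count we want
        return int(str(resources["select"]).split(":")[0])
    return None
```

For example, `check_node_spec({"nodes": 3})` and `check_node_spec({"select": "2:ncpus=4"})` each yield a count, while a dictionary containing both keys raises a `ValueError`.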
Member

@MattToast MattToast left a comment

A couple of small changes to consider in the main code base while we work out the merge conflicts! LMK what you think!!

ashao and others added 5 commits December 7, 2023 19:02
Further refactors the way that QsubBatchSettings is used and
accessed to streamline the logic and make it fail faster if users
attempt to set the number of nodes in multiple different ways
Collaborator

@al-rigazzi al-rigazzi left a comment

I left a couple of comments based on what mypy/pylint spit out as errors

self._ncpus = ncpus
self._resources = resources or {}

resource_nodes = self.resources.get("nodes", None)
Collaborator

I think this might need to be resource_nodes = self._resources.get("nodes", None)

Collaborator Author

I think the assignment on line 74 is actually wrong (or rather, it doesn't go through the new setter method)

if num_nodes:
    self._nodes = int(num_nodes)
    self.set_resource("nodes", num_nodes)
Collaborator

I think casting num_nodes to str would make the resource dict easier to handle, as it would be of type dict[str, t.Optional[str]] instead of having to add int as another value type.

Collaborator Author

Hrmm, I would agree if num_nodes were the only int option, but I think there are other fields like ncpus, ngpu, etc. that should be int, and those can get converted to a string.
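
A small sketch of the trade-off discussed in this thread, under the assumption that values are stored as str or int and only converted to text when the resource list is rendered. `ResourceDict` and `format_resources` are hypothetical names, and the `key=value` rendering is illustrative rather than the string SmartSim actually emits.

```python
from typing import Dict, List, Union

# Values stay as int where that is the natural type (nodes, ncpus, ...)
ResourceDict = Dict[str, Union[str, int]]


def format_resources(resources: ResourceDict) -> List[str]:
    """Render each resource as 'key=value', converting ints at the last moment.

    Hypothetical helper for illustration; not part of SmartSim.
    """
    return [f"{key}={value}" for key, value in resources.items()]


opts = format_resources({"nodes": 2, "ncpus": 36, "walltime": "01:00:00"})
print(opts)
```

Keeping the ints in the dictionary means Python code can still do arithmetic or comparisons on them, at the cost of a wider value type than `dict[str, t.Optional[str]]`.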

Member

@MattToast MattToast left a comment

A couple of small fun corner cases for QsubBatchSettings for you to consider while I take a look over some of the more substantive changes in test suite. Otherwise this is looking about ready to go!

al-rigazzi and others added 2 commits December 8, 2023 12:21
We now check a number of extra cases that the user can follow
when trying to specify resources. These include:
- Validating the addition of a resource to the dictionary prior
  to assignment
- Validating the types of keys and their values to be str or int
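
The two checks listed above could look roughly like the sketch below. `ResourceSettings` is a hypothetical stand-in and not the QsubBatchSettings class from this PR; it only illustrates validating a candidate dictionary before committing the assignment, and restricting keys to str and values to str or int.

```python
from typing import Dict, Union

ResourceValue = Union[str, int]


class ResourceSettings:
    """Hypothetical stand-in; not the SmartSim QsubBatchSettings class."""

    def __init__(self) -> None:
        self._resources: Dict[str, ResourceValue] = {}

    @property
    def resources(self) -> Dict[str, ResourceValue]:
        # Hand back a copy so callers cannot bypass validation
        return dict(self._resources)

    def set_resource(self, name: str, value: ResourceValue) -> None:
        # Validate a candidate copy first so a bad value is never stored
        candidate = dict(self._resources)
        candidate[name] = value
        self._validate(candidate)
        self._resources = candidate

    @staticmethod
    def _validate(resources: Dict[str, ResourceValue]) -> None:
        for key, val in resources.items():
            if not isinstance(key, str):
                raise TypeError(
                    f"Resource name must be str, got {type(key).__name__}"
                )
            if not isinstance(val, (str, int)):
                raise TypeError(
                    f"Resource {key!r} must be str or int, got {type(val).__name__}"
                )


settings = ResourceSettings()
settings.set_resource("nodes", 2)
try:
    settings.set_resource("ncpus", 1.5)  # rejected before it is stored
except TypeError as err:
    print(err)
```

Since validation runs on a copy, the failed call leaves the stored dictionary exactly as it was before the call.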
Member

@MattToast MattToast left a comment

Couple of last minute stylistic changes to consider while I run the test suite, but it looks about ready to go on my end!

Member

@MattToast MattToast left a comment

LGTM pending tests!!

@ashao ashao merged commit 1b92adf into CrayLabs:develop Dec 11, 2023
MattToast added a commit that referenced this pull request Dec 15, 2023
Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426.

[ committed by @MattToast ]
[ reviewed by @ashao ]
Labels
- API break (Issues that include incompatible API changes)
- area: CI/CD (Issues related to continuous integration and deployment)
- area: orchestrator (Issues related to the Orchestrator API, launch, and runtime)
- area: test (Issues related to the test suite)
- type: refactor (Issues focused on refactoring existing code)
3 participants