-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve testing robustness on SLURM machines #381
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! A couple of knit-picky changes requested, but otherwise looks about ready to go!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending tests!!
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## develop #381 +/- ##
===========================================
- Coverage 90.38% 90.29% -0.10%
===========================================
Files 60 60
Lines 3839 3864 +25
===========================================
+ Hits 3470 3489 +19
- Misses 369 375 +6
|
…_create_standard_twice
…cked GPU" This reverts commit 762db80.
Much of the need to check the type and value of the the nodes property in QsubBatchSettings is because there are two technically valid, but not quite equivalent ways of setting the number of nodes. Now, we check at various points that the both 'select' and 'nodes' is not set. Additionally, both routes can be used to set the internal _nodes property if it needs to be accessed within Python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small changes to consider in the main code base while we work out the merge conflicts! LMK what you think!!
Further refactors the way that QsubBatchSettings is used and accessed to streamline the logic and make it fail faster if users attempt to set the number of nodes in multiple different ways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of comments based on what mypy
/pylint
spit out as errors
self._ncpus = ncpus | ||
self._resources = resources or {} | ||
|
||
resource_nodes = self.resources.get("nodes", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might need to be resource_nodes = self._resources.get("nodes", None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the assignment on 74 is actually wrong (or rather it doesn't go through the new setter method)
if num_nodes: | ||
self._nodes = int(num_nodes) | ||
self.set_resource("nodes", num_nodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think casting num_nodes
to str
would make the resource dict easier to handle, as it would be of type dict[str, t.Optional[str]]
instead of having to add int
as another value type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, I would agree if num_nodes
was the only int
option, but I think there's other fields like ncpus
, ngpu
, etc. that should be int, which can get converted to a string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small fun corner cases for QsubBatchSettings
for you to consider while I take a look over some of the more substantive changes in test suite. Otherwise this is looking about ready to go!
We now check a number of extra cases that the user can follow when trying to specify resources. These include: - Validating prior to assignment, the additio of a resource to the dictionary - Validating the types of keys and their values to be str or int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of last minute stylistic changes to consider while I run the test suite, but it looks about ready to go on my end!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending tests!!
Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426. [ committed by @MattToast ] [ reviewed by @ashao ]
Some aspects of the testing were failing on SLURM machines due to (1) inconsistent assumptions about how the tasks should be run and (2 suspected hidden race conditions potentially related to the filesystem that caused non-idempotent behavior when running the tests. These defects were ameliorated by ensuring that the failing tests used only a single task and creating a separate run directory for every variant of the test.