Improve testing robustness on SLURM machines #381

ashao · 2023-10-03T15:22:17Z

Some aspects of the testing were failing on SLURM machines due to (1) inconsistent assumptions about how the tasks should be run and (2) suspected hidden race conditions, potentially related to the filesystem, that caused non-idempotent behavior when running the tests. These defects were ameliorated by ensuring that the failing tests used only a single task and by creating a separate run directory for every variant of the test.
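As a rough illustration of the "separate run directory for every variant" idea, here is a minimal standard-library sketch. The helper name `make_variant_dir`, the variant strings, and the directory layout are assumptions made for this example; they are not the fixtures added in this PR. In a pytest suite the variant string could plausibly come from `request.node.name`, which includes the parametrization id.

```python
import re
import tempfile
from pathlib import Path


def make_variant_dir(base: Path, test_name: str, variant: str) -> Path:
    """Create and return a run directory unique to one test variant.

    Hypothetical helper for illustration only.
    """
    # Replace characters that are awkward in paths (such as the brackets
    # and '=' signs found in parametrized test ids) with underscores.
    safe_variant = re.sub(r"[^A-Za-z0-9_.-]+", "_", variant).strip("_") or "default"
    run_dir = base / test_name / safe_variant
    # exist_ok=False makes accidental reuse (or a name collision after
    # sanitizing) fail loudly instead of silently sharing files.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # Variant names below are made up for the example.
        for variant in ("db_type=uds", "db_type=tcp"):
            print(make_variant_dir(base, "test_colocated_model", variant))
```

Because every variant writes into its own directory, leftover files from one variant cannot change the outcome of another, which is one way to rule out the suspected filesystem-related interference.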

MattToast

Looks great! A couple of nit-picky changes requested, but otherwise looks about ready to go!!

conftest.py

tests/on_wlm/test_colocated_model.py

conftest.py

MattToast

LGTM, pending tests!!


codecov · 2023-10-16T22:05:28Z

Codecov Report

Merging #381 (0663f5d) into develop (d8fba1b) will decrease coverage by 0.10%.
The diff coverage is 98.36%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #381      +/-   ##
===========================================
- Coverage    90.38%   90.29%   -0.10%     
===========================================
  Files           60       60              
  Lines         3839     3864      +25     
===========================================
+ Hits          3470     3489      +19     
- Misses         369      375       +6

Files	Coverage Δ
smartsim/_core/config/config.py	`98.75% <100.00%> (+0.03%)`	⬆️
smartsim/_core/control/controller.py	`87.35% <100.00%> (+0.47%)`	⬆️
smartsim/_core/control/jobmanager.py	`94.19% <100.00%> (+0.23%)`	⬆️
smartsim/_core/utils/helpers.py	`91.96% <100.00%> (-0.21%)`	⬇️
smartsim/_core/utils/redis.py	`82.97% <ø> (-4.53%)`	⬇️
smartsim/entity/dbnode.py	`92.10% <100.00%> (+0.14%)`	⬆️
smartsim/entity/model.py	`95.67% <100.00%> (+0.08%)`	⬆️
smartsim/experiment.py	`81.01% <ø> (-1.27%)`	⬇️
smartsim/settings/base.py	`98.31% <ø> (ø)`
smartsim/status.py	`100.00% <100.00%> (ø)`
... and 1 more

... and 1 file with indirect coverage changes


Much of the need to check the type and value of the nodes property in QsubBatchSettings arises because there are two technically valid, but not quite equivalent, ways of setting the number of nodes. Now, we check at various points that 'select' and 'nodes' are not both set. Additionally, both routes can be used to set the internal _nodes property if it needs to be accessed within Python.
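The mutual-exclusion check described above could look roughly like the following sketch. The function names, the error class, and the assumption that both values are plain integers (or integer strings) are hypothetical; this is not the code in `smartsim/settings/pbsSettings.py`.

```python
import typing as t


class ConflictingNodeSpecError(ValueError):
    """Hypothetical error type used only in this sketch."""


def check_node_spec(resources: t.Mapping[str, t.Union[str, int]]) -> None:
    """Raise if the node count is requested through both routes at once."""
    if "select" in resources and "nodes" in resources:
        raise ConflictingNodeSpecError(
            "Set the number of nodes with either 'select' or 'nodes', not both"
        )


def node_count(resources: t.Mapping[str, t.Union[str, int]]) -> t.Optional[int]:
    """Return the requested node count from whichever route was used.

    Assumes the value is a plain integer or integer string; richer
    'select' expressions are out of scope for this example.
    """
    check_node_spec(resources)
    value = resources.get("nodes", resources.get("select"))
    return None if value is None else int(value)
```

Running the check in every place where resources are set or read is what makes a conflicting specification fail early rather than at job submission.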

MattToast

A couple of small changes to consider in the main code base while we work out the merge conflicts! LMK what you think!!

smartsim/settings/pbsSettings.py

.gitignore

smartsim/database/orchestrator.py

smartsim/settings/pbsSettings.py

tests/backends/test_dataloader.py

smartsim/settings/base.py

Further refactors the way that QsubBatchSettings is used and accessed to streamline the logic and make it fail faster if users attempt to set the number of nodes in multiple different ways

…x_tests

al-rigazzi

I left a couple of comments based on what mypy/pylint spit out as errors

al-rigazzi · 2023-12-08T14:48:09Z

smartsim/settings/pbsSettings.py

        self._ncpus = ncpus
+        self._resources = resources or {}
+
+        resource_nodes = self.resources.get("nodes", None)


I think this might need to be resource_nodes = self._resources.get("nodes", None)

I think the assignment on line 74 is actually wrong (or rather it doesn't go through the new setter method)
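To make the setter point concrete, here is a small sketch of the difference between assigning to the private attribute and assigning through a property. The class `BatchSettingsSketch` and its validation rule are assumptions for illustration, not the actual QsubBatchSettings implementation.

```python
import typing as t


class BatchSettingsSketch:
    """Hypothetical stand-in for a batch settings class."""

    def __init__(
        self, resources: t.Optional[t.Dict[str, t.Union[str, int]]] = None
    ) -> None:
        # Assigning through the public name runs the setter and its checks.
        # Writing `self._resources = ...` here would skip that validation.
        self.resources = resources or {}

    @property
    def resources(self) -> t.Dict[str, t.Union[str, int]]:
        # Return a copy so callers cannot bypass the setter by mutation.
        return dict(self._resources)

    @resources.setter
    def resources(self, value: t.Dict[str, t.Union[str, int]]) -> None:
        if not isinstance(value, dict):
            raise TypeError("resources must be a dict")
        self._resources = dict(value)
```

With this layout, `self.resources.get("nodes", None)` inside `__init__` is safe once the assignment above it has gone through the setter, because the private attribute is guaranteed to exist and to have been validated.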

al-rigazzi · 2023-12-08T14:49:19Z

smartsim/settings/pbsSettings.py

        if num_nodes:
-            self._nodes = int(num_nodes)
+            self.set_resource("nodes", num_nodes)


I think casting num_nodes to str would make the resource dict easier to handle, as it would be of type dict[str, t.Optional[str]] instead of having to add int as another value type.

Hrmm, I would agree if num_nodes was the only int option, but I think there's other fields like ncpus, ngpu, etc. that should be int, which can get converted to a string
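One way to reconcile the two positions is to store values as `str` or `int` and convert to strings only when the resource list is rendered. The sketch below assumes a generic `key=value` output format and treats `None` as a bare key; both are assumptions for this example and may not match what the real settings class emits.

```python
import typing as t

ResourceValue = t.Union[str, int]


def format_resources(
    resources: t.Mapping[str, t.Optional[ResourceValue]]
) -> t.List[str]:
    """Render each resource as a string, converting ints at output time."""
    formatted = []
    for key, value in resources.items():
        # Assumption: a value of None means a bare key with no "=value" part.
        formatted.append(key if value is None else f"{key}={value}")
    return formatted
```

For example, `format_resources({"nodes": 4, "ncpus": "8"})` returns `["nodes=4", "ncpus=8"]`, so callers can pass integers while the rendered output stays string-only.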

smartsim/settings/pbsSettings.py

MattToast

A couple of small fun corner cases for QsubBatchSettings for you to consider while I take a look over some of the more substantive changes in test suite. Otherwise this is looking about ready to go!

smartsim/settings/pbsSettings.py

tests/test_colo_model_local.py

We now check a number of extra cases that the user can follow when trying to specify resources. These include:
- Validating the addition of a resource to the dictionary prior to assignment
- Validating that the types of keys and their values are str or int
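A validate-before-assign pattern along these lines might look like the sketch below. The function names are hypothetical, the rule "keys are str, values are str or int" is one reading of the commit message, and rejecting `bool` is an extra choice made for this example.

```python
import typing as t

ResourceValue = t.Union[str, int]


def validate_resources(resources: t.Mapping[t.Any, t.Any]) -> None:
    """Check key and value types; raise TypeError on the first violation."""
    for key, value in resources.items():
        if not isinstance(key, str):
            raise TypeError(f"Resource name {key!r} must be a str")
        # bool is a subclass of int, so it is rejected explicitly here.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(f"Value for {key!r} must be a str or int")


def with_resource(
    resources: t.Mapping[str, ResourceValue], name: str, value: ResourceValue
) -> t.Dict[str, ResourceValue]:
    """Return a validated copy of `resources` with one entry added."""
    candidate = dict(resources)
    candidate[name] = value
    # Validate the candidate first; the original mapping is never touched,
    # so a failed update cannot leave it in a half-modified state.
    validate_resources(candidate)
    return candidate
```

The caller would then replace its stored dictionary with the returned copy only after validation succeeds.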

MattToast

Couple of last minute stylistic changes to consider while I run the test suite, but it looks about ready to go on my end!

smartsim/settings/pbsSettings.py

MattToast

LGTM pending tests!!

@MattToast

Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426. [ committed by @MattToast ] [ reviewed by @ashao ]

Update tests to pass on_wlm

c48db14

ashao requested a review from MattToast October 3, 2023 15:22

MattToast requested changes Oct 3, 2023

conftest.py Outdated

conftest.py Outdated

tests/on_wlm/test_colocated_model.py Outdated

conftest.py Outdated

MattToast self-requested a review October 4, 2023 20:54

MattToast previously approved these changes Oct 4, 2023

al-rigazzi and others added 5 commits October 12, 2023 19:14

Define make_test_dir and get_test_dir fixtures

7e02516

More permissive naming for caller_function

2f9882b

Style

978e177

Update tests to pass on_wlm

0e25a4a

Respond to review feedback

c6930c0

ashao force-pushed the fix_tests branch from 4086893 to c6930c0 Compare October 16, 2023 21:31

ashao added 2 commits October 16, 2023 17:01

Modify for mpirun with PBS

86e51f8

Merge branch 'fix_tests' of https://github.com/ashao/SmartSim into fi…

cc8c986

…x_tests

al-rigazzi and others added 16 commits October 16, 2023 18:06

Fix db shutdown and some fixtures

89c2008

Update DBModel tests

765c571

Merge branch 'develop' into test-tmp-dir

fde1efe

Begin adding a context manager for orchestrators in multidb cases

ea6bac5

Merge branch 'test-tmp-dir' of https://github.com/al-rigazzi/SmartSim …

662e4c5

…into fix_tests

More multidb tests wokring, stopping at test_multidb.py::test_multidb…

b46ccb9

…_create_standard_twice

Fix start_in_context

30d3fe4

Fix fixture usage

131310f

Fix get_status

dc301bb

Fix mypy issues

380c8ef

Fix a couple of tests

62b79b0

Merge branch 'fix_tests' of https://github.com/ashao/SmartSim into fi…

34ee0c3

…x_tests

tests are passing on PBS

ee39204

Fix one last typo

9f8a623

Make reset_hosts work on LSF

3769c90

Comply to mypy syntax for union

c959e17

al-rigazzi and others added 2 commits November 30, 2023 17:45

Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into f…

c68b2af

…ix_tests

Spawn in TF saving/serializing in a new process to avoid a locked GPU

762db80

ashao force-pushed the fix_tests branch from 7743d47 to 762db80 Compare December 1, 2023 20:49

ashao added 2 commits December 1, 2023 14:49

Revert "Spawn in TF saving/serializing in a new process to avoid a lo…

31520a9

…cked GPU" This reverts commit 762db80.

ashao force-pushed the fix_tests branch from 973af2c to b703bc9 Compare December 7, 2023 01:37

MattToast requested changes Dec 7, 2023

MattToast reviewed Dec 7, 2023

smartsim/settings/base.py Outdated

ashao and others added 5 commits December 7, 2023 19:02

Delete extraneous scripts

48defa2

Refactor QsubBatchSettings resources

b929ba8

Further refactors the way that QsubBatchSettings is used and accessed to streamline the logic and make it fail faster if users attempt to set the number of nodes in multiple different ways

Merge branch 'develop' into fix_tests

a0d8328

Merge branch 'fix_tests' of https://github.com/ashao/SmartSim into fi…

8e0e82f

…x_tests

Merge branch 'develop' into fix_tests

0979ced

al-rigazzi reviewed Dec 8, 2023

MattToast requested changes Dec 8, 2023

al-rigazzi and others added 2 commits December 8, 2023 12:21

Delete misleading comment

2e72604

ashao force-pushed the fix_tests branch from 8ef1375 to 14c420d Compare December 9, 2023 00:09

ashao added 4 commits December 8, 2023 18:15

Fix one use of | insteado t.Union

bd4345e

Fix an incorrect typehint

f2123fa

Yet another | instead of t.Union

5ff7007

Remove extraneous assignment and blackify

f485ad1

MattToast requested changes Dec 9, 2023

Remove now invalid test and update type checking

2824d69

MattToast approved these changes Dec 9, 2023

ashao added 2 commits December 11, 2023 11:57

Fix accidental collision with default value

2358c48

Update behaviour for test_create_pbs_batch

0663f5d

ashao merged commit 1b92adf into CrayLabs:develop Dec 11, 2023

MattToast mentioned this pull request Dec 15, 2023

Test Docs: Update Number of Required Nodes #442

Merged

MattToast added a commit that referenced this pull request Dec 15, 2023

Test Docs: Update Number of Required Nodes (#442)

91224d7

Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426. [ committed by @MattToast ] [ reviewed by @ashao ]
