Skip to content

Commit

Permalink
Application executes before colocated Orchestrator is created (#522)
Browse files Browse the repository at this point in the history
Colo Orchestrator launch moved to a blocking process. Application executes
once Orchestrator is built.

[ reviewed by @MattToast @mellis13 @ashao @al-rigazzi ]
[ committed by @amandarichardsonn ]
  • Loading branch information
amandarichardsonn authored Mar 21, 2024
1 parent d1dfac8 commit 06d6166
Show file tree
Hide file tree
Showing 3 changed files with 29 additions and 12 deletions.
7 changes: 7 additions & 0 deletions doc/changelog.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ To be released at some future point in time

Description

- Colo Orchestrator setup now blocks application start until setup finished.
- ExecArgs handling correction
- ReadTheDocs config file added and enabled on PRs
- Enforce changelog updates
Expand All @@ -30,6 +31,11 @@ Description

Detailed Notes

- The request to the colocated entrypoints file within the shell script
is now a blocking process. Once the Orchestrator is setup, it returns
which moves the process to the background and allows the application to
start. This prevents the application from requesting a ML model or
script that has not been uploaded to the Orchestrator yet. (SmartSim-PR522_)
- Add checks and tests to ensure SmartSim users cannot initialize run settings
with a list of lists as the exe_args argument. (SmartSim-PR517_)
- Add readthedocs configuration file and enable readthedocs builds
Expand All @@ -55,6 +61,7 @@ Detailed Notes
Slurm and Open MPI. (SmartSim-PR520_)


.. _SmartSim-PR522: https://github.com/CrayLabs/SmartSim/pull/522
.. _SmartSim-PR524: https://github.com/CrayLabs/SmartSim/pull/524
.. _SmartSim-PR520: https://github.com/CrayLabs/SmartSim/pull/520
.. _SmartSim-PR518: https://github.com/CrayLabs/SmartSim/pull/518
Expand Down
19 changes: 13 additions & 6 deletions smartsim/_core/entrypoints/colocated.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@
import tempfile
import typing as t
from pathlib import Path
from subprocess import PIPE, STDOUT
from subprocess import STDOUT
from types import FrameType

import filelock
Expand Down Expand Up @@ -177,6 +177,7 @@ def main(
db_scripts: t.List[t.List[str]],
db_identifier: str,
) -> None:
# pylint: disable=too-many-statements
global DBPID # pylint: disable=global-statement

lo_address = current_ip("lo")
Expand All @@ -201,8 +202,17 @@ def main(
# we generally want to catch all exceptions here as
# if this process dies, the application will most likely fail
try:
process = psutil.Popen(cmd, stdout=PIPE, stderr=STDOUT)
DBPID = process.pid
hostname = socket.gethostname()
filename = (
f"colo_orc_{hostname}.log"
if os.getenv("SMARTSIM_LOG_LEVEL") == "debug"
else os.devnull
)
with open(filename, "w", encoding="utf-8") as file:
process = psutil.Popen(cmd, stdout=file.fileno(), stderr=STDOUT)
DBPID = process.pid
# printing to stdout shell file for extraction
print(f"__PID__{DBPID}__PID__", flush=True)

except Exception as e:
cleanup()
Expand Down Expand Up @@ -249,9 +259,6 @@ def launch_db_scripts(client: Client, db_scripts: t.List[t.List[str]]) -> None:
# Make sure we don't keep this around
del client

for line in iter(process.stdout.readline, b""):
print(line.decode("utf-8").rstrip(), flush=True)

except Exception as e:
cleanup()
logger.error(f"Colocated database process failed: {str(e)}")
Expand Down
15 changes: 9 additions & 6 deletions smartsim/_core/launcher/colocated.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,9 +67,14 @@ def write_colocated_launch_script(
# STDOUT of the job
if colocated_settings["debug"]:
script_file.write("export SMARTSIM_LOG_LEVEL=debug\n")

script_file.write(f"{colocated_cmd}\n")
script_file.write("DBPID=$!\n\n")
script_file.write(f"db_stdout=$({colocated_cmd})\n")
# extract and set DBPID within the shell script that is
# enclosed between __PID__ and sent to stdout by the colocated
# entrypoints file
script_file.write(
"DBPID=$(echo $db_stdout | sed -n "
"'s/.*__PID__\\([0-9]*\\)__PID__.*/\\1/p')\n"
)

# Write the actual launch command for the app
script_file.write("$@\n\n")
Expand Down Expand Up @@ -190,10 +195,8 @@ def _build_colocated_wrapper_cmd(
db_script_cmd = _build_db_script_cmd(db_scripts)
db_cmd.extend(db_script_cmd)

# run colocated db in the background
db_cmd.append("&")

cmd.extend(db_cmd)

return " ".join(cmd)


Expand Down

0 comments on commit 06d6166

Please sign in to comment.