test: Improve L0_logging stability #7486

Merged 3 commits, Jul 31, 2024

24 changes: 19 additions & 5 deletions qa/L0_logging/log_format_test.py
@@ -405,6 +405,9 @@ def _launch_server(self, escaped=None):
         if not os.path.exists(self._server_options["log-file"]):
             raise Exception("Log not found")

+        # Give server a little time to have the endpoints up and ready
+        time.sleep(10)
+
     def _validate_log_record(self, record, format_regex, escaped):
         match = format_regex.search(record)
         assert match, "Invalid log line"
@@ -483,18 +486,29 @@ def test_injection(self, log_format, format_regex, injected_record):
             # TODO Refactor server launch, shutdown into reusable class
             wait_time = 10

-            while wait_time and not triton_client.is_server_ready():
+            while wait_time:
+                try:
+                    if triton_client.is_server_ready():
+                        break
+                # Gracefully handle connection error if server endpoint isn't up yet
+                except Exception as e:
+                    print(
+                        f"Client failed to connect, retries remaining: {wait_time}. Error: {e}"
+                    )
+
                 time.sleep(1)
                 wait_time -= 1
+                print(f"Server not ready yet, retries remaining: {wait_time}")

             while wait_time and not triton_client.is_model_ready("simple"):
                 time.sleep(1)
                 wait_time -= 1

-            if not triton_client.is_server_ready() or not triton_client.is_model_ready(
-                "simple"
-            ):
-                raise Exception("Model or Server not Ready")
+            if not triton_client.is_server_ready():
+                raise Exception("Server not Ready")
+
+            if not triton_client.is_model_ready("simple"):
+                raise Exception("Model not Ready")

         except Exception as e:
             self._shutdown_server()
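
The TODO in this hunk mentions refactoring server launch and shutdown into something reusable. One possible step in that direction is to pull the retry loop into a small helper. The sketch below is not part of this PR: `wait_until_ready`, its parameters, and the injectable `sleep` argument are hypothetical names chosen for illustration. It mirrors the loop's behavior of treating a connection error as "not ready yet".

```python
import time


def wait_until_ready(check, retries=10, delay=1.0, sleep=time.sleep):
    """Poll check() until it returns True or the retries run out.

    Hypothetical helper, not from the PR. Any exception raised by check()
    (for example a refused connection while the endpoint is still coming
    up) is reported and counted as "not ready yet" instead of aborting.
    """
    for remaining in range(retries, 0, -1):
        try:
            if check():
                return True
        except Exception as e:
            print(f"Check failed, retries remaining: {remaining}. Error: {e}")
        sleep(delay)
    return False
```

Under these assumptions the test body could call `wait_until_ready(triton_client.is_server_ready)` and then `wait_until_ready(lambda: triton_client.is_model_ready("simple"))`, raising a separate error for each. Unlike the loop in the diff, each call here gets its own retry budget rather than sharing a single `wait_time` counter.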
7 changes: 6 additions & 1 deletion qa/L0_logging/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -60,6 +60,11 @@ source ../common/util.sh
 rm -f *.log
 rm -fr $MODELSDIR && mkdir -p $MODELSDIR

+if [ ! -d ${DATADIR} ]; then

Review comment (Contributor):
Is this check added just to let us know if the model repository in the dlcluster is not set up correctly, or are there other cases that we want to catch?

Reply (Contributor Author, @rmccorm4, Jul 31, 2024):
I ran the test locally with a nonexistent datadir. The test still ran and failed much later with obscure errors, when the actual problem was that one of the models was missing. So this is just to catch low-hanging fruit for future debuggers.

+    echo -e "\n***\n*** ${DATADIR} does not exist!\n***"
+    exit 1
+fi
+
 # set up simple repository MODELBASE
 rm -fr $MODELSDIR && mkdir -p $MODELSDIR && \
     cp -r $DATADIR/$MODELBASE $MODELSDIR/simple && \
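
The fail-fast reasoning in the review reply applies equally to setup done on the Python side of a test. As a minimal sketch, assuming a hypothetical helper named `require_dirs` (not part of this PR or of the Triton QA utilities), the same guard could look like this:

```python
import os


def require_dirs(*paths):
    """Raise immediately, naming every required directory that is missing.

    Hypothetical helper for illustration only. Failing here produces a
    clear message instead of an obscure error much later in the test.
    """
    missing = [p for p in paths if not os.path.isdir(p)]
    if missing:
        raise FileNotFoundError(f"Required directories do not exist: {missing}")
```

A test could call `require_dirs(os.environ["DATADIR"])` before copying any models, assuming `DATADIR` is exported to the Python process.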