use an agent-id rather than the process PID #24968
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -56,58 +56,65 @@ void AgentManager::StartAgent() { | |||||
return; | ||||||
} | ||||||
I moved the stanza below to after adding the new arguments so that it prints out everything. |
||||||
|
||||||
if (RAY_LOG_ENABLED(DEBUG)) { | ||||||
std::stringstream stream; | ||||||
stream << "Starting agent process with command:"; | ||||||
for (const auto &arg : options_.agent_commands) { | ||||||
stream << " " << arg; | ||||||
} | ||||||
RAY_LOG(DEBUG) << stream.str(); | ||||||
} | ||||||
|
||||||
// Launch the process to create the agent. | ||||||
std::error_code ec; | ||||||
// Create a random agent_id to pass to the child process | ||||||
Suggested change
|
||||||
int agent_id = rand(); | ||||||
const std::string agent_id_str = std::to_string(agent_id); | ||||||
std::vector<const char *> argv; | ||||||
for (const std::string &arg : options_.agent_commands) { | ||||||
argv.push_back(arg.c_str()); | ||||||
} | ||||||
argv.push_back("--agent-id"); | ||||||
argv.push_back(agent_id_str.c_str()); | ||||||
|
||||||
// Disable metrics report if needed. | ||||||
if (!RayConfig::instance().enable_metrics_collection()) { | ||||||
argv.push_back("--disable-metrics-collection"); | ||||||
} | ||||||
argv.push_back(NULL); | ||||||
|
||||||
if (RAY_LOG_ENABLED(DEBUG)) { | ||||||
std::stringstream stream; | ||||||
stream << "Starting agent process with command:"; | ||||||
for (const auto &arg : argv) { | ||||||
stream << " " << arg; | ||||||
} | ||||||
RAY_LOG(DEBUG) << stream.str(); | ||||||
} | ||||||
|
||||||
// Set node id to agent. | ||||||
ProcessEnvironment env; | ||||||
env.insert({"RAY_NODE_ID", options_.node_id.Hex()}); | ||||||
env.insert({"RAY_RAYLET_PID", std::to_string(getpid())}); | ||||||
|
||||||
// Launch the process to create the agent. | ||||||
std::error_code ec; | ||||||
Process child(argv.data(), nullptr, ec, false, env); | ||||||
if (!child.IsValid() || ec) { | ||||||
// The worker failed to start. This is a fatal error. | ||||||
RAY_LOG(FATAL) << "Failed to start agent with return value " << ec << ": " | ||||||
<< ec.message(); | ||||||
} | ||||||
|
||||||
std::thread monitor_thread([this, child]() mutable { | ||||||
std::thread monitor_thread([this, child, agent_id]() mutable { | ||||||
SetThreadName("agent.monitor"); | ||||||
RAY_LOG(INFO) << "Monitor agent process with pid " << child.GetId() | ||||||
<< ", register timeout " | ||||||
RAY_LOG(INFO) << "Monitor agent process with id " << agent_id << ", register timeout " | ||||||
<< RayConfig::instance().agent_register_timeout_ms() << "ms."; | ||||||
auto timer = delay_executor_( | ||||||
[this, child]() mutable { | ||||||
if (agent_pid_ != child.GetId()) { | ||||||
RAY_LOG(WARNING) << "Agent process with pid " << child.GetId() | ||||||
<< " has not registered. ip " << agent_ip_address_ | ||||||
<< ", pid " << agent_pid_; | ||||||
[this, child, agent_id]() mutable { | ||||||
if (agent_pid_ != agent_id) { | ||||||
Instead of agent_pid_, can we rename it to agent_id_, since it is not a PID anymore?
Done, and improved the error message. |
||||||
RAY_LOG(WARNING) << "Agent process with id " << agent_id | ||||||
<< " has not registered. ip " << agent_ip_address_ << ", id " | ||||||
<< agent_pid_; | ||||||
child.Kill(); | ||||||
} | ||||||
}, | ||||||
RayConfig::instance().agent_register_timeout_ms()); | ||||||
|
||||||
int exit_code = child.Wait(); | ||||||
timer->cancel(); | ||||||
RAY_LOG(WARNING) << "Agent process with pid " << child.GetId() | ||||||
<< " exit, return value " << exit_code << ". ip " | ||||||
<< agent_ip_address_ << ". pid " << agent_pid_; | ||||||
RAY_LOG(WARNING) << "Agent process with id " << agent_id << " exit, return value " | ||||||
<< exit_code << ". ip " << agent_ip_address_ << ". pid " | ||||||
<< agent_pid_; | ||||||
RAY_LOG(ERROR) | ||||||
<< "The raylet exited immediately because the Ray agent failed. " | ||||||
"The raylet fate shares with the agent. This can happen because the " | ||||||
|
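A note on the hunk above: the relocated debug stanza iterates over `argv` after `argv.push_back(NULL)`, so the loop also streams the terminating null pointer, and streaming a null `const char *` into a `std::ostream` is undefined behavior. Separately, `rand()` without a prior `srand()` call behaves as if seeded with 1, so the generated id would repeat across runs unless something else seeds it. The following is a minimal sketch of one way to address both points. The helper names `GenerateAgentId`, `BuildArgv`, and `FormatCommand` are hypothetical and are not part of this PR or of Ray.

```cpp
#include <limits>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: draw a non-negative id from a seeded engine instead of
// calling rand(), which repeats its sequence when srand() was never called.
int GenerateAgentId() {
  thread_local std::mt19937 engine{std::random_device{}()};
  std::uniform_int_distribution<int> dist(0, std::numeric_limits<int>::max());
  return dist(engine);
}

// Hypothetical helper: build a null-terminated argv view over `args`.
// The returned pointers are only valid while `args` is alive and unmodified.
std::vector<const char *> BuildArgv(const std::vector<std::string> &args) {
  std::vector<const char *> argv;
  argv.reserve(args.size() + 1);
  for (const std::string &arg : args) {
    argv.push_back(arg.c_str());
  }
  argv.push_back(nullptr);
  return argv;
}

// Hypothetical helper: format the command for logging, stopping at the
// null terminator rather than streaming it.
std::string FormatCommand(const std::vector<const char *> &argv) {
  std::ostringstream stream;
  stream << "Starting agent process with command:";
  for (const char *arg : argv) {
    if (arg == nullptr) {
      break;
    }
    stream << " " << arg;
  }
  return stream.str();
}
```

In the context of the diff, the string returned by `FormatCommand(argv)` could be passed to `RAY_LOG(DEBUG)` in place of the inline loop. An alternative with the same effect would be to emit the debug log before pushing the terminator.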
I think we don't need the default id as pid in this case. Can you remove it?
Refactored to use a required keyword-only `*` signature.
Hmm I might be missing something here, but it still seems to receive os.getpid() as default?
Sorry, check now :)