-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][3/3] Use the new standalone runtime env http server. #37585
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished reviewing cpp parts. I will review the python part tmrw.
Q: I don't quite understand the reasoning behind "we don't need registration anymore". Can you tell me why we had it before and why we don't have it anymore?
/// The agent manager RPC service. | ||
std::unique_ptr<rpc::AgentManagerServiceHandler> agent_manager_service_handler_; | ||
rpc::AgentManagerGrpcService agent_manager_service_; | ||
/// Wrapper client for RuntimeEnvManager. Always non-null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we make it a none-pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to pass it to WorkerPool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't have to be a pointer to pass by argument?
@property | ||
def runtime_env_agent_address(self): | ||
"""Get the address that exposes runtime env agent as http""" | ||
return f"http://{self._raylet_ip_address}:{self._runtime_env_agent_port}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use node_ip_address or raylet_ip_address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to use raylet_ip_address because the agent is started by raylet and always co-locates with raylet. However there are some cases I don't have a raylet address so I falled back to node_ip address.
Also, would you mind sharing some cases where node_ip_address != raylet_address?
@@ -1092,7 +1108,7 @@ def start_ray_client_server(self): | |||
stderr_file=stderr_file, | |||
redis_password=self._ray_params.redis_password, | |||
fate_share=self.kernel_fate_share, | |||
metrics_agent_port=self._ray_params.metrics_agent_port, | |||
runtime_env_agent_address=self.runtime_env_agent_address, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to pass in address instead of port as before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's used in proxier.py from client server. Previously it uses 127.0.0.1
Putting the raylet address in makes it clearer who it's talking to. I don't really have an idea of which raylet the client would like to connect to, so I tried to make things explicit. If you like to revert it back to 127.0.0.1 I can do that.
@ckw017 could you review the ray client changes? |
Previously, raylet starts a dashboard_agent. Then dashboard_agent calls "registration" RPC to raylet to tell its own grpc listen port. Then raylet talks to the runtime env agent in dashboard_agent via that port. The registration is needed primarily to share this port (along with the agent_id and address but raylet already knows these). Now, I let |
Ray Client server changes seem fine to me (assuming tests are still passing), but afaik the runtime env parts were added by Ed, so maybe check with him instead. |
4cba2fa
to
bc23447
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One additional questions regarding the port assignment. Is it possible to choose it inside the agent? (or is it required by the client server, so we should assign it before the agent starts?) I think we should follow up with this issue. Agent port issue causes problems for a long time, and introducing a new agent will double the probability of this issue.
Now, I let services.py to pick a free port for runtime env http agent, and passes this as an arg to raylet, so raylet knows the port at start. Raylet starts a runtime env http agent with the port, so the http agent use it. There's no registration needed since raylet knows everything (address + port) even before http agent is started.
I think particularly we should avoid doing this (maybe we should discuss offline for the solution, and we have to handle this before 2.7).
), | ||
os.path.join(RAY_PATH, "_private", "runtime_env", "agent", "main.py"), | ||
f"--node-ip-address={node_ip_address}", | ||
f"--runtime-env-agent-port={runtime_env_agent_port}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the runtime env agent port decided within the runtime env agent, or chosen before it is started? We had long term issue where agent port conflict causes issues because we choose it before agent is started (and it is due to some limitation we haven't fixed). It'd be the best we avoid the same issue from runtime env agent (the probability of port conflict will double.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR it's decided before the agent starts, by services.py
. Per offline talk, I will investigate how to assign all ports in services.py
in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Can you address the comments before merging it?
Also, have you run the release test in this PR or a separate one?
Updated PR.
Yes, but I will do another round in the latest commit. |
…ated Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
…er.py Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
7a7cc62
to
bf3b73f
Compare
Attention: External code changedThis PR changes code that is used or cited in external sources, e.g. blog posts. Before merging this PR, please make sure that the code in the external sources is still working, and consider updating them to reflect the changes. The affected files and the external sources are:
|
bf3b73f
to
39059a4
Compare
list of failed unit tests now:
looks like there's no newly introduced failures. |
…ect#37585) Rewrites agent_manager.cc. Removed its ability to do agent registration (no longer needs registration) and proxying runtime env agent (moved to the runtime_env_agent_client.cc). It will only do agent starting but we will have 2 instances in node_manager starting a dashboard agent and a runtime env agent. Deletes the runtime env agent python code from dashboard agent. Deletes the agent registration grpc interface, and the runtime env agent interface. Starts the standalone runtime env http server in services.py. Adds the extra port for the server everywhere: in services.py, node.py and gcs.proto. added 1 more port to Node info: runtime_env_agent_port. Intended to be used with raylet_address, but in some cases (1 test IIRC) we don't have one and it'll be used with node_address updated all related tests. Most tests used to use dashboard agent's port, now they use runtime env agent's port. Signed-off-by: NripeshN <[email protected]>
…ect#37585) Rewrites agent_manager.cc. Removed its ability to do agent registration (no longer needs registration) and proxying runtime env agent (moved to the runtime_env_agent_client.cc). It will only do agent starting but we will have 2 instances in node_manager starting a dashboard agent and a runtime env agent. Deletes the runtime env agent python code from dashboard agent. Deletes the agent registration grpc interface, and the runtime env agent interface. Starts the standalone runtime env http server in services.py. Adds the extra port for the server everywhere: in services.py, node.py and gcs.proto. added 1 more port to Node info: runtime_env_agent_port. Intended to be used with raylet_address, but in some cases (1 test IIRC) we don't have one and it'll be used with node_address updated all related tests. Most tests used to use dashboard agent's port, now they use runtime env agent's port. Signed-off-by: e428265 <[email protected]>
…ect#37585) Rewrites agent_manager.cc. Removed its ability to do agent registration (no longer needs registration) and proxying runtime env agent (moved to the runtime_env_agent_client.cc). It will only do agent starting but we will have 2 instances in node_manager starting a dashboard agent and a runtime env agent. Deletes the runtime env agent python code from dashboard agent. Deletes the agent registration grpc interface, and the runtime env agent interface. Starts the standalone runtime env http server in services.py. Adds the extra port for the server everywhere: in services.py, node.py and gcs.proto. added 1 more port to Node info: runtime_env_agent_port. Intended to be used with raylet_address, but in some cases (1 test IIRC) we don't have one and it'll be used with node_address updated all related tests. Most tests used to use dashboard agent's port, now they use runtime env agent's port. Signed-off-by: Victor <[email protected]>
This is the final patch in the series. Changes:
runtime_env_agent_client.cc
). It will only do agent starting but we will have 2 instances in node_manager starting a dashboard agent and a runtime env agent.services.py
.services.py
,node.py
andgcs.proto
.Related issue number
Part of issue #35472.
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.