[core][3/3] Use the new standalone runtime env http server. #37585

Merged
rkooo567 merged 37 commits into ray-project:master from runtime_env_http_final on Aug 3, 2023

Contributor

@rynewang rynewang commented Jul 19, 2023

This is the final patch in the series. Changes:

  • Rewrites agent_manager.cc. It no longer handles agent registration (registration is no longer needed) or proxies requests to the runtime env agent (that logic moved to runtime_env_agent_client.cc). It now only starts agents, and node_manager holds two instances: one starting the dashboard agent and one starting the runtime env agent.
  • Deletes the runtime env agent Python code from the dashboard agent.
  • Deletes the agent registration gRPC interface and the runtime env agent interface.
  • Starts the standalone runtime env HTTP server in services.py.
  • Adds the extra port for the server everywhere: in services.py, node.py and gcs.proto.
  • Adds one more port to the node info: runtime_env_agent_port. It is intended to be used with raylet_address, but in some cases (one test, IIRC) there is no raylet address, so it is used with node_address instead.
  • Updates all related tests. Most tests used to use the dashboard agent's port; now they use the runtime env agent's port.

Related issue number

Part of issue #35472.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@rkooo567 rkooo567 left a comment

Finished reviewing the C++ parts. I will review the Python part tomorrow.

Q: I don't quite understand the reasoning behind "we don't need registration anymore". Can you tell me why we had it before and why we don't have it anymore?

/// The agent manager RPC service.
std::unique_ptr<rpc::AgentManagerServiceHandler> agent_manager_service_handler_;
rpc::AgentManagerGrpcService agent_manager_service_;
/// Wrapper client for RuntimeEnvManager. Always non-null.
Contributor

Why don't we make it a non-pointer?

Contributor Author

We need to pass it to WorkerPool.

Contributor

It doesn't have to be a pointer to be passed as an argument, does it?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 21, 2023
@property
def runtime_env_agent_address(self):
    """Get the address that exposes runtime env agent as http"""
    return f"http://{self._raylet_ip_address}:{self._runtime_env_agent_port}"
Collaborator

Should we use node_ip_address or raylet_ip_address?

Contributor Author

We need to use raylet_ip_address because the agent is started by raylet and is always co-located with raylet. However, there are some cases where I don't have a raylet address, so I fell back to node_ip_address.

Also, would you mind sharing some cases where node_ip_address != raylet_address?
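The fallback described in this reply could be sketched roughly as follows. This is an illustration only, not the code in node.py: the function name `build_runtime_env_agent_address` and its parameters are hypothetical, and only the URL format is taken from the snippet under review.

```python
from typing import Optional


def build_runtime_env_agent_address(
    raylet_ip_address: Optional[str],
    node_ip_address: str,
    runtime_env_agent_port: int,
) -> str:
    """Build the HTTP address of the runtime env agent (hypothetical helper).

    Prefer the raylet IP, because the agent is started by raylet and runs
    next to it. Fall back to the node IP only when no raylet IP is known.
    """
    host = raylet_ip_address or node_ip_address
    return f"http://{host}:{runtime_env_agent_port}"
```

With a raylet IP available the node IP is ignored; with `None` or an empty string the node IP is used instead.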

@@ -1092,7 +1108,7 @@ def start_ray_client_server(self):
stderr_file=stderr_file,
redis_password=self._ray_params.redis_password,
fate_share=self.kernel_fate_share,
metrics_agent_port=self._ray_params.metrics_agent_port,
runtime_env_agent_address=self.runtime_env_agent_address,
Collaborator

Why do we need to pass in the address instead of the port, as before?

Contributor Author

It's used in proxier.py by the client server. Previously it used 127.0.0.1:

https://github.com/ray-project/ray/blob/master/python/ray/util/client/server/proxier.py#L131

Putting the raylet address in makes it clearer who it's talking to. I don't really know which raylet the client would like to connect to, so I tried to make things explicit. If you'd like me to revert it to 127.0.0.1, I can do that.

Collaborator

jjyao commented Jul 21, 2023

@ckw017 could you review the ray client changes?

Contributor Author

Finished reviewing the C++ parts. I will review the Python part tomorrow.

Q: I don't quite understand the reasoning behind "we don't need registration anymore". Can you tell me why we had it before and why we don't have it anymore?

Previously, raylet started a dashboard_agent. The dashboard_agent then called a "registration" RPC on raylet to report its own gRPC listen port, and raylet talked to the runtime env agent inside dashboard_agent via that port. Registration was needed primarily to share this port (it also carried the agent_id and address, but raylet already knows these).

Now, services.py picks a free port for the runtime env HTTP agent and passes it as an argument to raylet, so raylet knows the port at startup. Raylet starts the runtime env HTTP agent with that port, and the agent listens on it. No registration is needed because raylet knows everything (address + port) even before the HTTP agent starts.
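As a rough illustration of this flow, here is a minimal sketch of picking a free port up front and handing it to the agent on its command line. The helper names `pick_free_port` and `build_agent_command` and the bare `main.py` path are assumptions for the example; only the two flags come from this PR's diff, and the actual logic in services.py may differ.

```python
import socket
import sys
from typing import List


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0.

    The probe socket is closed on return, so the port is only known to be
    free at the moment of the call.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_agent_command(node_ip_address: str, runtime_env_agent_port: int) -> List[str]:
    """Assemble the agent command line with the pre-chosen port (hypothetical)."""
    return [
        sys.executable,
        "main.py",  # placeholder path, not the real entry point location
        f"--node-ip-address={node_ip_address}",
        f"--runtime-env-agent-port={runtime_env_agent_port}",
    ]


if __name__ == "__main__":
    port = pick_free_port()
    # Whoever launches the agent already knows address + port, so the agent
    # never has to report them back.
    print(build_agent_command("127.0.0.1", port))
```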

Member

ckw017 commented Jul 21, 2023

@ckw017 could you review the ray client changes?

Ray Client server changes seem fine to me (assuming tests are still passing), but afaik the runtime env parts were added by Ed, so maybe check with him instead.

@rynewang rynewang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 21, 2023
@rynewang rynewang force-pushed the runtime_env_http_final branch from 4cba2fa to bc23447 Compare July 24, 2023 20:06
Contributor

@rkooo567 rkooo567 left a comment

One additional question regarding the port assignment: is it possible to choose the port inside the agent? (Or is it required by the client server, so that we have to assign it before the agent starts?) I think we should follow up on this issue. Agent port conflicts have caused problems for a long time, and introducing a new agent will double the probability of hitting them.

Now, services.py picks a free port for the runtime env HTTP agent and passes it as an argument to raylet, so raylet knows the port at startup. Raylet starts the runtime env HTTP agent with that port, and the agent listens on it. No registration is needed because raylet knows everything (address + port) even before the HTTP agent starts.

In particular, I think we should avoid doing this (maybe we should discuss the solution offline; we have to handle this before 2.7).

),
os.path.join(RAY_PATH, "_private", "runtime_env", "agent", "main.py"),
f"--node-ip-address={node_ip_address}",
f"--runtime-env-agent-port={runtime_env_agent_port}",
Contributor

Is the runtime env agent port decided within the runtime env agent, or chosen before it is started? We have had a long-standing issue where agent port conflicts cause failures because we choose the port before the agent is started (due to a limitation we haven't fixed). It would be best to avoid the same issue with the runtime env agent (the probability of a port conflict will double).

Contributor Author

In this PR it's decided before the agent starts, by services.py. Per our offline discussion, I will investigate how to assign all ports in services.py in a separate PR.
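The port conflict raised in this thread can be reproduced in isolation. The sketch below uses plain sockets and hypothetical function names, and is not Ray code. It shows how a port chosen before the agent starts can be taken by another process in the meantime, and, for contrast, how a server that binds port 0 itself never has that window.

```python
import socket


def probe_free_port(host: str = "127.0.0.1") -> int:
    """Find a free port, then release it (the 'choose before start' pattern)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))
        return probe.getsockname()[1]


def port_conflict_after_probe(host: str = "127.0.0.1") -> bool:
    """Return True if a late bind fails because something else took the port."""
    port = probe_free_port(host)
    # Simulate an unrelated process grabbing the port before the agent starts.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as squatter:
        squatter.bind((host, port))
        squatter.listen()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as agent:
            try:
                agent.bind((host, port))
            except OSError:
                # Address already in use: the pre-chosen port is gone.
                return True
    return False


def bind_inside_agent(host: str = "127.0.0.1") -> socket.socket:
    """Alternative: the server binds port 0 itself and keeps the socket open."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))
    server.listen()
    return server
```

The trade-off is that a server which picks its own port has to report it back to whoever started it, which is the kind of registration step this PR removes.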

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 25, 2023
@rynewang rynewang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 25, 2023
Contributor

@rkooo567 rkooo567 left a comment

LGTM. Can you address the comments before merging it?

Also, have you run the release test in this PR or a separate one?

Contributor Author

LGTM. Can you address the comments before merging it?

Updated PR.

Also, have you run the release test in this PR or a separate one?

Yes, but I will do another round in the latest commit.

rynewang and others added 17 commits August 2, 2023 18:05
@rynewang rynewang force-pushed the runtime_env_http_final branch from 7a7cc62 to bf3b73f Compare August 2, 2023 22:05

github-actions bot commented Aug 2, 2023

Attention: External code changed

This PR changes code that is used or cited in external sources, e.g. blog posts.

Before merging this PR, please make sure that the code in the external sources is still working, and consider updating them to reflect the changes.

The affected files and the external sources are:

Contributor Author

rynewang commented Aug 3, 2023

list of failed unit tests now:

Looks like there are no newly introduced failures.

@aslonnie aslonnie removed the request for review from a team August 3, 2023 00:35
@rkooo567 rkooo567 merged commit 55dbf09 into ray-project:master Aug 3, 2023
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023