[Serve] Look into request got stuck issue #47585

Closed
GeneDer opened this issue Sep 10, 2024 · 4 comments

@GeneDer
Contributor

GeneDer commented Sep 10, 2024

https://ray.slack.com/archives/CNCKBBRJL/p1724281728601689?thread_ts=1723916926.382399&cid=CNCKBBRJL

@GeneDer GeneDer self-assigned this Sep 10, 2024
@GeneDer
Contributor Author

GeneDer commented Sep 19, 2024

This is actually from an Anyscale customer, ofOne, and was brought up again at https://anyscaleteam.slack.com/archives/C04UR8J709K/p1726705443604199

We were able to download their logs and found some example requests that got into this state. For example, request ID 85ece69a-4d51-4ab7-ade8-209f3b702ebf has the logs below:

(ray) gene@geneanyscale2023 test_grcp_2 % grep -r "85ece69a-4d51-4ab7-ade8-209f3b702ebf" /tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,573 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / app.py:71 - Ray health check initiated.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err","message":"INFO 2024-09-18 22:21:50,573 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / app.py:71 - Ray health check initiated."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err","message":"INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,392", "levelname": "WARNING", "message": "Replica at capacity of max_ongoing_requests=5, rejecting request 85ece69a-4d51-4ab7-ade8-209f3b702ebf.", "filename": "replica.py", "lineno": 556, "log_to_stderr": false, "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,572", "levelname": "INFO", "message": "Started executing request to method '__call__'.", "filename": "replica.py", "lineno": 1152, "log_to_stderr": false, "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,573", "levelname": "INFO", "message": "Ray health check initiated.", "filename": "app.py", "lineno": 71, "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,574", "levelname": "INFO", "message": "__CALL__ OK 2.5ms", "filename": "replica.py", "lineno": 408, "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
(ray) gene@geneanyscale2023 test_grcp_2 %

Note that the __CALL__ OK is logged at 2024-09-18 22:21:50,574:

/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms

but at 2024-09-18 22:22:13,431 the proxy is still retrying and failing to schedule the same request:

/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.

We still don't have a good way to reproduce this, but it looks like a race condition between the scheduler and the replica.

@GeneDer
Contributor Author

GeneDer commented Sep 24, 2024

This seems to be reproducible with the following deployment setup:

import asyncio
import logging
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle


webapp = FastAPI()
logger = logging.getLogger(__name__)


@serve.deployment(
    autoscaling_config={
        "max_ongoing_requests": 5,
        "min_replicas": 3,
        "max_replicas": 3,
    },
)
@serve.ingress(webapp)
class Ingress:
    def __init__(self, pipeline: DeploymentHandle):
        self.pipeline = pipeline

    @webapp.get("/")
    async def health_check(self) -> str:
        logger.info("Ray health check initiated.")
        return "Healthy!"

    @webapp.post("/conversations")
    async def start_conversation(self):
        logger.info("in start_conversation before sleep")
        await asyncio.gather(
            self.pipeline.start.remote(),
        )
        logger.info("in start_conversation after sleep")


@serve.deployment
class Pipeline:
    async def start(self):
        # Long-running downstream call (~5 minutes) that keeps requests in flight.
        await asyncio.sleep(310)


app = Ingress.bind(Pipeline.bind())

Deploy to an Anyscale service with anyscale service deploy --name gene-test-timeout2 deployment:app (the module path assumes the code above is saved as deployment.py).
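
For local debugging, something like the following should also work without Anyscale (a sketch, assuming the code above is saved as deployment.py and Serve's default local HTTP address; not verified):

import requests
from ray import serve

from deployment import app  # the Ingress/Pipeline graph defined above

serve.run(app, name="app1")  # serves on http://127.0.0.1:8000 by default
print(requests.get("http://127.0.0.1:8000/").text)  # hits the health check route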

Then use the following code to call the service

import requests
import ray


@ray.remote
def query_get(url, token):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(url, headers=headers)
    print(response.text)

@ray.remote
def query_post(url, token):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.post(url, headers=headers)
    print(response.text)

service_url = "https://gene-test-timeout2-bxauk.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/conversations"
token = "ae_1me8Z5Mbm-hJ4eFEyzB01S33cvNQd4OmR6jtkrQE"
ray.get([query_post.remote(service_url, token) for _ in range(11)])

@zcin
Contributor

zcin commented Sep 30, 2024

So there are two issues:

  1. If a request is cancelled while the scheduler is trying to find an available replica for it, the scheduler does not remove that request from processing until a replica becomes available (at which point it simply bypasses the request). There is an open PR fixing that issue: [serve] Stop scheduling task early when requests have been cancelled #47847. In this case the request is correctly cancelled; the cancellation is just sometimes delayed.
  2. Recursive cancellation from Ray doesn't cover all cases: if the scheduler is in the middle of assigning the request to a replica and an asyncio.CancelledError isn't sent directly to the router (which isn't guaranteed to happen), the request won't get cancelled at all, and it could still be forwarded to the downstream deployment (see the sketch below). For this we will have to implement full recursive cancellation in the Ray Serve layer.
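
Until that lands, an application-level mitigation for the second point is to cancel in-flight downstream responses explicitly when the caller is cancelled. This is a minimal sketch (a variant of the Ingress handler from the repro above, not Ray Serve internals), and it assumes the response object returned by .remote() supports cancel():

import asyncio

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Ingress:
    def __init__(self, pipeline: DeploymentHandle):
        self.pipeline = pipeline

    async def start_conversation(self):
        response = self.pipeline.start.remote()
        try:
            return await response
        except asyncio.CancelledError:
            # If this request is cancelled, cancel the downstream request
            # explicitly instead of relying on recursive cancellation.
            response.cancel()
            raise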

zcin added a commit that referenced this issue Oct 8, 2024
…47847)

In `fulfill_pending_requests`, there are two nested loops:
- the outer loop greedily fulfills more requests so that if backoff
doesn't occur, it's not necessary for new asyncio tasks to be started to
fulfill each request
- the inner loop handles backoff if replicas can't be found to fulfill
the next request

The outer loop will be stopped if there are enough tasks to handle all
pending requests. However if all replicas are at max capacity, it's
possible for the inner loop to continue to loop even when the task is no
longer needed (e.g. when a request has been cancelled), because the
inner loop simply continues to try to find an available replica without
checking if the current task is even necessary.

This PR makes sure that at the end of each iteration of the inner loop,
it clears out requests in `pending_requests_to_fulfill` that have been
cancelled, and then breaks out of the loop if there are enough tasks to
handle the remaining requests.

Tests:
- Added a test that tests for the scenario where a request is cancelled
while it's trying to find an available replica
- Also modified the tests in `test_pow_2_scheduler.py` so that the
backoff sequence is small values (1ms), and the timeouts in the tests
are also low `10ms`, so that the unit tests run much faster (~5s now
compared to ~30s before).

## Related issue number

related: #47585

---------

Signed-off-by: Cindy Zhang <[email protected]>
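
A rough illustration of the loop structure and the fix described in this commit; the names (pending, choose_replica, PendingRequest) are simplified stand-ins, not the actual Ray Serve scheduler code:

import asyncio
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingRequest:
    future: asyncio.Future  # resolved with a replica once the request is assigned


async def fulfill_pending_requests(pending: deque, num_scheduling_tasks: int,
                                    choose_replica, backoff_s: float = 0.025):
    while pending:  # outer loop: greedily keep assigning pending requests
        # Inner loop: back off until some replica has capacity.
        while (replica := choose_replica()) is None:
            # The fix: drop requests whose callers have already cancelled...
            while pending and pending[0].future.cancelled():
                pending.popleft()
            # ...and stop this task if the other scheduling tasks can cover
            # the remaining pending requests.
            if num_scheduling_tasks - 1 >= len(pending):
                return
            await asyncio.sleep(backoff_s)
        if pending:
            request = pending.popleft()
            if not request.future.cancelled():
                request.future.set_result(replica)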
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this issue Oct 15, 2024
@zcin
Contributor

zcin commented Oct 29, 2024

addressed by #47873

@zcin zcin closed this as completed Oct 29, 2024