[Serve] Look into request got stuck issue #47585

Closed
GeneDer opened this issue Sep 10, 2024 · 4 comments

@GeneDer
Contributor

GeneDer commented Sep 10, 2024

https://ray.slack.com/archives/CNCKBBRJL/p1724281728601689?thread_ts=1723916926.382399&cid=CNCKBBRJL

@GeneDer GeneDer self-assigned this Sep 10, 2024
@GeneDer
Contributor Author

GeneDer commented Sep 19, 2024

This is actually from an Anyscale customer, ofOne, and was brought up again at https://anyscaleteam.slack.com/archives/C04UR8J709K/p1726705443604199

We were able to download their logs and found some example requests that got into this state. For example, request ID 85ece69a-4d51-4ab7-ade8-209f3b702ebf has the logs below:

(ray) gene@geneanyscale2023 test_grcp_2 % grep -r "85ece69a-4d51-4ab7-ade8-209f3b702ebf" /tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,573 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / app.py:71 - Ray health check initiated.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err","message":"INFO 2024-09-18 22:21:50,573 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / app.py:71 - Ray health check initiated."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err","message":"INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,392", "levelname": "WARNING", "message": "Replica at capacity of max_ongoing_requests=5, rejecting request 85ece69a-4d51-4ab7-ade8-209f3b702ebf.", "filename": "replica.py", "lineno": 556, "log_to_stderr": false, "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,572", "levelname": "INFO", "message": "Started executing request to method '__call__'.", "filename": "replica.py", "lineno": 1152, "log_to_stderr": false, "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,573", "levelname": "INFO", "message": "Ray health check initiated.", "filename": "app.py", "lineno": 71, "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,574", "levelname": "INFO", "message": "__CALL__ OK 2.5ms", "filename": "replica.py", "lineno": 408, "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
(ray) gene@geneanyscale2023 test_grcp_2 %

Note that the __CALL__ OK is logged at 2024-09-18 22:21:50,574:

/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms

but at 2024-09-18 22:22:13,431 the proxy is still retrying and failing to schedule the same request:

/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.

We still don't have a good way to reproduce this, but it looks like a race condition between the scheduler and the replica.

@GeneDer
Contributor Author

GeneDer commented Sep 24, 2024

This seems to be reproducible with the following deployment setup:

import asyncio
import logging
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle


webapp = FastAPI()
logger = logging.getLogger(__name__)


@serve.deployment(
    autoscaling_config={
        "max_ongoing_requests": 5,
        "min_replicas": 3,
        "max_replicas": 3,
    },
)
@serve.ingress(webapp)
class Ingress:
    def __init__(self, pipeline: DeploymentHandle):
        self.pipeline = pipeline

    @webapp.get("/")
    async def health_check(self) -> str:
        logger.info("Ray health check initiated.")
        return "Healthy!"

    @webapp.post("/conversations")
    async def start_conversation(self):
        logger.info("in start_conversation before sleep")
        await asyncio.gather(
            self.pipeline.start.remote(),
        )
        logger.info("in start_conversation after sleep")


@serve.deployment
class Pipeline:
    async def start(self):
        # Long-running downstream call (~5 minutes) that keeps requests in flight.
        await asyncio.sleep(310)


app = Ingress.bind(Pipeline.bind())

Deploy to an Anyscale service with anyscale service deploy --name gene-test-timeout2 deployment:app (the module path assumes the code above is saved as deployment.py).
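
For local debugging, something like the following should also work without Anyscale (a sketch, assuming the code above is saved as deployment.py and Serve's default local HTTP address; not verified):

import requests
from ray import serve

from deployment import app  # the Ingress/Pipeline graph defined above

serve.run(app, name="app1")  # serves on http://127.0.0.1:8000 by default
print(requests.get("http://127.0.0.1:8000/").text)  # hits the health check route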

Then use the following code to call the service

import requests
import ray


@ray.remote
def query_get(url, token):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(url, headers=headers)
    print(response.text)

@ray.remote
def query_post(url, token):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.post(url, headers=headers)
    print(response.text)

service_url = "https://gene-test-timeout2-bxauk.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/conversations"
token = "ae_1me8Z5Mbm-hJ4eFEyzB01S33cvNQd4OmR6jtkrQE"
ray.get([query_post.remote(service_url, token) for _ in range(11)])

@zcin
Contributor

zcin commented Sep 30, 2024

So there are two issues:

  1. If a request is cancelled while the scheduler is trying to find an available replica for it, the scheduler does not remove that request from processing until a replica becomes available (at which point it simply bypasses the request). There is an open PR fixing that issue: [serve] Stop scheduling task early when requests have been cancelled #47847. In this case the request is correctly cancelled; the cancellation is just sometimes delayed.
  2. Recursive cancellation from Ray doesn't cover all cases: if the scheduler is in the middle of assigning the request to a replica and an asyncio.CancelledError isn't sent directly to the router (which isn't guaranteed to happen), the request won't get cancelled at all, and it could still be forwarded to the downstream deployment (see the sketch below). For this we will have to implement full recursive cancellation in the Ray Serve layer.
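
Until that lands, an application-level mitigation for the second point is to cancel in-flight downstream responses explicitly when the caller is cancelled. This is a minimal sketch (a variant of the Ingress handler from the repro above, not Ray Serve internals), and it assumes the response object returned by .remote() supports cancel():

import asyncio

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Ingress:
    def __init__(self, pipeline: DeploymentHandle):
        self.pipeline = pipeline

    async def start_conversation(self):
        response = self.pipeline.start.remote()
        try:
            return await response
        except asyncio.CancelledError:
            # If this request is cancelled, cancel the downstream request
            # explicitly instead of relying on recursive cancellation.
            response.cancel()
            raise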

zcin added a commit that referenced this issue Oct 8, 2024
…47847)

In `fulfill_pending_requests`, there are two nested loops:
- the outer loop greedily fulfills more requests so that if backoff
doesn't occur, it's not necessary for new asyncio tasks to be started to
fulfill each request
- the inner loop handles backoff if replicas can't be found to fulfill
the next request

The outer loop will be stopped if there are enough tasks to handle all
pending requests. However if all replicas are at max capacity, it's
possible for the inner loop to continue to loop even when the task is no
longer needed (e.g. when a request has been cancelled), because the
inner loop simply continues to try to find an available replica without
checking if the current task is even necessary.

This PR makes sure that at the end of each iteration of the inner loop,
it clears out requests in `pending_requests_to_fulfill` that have been
cancelled, and then breaks out of the loop if there are enough tasks to
handle the remaining requests.

Tests:
- Added a test that tests for the scenario where a request is cancelled
while it's trying to find an available replica
- Also modified the tests in `test_pow_2_scheduler.py` so that the
backoff sequence is small values (1ms), and the timeouts in the tests
are also low `10ms`, so that the unit tests run much faster (~5s now
compared to ~30s before).

## Related issue number

related: #47585

---------

Signed-off-by: Cindy Zhang <[email protected]>
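
A rough illustration of the loop structure and the fix described in this commit; the names (pending, choose_replica, PendingRequest) are simplified stand-ins, not the actual Ray Serve scheduler code:

import asyncio
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingRequest:
    future: asyncio.Future  # resolved with a replica once the request is assigned


async def fulfill_pending_requests(pending: deque, num_scheduling_tasks: int,
                                    choose_replica, backoff_s: float = 0.025):
    while pending:  # outer loop: greedily keep assigning pending requests
        # Inner loop: back off until some replica has capacity.
        while (replica := choose_replica()) is None:
            # The fix: drop requests whose callers have already cancelled...
            while pending and pending[0].future.cancelled():
                pending.popleft()
            # ...and stop this task if the other scheduling tasks can cover
            # the remaining pending requests.
            if num_scheduling_tasks - 1 >= len(pending):
                return
            await asyncio.sleep(backoff_s)
        if pending:
            request = pending.popleft()
            if not request.future.cancelled():
                request.future.set_result(replica)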
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this issue Oct 15, 2024
@zcin
Contributor

zcin commented Oct 29, 2024

addressed by #47873

@zcin zcin closed this as completed Oct 29, 2024