Failed in get: Hummock error: ObjectStore failed with IO error #7002
Comments
Happened again: https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1672121457286999 Namespace:
Why don't we wait and retry here rather than panicking?
It's retried 3 times here, but that doesn't help.
In rwc-3-longevity-20230104-180851 we use c5a.8xlarge instances (10 Gbps network capacity) for compute nodes. Testing a larger maximum number of retry attempts.
+1. We should retry with a random delay rather than panicking here.
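A minimal sketch of the suggestion above, written in Python for brevity rather than in the Rust used by RisingWave's object store client. It retries an operation with exponential backoff plus "full jitter" (a random delay between 0 and a growing cap) and, once the attempts are exhausted, re-raises the error to the caller instead of crashing the process. The function name retry_with_jitter, its default values, and treating OSError (which covers ConnectionError and TimeoutError) as retryable are hypothetical choices, not RisingWave's actual code or configuration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    op: Callable[[], T],
    max_attempts: int = 5,      # hypothetical default, not RisingWave's setting
    base_delay: float = 0.1,    # upper bound (seconds) of the first backoff delay
    max_delay: float = 5.0,     # cap on any single backoff delay, in seconds
    is_retryable: Callable[[BaseException], bool] = lambda e: isinstance(e, OSError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Run op(), retrying retryable errors with exponential backoff and full jitter.

    Hypothetical helper: the last error is re-raised so the caller can decide
    how to handle it, instead of the whole process panicking.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as err:
            # Non-retryable error, or this was the last allowed attempt: propagate it.
            if not is_retryable(err) or attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random time in [0, min(max_delay, base_delay * 2**attempt)].
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

Randomizing the delay spreads retries from many concurrent reads over time, so they are less likely to hit an already saturated network link at the same instant. The sleep and rng parameters exist only so the behavior can be tested deterministically.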
Increasing connect_timeout does work around this issue (the default is 3.1s; I used 60s, which is large enough but not a practical value).
Agreed, not a good idea.
The kernel manages CPU resources by keeping parallelism no higher than the number of CPUs, and manages memory resources by having …; it feels somewhat strange for it to manage network resources reactively, although being proactive does seem harder. Made-up cases:
This seems to be caused by insufficient network bandwidth.
rwc-3-longevity-20230131-171156
We encountered another case in which transmitted bytes, transmitted packets, and currently established TCP connections are all low. In this case we observed two types of errors, both from hyper:
Fixed by #9160.
Describe the bug
Slack link:
https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1671603133726859
Namespace:
rwc-3-longevity-20221220-180642
Pod:
risingwave-compute-2
To Reproduce
No response
Expected behavior
No response
Additional context
Or is this expected?