
Failed in get: Hummock error: ObjectStore failed with IO error #7002

Closed · Tracked by #6640
lmatz opened this issue Dec 21, 2022 · 16 comments

Labels: found-by-longevity-test · help wanted (Issues that need help from contributors) · type/bug (Something isn't working)

Comments

@lmatz (Contributor) commented Dec 21, 2022

Describe the bug

Slack link:
https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1671603133726859
Namespace: rwc-3-longevity-20221220-180642
Pod: risingwave-compute-2

2022-12-20T18:11:34.875003Z ERROR risingwave_storage::monitor::monitored_store: Failed in get: Hummock error: ObjectStore failed with IO error Internal error: read "rls-apse1-eks-a-rwc-3-longevity-20221220-180642/255/1.data" in block Some(BlockLocation { offset: 11008083, size: 37158 }) failed, error: timeout: error trying to connect: HTTP connect timeout occurred after 3.1s
  backtrace of `ObjectError`:
   0: <risingwave_object_store::object::error::ObjectError as core::convert::From<risingwave_object_store::object::error::ObjectErrorInner>>::from
             at ./risingwave/src/object_store/src/object/error.rs:38:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/convert/mod.rs:726:9
   2: <risingwave_object_store::object::error::ObjectError as core::convert::From<aws_smithy_http::result::SdkError<E>>>::from
             at ./risingwave/src/object_store/src/object/error.rs:81:9
   3: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/result.rs:2108:27
   4: <risingwave_object_store::object::s3::S3ObjectStore as risingwave_object_store::object::ObjectStore>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/s3.rs:351:20
   5: <async_stack_trace::StackTraced<F,_> as core::future::future::Future>::poll
   6: risingwave_object_store::object::MonitoredObjectStore<OS>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:643:13
   7: risingwave_object_store::object::ObjectStoreImpl::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:334:9
   8: risingwave_storage::hummock::sstable_store::SstableStore::sstable::{{closure}}::{{closure}}::{{closure}}
             at ./risingwave/src/storage/src/hummock/sstable_store.rs:346:25
   9: risingwave_common::cache::LruCache<K,T>::lookup_with_request_dedup::{{closure}}::{{closure}}
             at ./risingwave/src/common/src/cache.rs:818:58
  10: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9

To Reproduce

No response

Expected behavior

No response

Additional context

Or is this expected?

@lmatz lmatz added the type/bug Something isn't working label Dec 21, 2022
@github-actions github-actions bot added this to the release-0.1.16 milestone Dec 21, 2022
@lmatz (Contributor, Author) commented Dec 27, 2022

happened again in https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1672121457286999

Namespace: rwc-3-longevity-20221226-180525
Pod: risingwave-compute-1

@zwang28's comment was marked as outdated.

@liurenjie1024 (Contributor) commented:

Why don't we wait and retry instead of panicking here?

@zwang28 (Contributor) commented Jan 3, 2023

> Why don't we wait and retry instead of panicking here?

It is already retried 3 times here, which doesn't work out.

@zwang28 (Contributor) commented Jan 5, 2023

In rwc-3-longevity-20230104-180851 we are using c5a.8xlarge (10 Gbps network capacity) for compute nodes.
The rate of (node_network_transmit_bytes + node_network_receive_bytes) does reach 10 Gbps.
We should apply more fixes besides merely increasing the node's network capacity.

Testing larger retry max attempts.

@liurenjie1024 (Contributor) commented:

> In rwc-3-longevity-20230104-180851 we are using c5a.8xlarge (10 Gbps network capacity) for compute nodes. The rate of (node_network_transmit_bytes + node_network_receive_bytes) does reach 10 Gbps. We should apply more fixes besides merely increasing the node's network capacity.

+1. We should add random latency (jitter) to retries rather than panicking here.
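
For illustration, a minimal sketch of what "retry with random latency" could look like, assuming a tokio runtime and the rand crate; the helper name and backoff parameters are hypothetical, not RisingWave's actual retry path:

```rust
use rand::Rng;
use std::future::Future;
use std::time::Duration;

/// Hypothetical helper: retry an async operation with exponential backoff and
/// full jitter, so that many readers hitting the same connect timeout do not
/// retry in lockstep.
async fn retry_with_jitter<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut cap = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                // Sleep for a random duration in [0, cap), then double the cap.
                let jitter_ms = rand::thread_rng().gen_range(0..cap.as_millis() as u64);
                tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
                cap = (cap * 2).min(Duration::from_secs(10));
                attempt += 1;
            }
        }
    }
}
```

A read could then be wrapped as `retry_with_jitter(|| object_store.read(path, block_loc), 3)`, where `object_store.read` stands in for whatever the actual read call is.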

@zwang28 (Contributor) commented Jan 7, 2023

Increasing the connect_timeout does work around this issue (3.1s by default; I used 60s, which is large enough but not a practical value).
But I'm not sure it's a good idea to increase it. Could this kind of IO error instead be an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on each worker node?
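
For reference, a minimal sketch of raising the connect timeout on an aws-sdk-rust S3 client via its timeout configuration; exact builder names may differ between SDK versions, and the 60s value is only the experimental workaround mentioned above, not a recommended setting:

```rust
use std::time::Duration;

use aws_config::timeout::TimeoutConfig;

// Sketch: build an S3 client whose connect timeout is raised from the
// default (~3.1s) to 60s. This only works around the "HTTP connect timeout
// occurred" error; it does not address the underlying resource pressure.
async fn build_s3_client() -> aws_sdk_s3::Client {
    let timeout_config = TimeoutConfig::builder()
        .connect_timeout(Duration::from_secs(60))
        .build();
    let sdk_config = aws_config::from_env()
        .timeout_config(timeout_config)
        .load()
        .await;
    aws_sdk_s3::Client::new(&sdk_config)
}
```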

@lmatz (Contributor, Author) commented Jan 7, 2023

> But I'm not sure it's a good idea to increase it.

Agreed, not a good idea.

> Could this kind of IO error instead be an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on each worker node?

The kernel manages CPU resources proactively, by setting parallelism to no more than the number of CPUs, and manages memory resources proactively, by having GlobalMemoryManager evict state from time to time. It therefore feels somewhat strange to manage network resources reactively, although being proactive is admittedly the harder task.

Made-up cases:

  1. It's a temporary spike in network usage, and we could just slow down processing for a short period, e.g. another form of backpressure, and then return to normal. Rescheduling or scaling in/out may cost too much for this kind of network usage fluctuation, or may not be allowed because of (2).
  2. Users only want to allocate a certain amount of resources to some jobs and are willing to tolerate lower throughput and higher latency.

@fuyufjh (Member) commented Jan 30, 2023

Seems to be caused by insufficient bandwidth.

@fuyufjh fuyufjh closed this as completed Jan 30, 2023
@zwang28 (Contributor) commented Feb 1, 2023

rwc-3-longevity-20230131-171156
This time it does not seem to be caused by bandwidth.

@zwang28 zwang28 reopened this Feb 1, 2023
@zwang28 zwang28 modified the milestones: release-0.1.16, release-0.1.17 Feb 1, 2023
@zwang28's comment was marked as resolved.

@zwang28 (Contributor) commented Feb 2, 2023

In terms of network bandwidth, neither of these two failed cases reaches the 10 Gbps limit:

rwc-3-longevity-20230131-171156
rwc-3-longevity-20230201-170952

Will look into the S3 client SDK's connection cache/pool implementation first, if any. Update: none of its own; the SDK uses hyper.

@zwang28's comment was marked as outdated.

@zwang28 (Contributor) commented Feb 27, 2023

We encountered another case where network transmit bytes, transmit packets, and currently established TCP connections are all small.
The error kept occurring and failing recovery until we restarted the compute nodes, at which point the error was gone immediately 🤔. The restarted compute nodes are still on the same k8s nodes.
So I think it's not likely related to the EC2 quota this time.

BTW, in this case we observed two types of errors, both from hyper:

  • ObjectStore failed with IO error Internal error: read "****" in block Some(BlockLocation { offset: ***, size: *** }) failed, error: timeout: error trying to connect: HTTP connect timeout occurred after 3.1s
  • ObjectStore failed with IO error Internal error: channel closed.

@zwang28 zwang28 added the help wanted Issues that need help from contributors label Feb 27, 2023
@hzxa21 hzxa21 removed this from the release-0.18 milestone Mar 22, 2023
@lmatz (Contributor, Author) commented Mar 27, 2023

#8796

@zwang28 (Contributor) commented May 9, 2023

Fixed by #9160.

@zwang28 zwang28 closed this as completed May 9, 2023