
Failed in get: Hummock error: ObjectStore failed with IO error #7002

Closed · Tracked by #6640
lmatz opened this issue Dec 21, 2022 · 16 comments

Labels: found-by-longevity-test · help wanted (Issues that need help from contributors) · type/bug (Something isn't working)

Comments

@lmatz (Contributor) commented Dec 21, 2022

Describe the bug

Slack link:
https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1671603133726859
Namespace: rwc-3-longevity-20221220-180642
Pod: risingwave-compute-2

2022-12-20T18:11:34.875003Z ERROR risingwave_storage::monitor::monitored_store: Failed in get: Hummock error: ObjectStore failed with IO error Internal error: read "rls-apse1-eks-a-rwc-3-longevity-20221220-180642/255/1.data" in block Some(BlockLocation { offset: 11008083, size: 37158 }) failed, error: timeout: error trying to connect: HTTP connect timeout occurred after 3.1s
  backtrace of `ObjectError`:
   0: <risingwave_object_store::object::error::ObjectError as core::convert::From<risingwave_object_store::object::error::ObjectErrorInner>>::from
             at ./risingwave/src/object_store/src/object/error.rs:38:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/convert/mod.rs:726:9
   2: <risingwave_object_store::object::error::ObjectError as core::convert::From<aws_smithy_http::result::SdkError<E>>>::from
             at ./risingwave/src/object_store/src/object/error.rs:81:9
   3: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at ./rustc/bdb07a8ec8e77aa10fb84fae1d4ff71c21180bb4/library/core/src/result.rs:2108:27
   4: <risingwave_object_store::object::s3::S3ObjectStore as risingwave_object_store::object::ObjectStore>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/s3.rs:351:20
   5: <async_stack_trace::StackTraced<F,_> as core::future::future::Future>::poll
   6: risingwave_object_store::object::MonitoredObjectStore<OS>::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:643:13
   7: risingwave_object_store::object::ObjectStoreImpl::read::{{closure}}
             at ./risingwave/src/object_store/src/object/mod.rs:334:9
   8: risingwave_storage::hummock::sstable_store::SstableStore::sstable::{{closure}}::{{closure}}::{{closure}}
             at ./risingwave/src/storage/src/hummock/sstable_store.rs:346:25
   9: risingwave_common::cache::LruCache<K,T>::lookup_with_request_dedup::{{closure}}::{{closure}}
             at ./risingwave/src/common/src/cache.rs:818:58
  10: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9

To Reproduce

No response

Expected behavior

No response

Additional context

Or is this expected?

@lmatz lmatz added the type/bug Something isn't working label Dec 21, 2022
@github-actions github-actions bot added this to the release-0.1.16 milestone Dec 21, 2022
@lmatz (Contributor, Author) commented Dec 27, 2022

happened again in https://risingwave-labs.slack.com/archives/C048NM5LNKX/p1672121457286999

Namespace: rwc-3-longevity-20221226-180525
Pod: risingwave-compute-1

@zwang28's comment was marked as outdated.

@liurenjie1024 (Contributor) commented:

Why don't we wait and retry instead of panicking here?

@zwang28 (Contributor) commented Jan 3, 2023

> Why don't we wait and retry instead of panicking here?

It is already retried 3 times here, which doesn't work out.

@zwang28 (Contributor) commented Jan 5, 2023

In rwc-3-longevity-20230104-180851 we are using c5a.8xlarge (10 Gbps network capacity) for compute nodes.
The rate of (node_network_transmit_bytes + node_network_receive_bytes) does reach 10 Gbps.
We should apply more fixes besides merely increasing the node's network capacity.

Testing larger retry max attempts.

@liurenjie1024 (Contributor) commented:

> In rwc-3-longevity-20230104-180851 we are using c5a.8xlarge (10 Gbps network capacity) for compute nodes. The rate of (node_network_transmit_bytes + node_network_receive_bytes) does reach 10 Gbps. We should apply more fixes besides merely increasing the node's network capacity.

+1. We should add random latency (jitter) to retries rather than panicking here.
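
For illustration, a minimal sketch of what "retry with random latency" could look like, assuming a tokio runtime and the rand crate; the helper name and backoff parameters are hypothetical, not RisingWave's actual retry path:

```rust
use rand::Rng;
use std::future::Future;
use std::time::Duration;

/// Hypothetical helper: retry an async operation with exponential backoff and
/// full jitter, so that many readers hitting the same connect timeout do not
/// retry in lockstep.
async fn retry_with_jitter<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut cap = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                // Sleep for a random duration in [0, cap), then double the cap.
                let jitter_ms = rand::thread_rng().gen_range(0..cap.as_millis() as u64);
                tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
                cap = (cap * 2).min(Duration::from_secs(10));
                attempt += 1;
            }
        }
    }
}
```

A read could then be wrapped as `retry_with_jitter(|| object_store.read(path, block_loc), 3)`, where `object_store.read` stands in for whatever the actual read call is.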

@zwang28 (Contributor) commented Jan 7, 2023

Increasing the connect_timeout does work around this issue (3.1s by default; I used 60s, which is large enough but not a practical value).
But I'm not sure it's a good idea to increase it. Could this kind of IO error instead be an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on each worker node?
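
For reference, a minimal sketch of raising the connect timeout on an aws-sdk-rust S3 client via its timeout configuration; exact builder names may differ between SDK versions, and the 60s value is only the experimental workaround mentioned above, not a recommended setting:

```rust
use std::time::Duration;

use aws_config::timeout::TimeoutConfig;

// Sketch: build an S3 client whose connect timeout is raised from the
// default (~3.1s) to 60s. This only works around the "HTTP connect timeout
// occurred" error; it does not address the underlying resource pressure.
async fn build_s3_client() -> aws_sdk_s3::Client {
    let timeout_config = TimeoutConfig::builder()
        .connect_timeout(Duration::from_secs(60))
        .build();
    let sdk_config = aws_config::from_env()
        .timeout_config(timeout_config)
        .load()
        .await;
    aws_sdk_s3::Client::new(&sdk_config)
}
```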

@lmatz (Contributor, Author) commented Jan 7, 2023

> But I'm not sure it's a good idea to increase it.

Agreed, not a good idea.

> Could this kind of IO error instead be an indicator that the current cluster size cannot handle the workload, and that we should consider reducing the load on each worker node?

The kernel manages CPU resources proactively, by setting parallelism to no more than the number of CPUs, and manages memory resources proactively, by having GlobalMemoryManager evict state from time to time. It therefore feels somewhat strange to manage network resources reactively, although being proactive is admittedly the harder task.

Made-up cases:

  1. It's a temporary spike in network usage, and we could just slow down processing for a short period, e.g. another form of backpressure, and then return to normal. Rescheduling or scaling in/out may cost too much for this kind of network usage fluctuation, or may not be allowed because of (2).
  2. Users only want to allocate a certain amount of resources to some jobs and are willing to tolerate lower throughput and higher latency.

@fuyufjh (Member) commented Jan 30, 2023

Seems to be caused by insufficient bandwidth.

@fuyufjh fuyufjh closed this as completed Jan 30, 2023
@zwang28 (Contributor) commented Feb 1, 2023

rwc-3-longevity-20230131-171156
This time it does not seem to be caused by bandwidth.

@zwang28 zwang28 reopened this Feb 1, 2023
@zwang28 zwang28 modified the milestones: release-0.1.16, release-0.1.17 Feb 1, 2023
@zwang28's comment was marked as resolved.

@zwang28 (Contributor) commented Feb 2, 2023

In terms of network bandwidth, neither of these two failed cases reaches the 10 Gbps limit:

rwc-3-longevity-20230131-171156
rwc-3-longevity-20230201-170952

Will look into the S3 client SDK's connection cache/pool implementation first, if any. Update: none of its own; the SDK uses hyper.

@zwang28's comment was marked as outdated.

@zwang28 (Contributor) commented Feb 27, 2023

We encountered another case where network transmit bytes, transmit packets, and currently established TCP connections are all small.
The error kept occurring and failing recovery until we restarted the compute nodes, at which point the error was gone immediately 🤔. The restarted compute nodes are still on the same k8s nodes.
So I think it's not likely related to the EC2 quota this time.

BTW, in this case we observed two types of errors, both from hyper:

  • ObjectStore failed with IO error Internal error: read "****" in block Some(BlockLocation { offset: ***, size: *** }) failed, error: timeout: error trying to connect: HTTP connect timeout occurred after 3.1s
  • ObjectStore failed with IO error Internal error: channel closed.

@zwang28 zwang28 added the help wanted Issues that need help from contributors label Feb 27, 2023
@hzxa21 hzxa21 removed this from the release-0.18 milestone Mar 22, 2023
@lmatz (Contributor, Author) commented Mar 27, 2023

#8796

@zwang28 (Contributor) commented May 9, 2023

Fixed by #9160.

@zwang28 zwang28 closed this as completed May 9, 2023