Creating thread local cluster with LoadBalancerType::OriginalDst can't find its ClusterData in active_clusters_ and crash #7500

Closed
l8huang opened this issue Jul 8, 2019 · 10 comments · Fixed by #7820

l8huang (Contributor) commented Jul 8, 2019

Title: Creating thread local cluster with LoadBalancerType::OriginalDst can't find its ClusterData in active_clusters_ and crash

Description:
In our k8s+Istio environment, Envoy sometimes crashes while constructing a ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry whose load balancer type is LoadBalancerType::OriginalDst. The constructor looks up its ClusterData in active_clusters_ via std::map::at(), and std::out_of_range is thrown because the element does not exist:

    case LoadBalancerType::OriginalDst: {
      ASSERT(lb_factory_ == nullptr);
      lb_ = std::make_unique<OriginalDstCluster::LoadBalancer>(
          priority_set_, parent.parent_.active_clusters_.at(cluster->name())->cluster_,
          cluster->lbOriginalDstConfig());
      break;
    }

In the normal case it looks like this shouldn't happen:

  • CdsApiImpl::onConfigUpdate() ensures to_add_repeated and to_remove_repeated are mutually exclusive.
  • ClusterManagerImpl::addOrUpdateCluster() makes sure active_clusters_ contains the needed ClusterData.

So this might be a concurrency issue. I suspect there were two config updates: the first created a cluster, and the second removed it before the ThreadLocalClusterManagerImpl::ClusterEntry was constructed on the worker threads.

The Envoy version is:

version: 4f5b5e101a081e05924990b1903d9d46553558d4/1.11.0-dev/Clean/RELEASE/BoringSSL

Considering Envoy guarantees eventual consistency, I guess the solution could be adding defensive code that handles the missing entry gracefully instead of crashing (a rough sketch is below). But from a design point of view, is there a rule that worker threads should not access active_clusters_ directly? I just want to figure out the suitable way to resolve this kind of issue ;)
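
For illustration only, this is roughly what such defensive code could look like, assuming the same structure as the snippet quoted above. Note it only avoids the throw; it does not answer whether a worker thread should read active_clusters_ at all:

    case LoadBalancerType::OriginalDst: {
      ASSERT(lb_factory_ == nullptr);
      // Use find() instead of at(): a cluster removed between the CDS update and
      // this worker callback simply results in no load balancer being created,
      // rather than an uncaught std::out_of_range.
      auto it = parent.parent_.active_clusters_.find(cluster->name());
      if (it != parent.parent_.active_clusters_.end()) {
        lb_ = std::make_unique<OriginalDstCluster::LoadBalancer>(
            priority_set_, it->second->cluster_, cluster->lbOriginalDstConfig());
      }
      break;
    }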

If a similar issue was fixed before, could you please let me know the PR or issue number?

Repro steps:
We don't have a reliable way to reproduce this issue right now; some cluster info related to the crash is:

(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->type_
$6 = envoy::api::v2::Cluster_DiscoveryType_ORIGINAL_DST   

(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->added_via_api_
$7 = true  

(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->lb_type_   
$1 = Envoy::Upstream::LoadBalancerType::OriginalDst

(gdb) p cluster_manager.parent_.init_helper_.state_
$12 = Envoy::Upstream::ClusterManagerInitHelper::State::AllClustersInitialized

(gdb) p cluster_manager.parent_.warming_clusters_ 
$21 = std::map with 0 elements

Call Stack:

(gdb) bt
#0  0x00007f004e85c428 in __GI_raise (sig=sig@entry=6) 
     at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f004e85e02a in __GI_abort () 
     at abort.c:89
#2  0x0000000001063375 in Envoy::TerminateHandler::logOnTerminate() const::$_0::operator()() const (this=<optimized out>) 
     at source/exe/terminate_handler.cc:15
#3  0x0000000001063279 in Envoy::TerminateHandler::logOnTerminate() const::$_0::__invoke() () 
     at source/exe/terminate_handler.cc:12
#4  0x00000000011d55a6 in __cxxabiv1::__terminate(void (*)()) ()
#5  0x00000000011d55f1 in std::terminate() ()
#6  0x00000000011e1794 in __cxa_throw ()
#7  0x00000000011eb9ef in std::__throw_out_of_range(char const*) ()
#8  0x0000000000b54f27 in std::map<
            std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, 
            std::unique_ptr<Envoy::Upstream::ClusterManagerImpl::ClusterData, std::default_delete<Envoy::Upstream::ClusterManagerImpl::ClusterData> >, 
            std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, 
            std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > 
            const, std::unique_ptr<Envoy::Upstream::ClusterManagerImpl::ClusterData, std::default_delete<Envoy::Upstream::ClusterManagerImpl::ClusterData> > > > >
            ::at (this=<optimized out>, __k=...) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/stl_map.h:533
#9  0x0000000000b5c8db in Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::ClusterEntry (this=0x438b4000, parent=..., cluster=..., lb_factory=...) 
     at source/common/upstream/cluster_manager_impl.cc:1099
#10 0x0000000000b60c8c in Envoy::Upstream::ClusterManagerImpl::createOrUpdateThreadLocalCluster(Envoy::Upstream::ClusterManagerImpl::ClusterData&)::$_10::operator()() const (this=0x1ecabd10) 
     at source/common/upstream/cluster_manager_impl.cc:530
#11 std::_Function_handler<void (), Envoy::Upstream::ClusterManagerImpl::createOrUpdateThreadLocalCluster(Envoy::Upstream::ClusterManagerImpl::ClusterData&)::$_10>::_M_invoke(std::_Any_data const&) (__functor=...) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:316
#12 0x0000000000b1f10d in std::function<void ()>::operator()() const (this=<optimized out>) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
#13 Envoy::Event::DispatcherImpl::runPostCallbacks (this=0x2bf1200) 
     at source/common/event/dispatcher_impl.cc:198
#14 0x0000000000b1efbe in Envoy::Event::DispatcherImpl::run (this=0x2bf1200, type=Envoy::Event::Dispatcher::RunType::Block) 
     at source/common/event/dispatcher_impl.cc:177
#15 0x0000000000b19442 in Envoy::Server::WorkerImpl::threadRoutine (this=0x2c3f2c0, guard_dog=...) 
     at source/server/worker_impl.cc:104
#16 0x00000000010639e5 in std::function<void ()>::operator()() const (this=0x18) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
#17 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::operator()(void*) const (this=<optimized out>, arg=0x0) 
     at source/common/common/posix/thread_impl.cc:39
#18 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::__invoke(void*) (arg=0x0) 
     at source/common/common/posix/thread_impl.cc:38
#19 0x00007f004ef016ba in start_thread (arg=0x7f0030feb700) 
     at pthread_create.c:333
#20 0x00007f004e92e41d in clone () 
     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
alyssawilk added the bug label Jul 9, 2019
alyssawilk (Contributor) commented:

I'm not 100% sure who could best look into this.
@jrajahalme or @ramaraochavali, would either of you be able to check this out?

l8huang (Contributor, Author) commented Jul 9, 2019

I added a simple test case to test/integration_test.cc to reproduce this issue (it's CdsClusterCreateDelete in commit l8huang@279ae5f):

# bazel  test -c dbg  //test/integration:cds_integration_test      --test_env=ENVOY_IP_TEST_VERSIONS=v4only     --test_output=streamed --cache_test_results=no     --test_arg="-l info" 
WARNING: The following rc files are no longer being read, please transfer their contents or import their path into one of the standard rc files:
/develop/envoy/tools/bazel.rc
WARNING: Streamed test output requested. All tests will be run locally, without sharding, one at a time
INFO: Build options --run_under, --test_arg, --test_env, and 1 more have changed, discarding analysis cache.
INFO: Analyzed target //test/integration:cds_integration_test (0 packages loaded, 13781 targets configured).
INFO: Found 1 test target...
WARNING: Perftools heap leak checker is active -- Performance may suffer
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from IpVersionsClientTypeDelta/CdsIntegrationTest
[ RUN      ] IpVersionsClientTypeDelta/CdsIntegrationTest.CdsClusterUpDownUp/0
[       OK ] IpVersionsClientTypeDelta/CdsIntegrationTest.CdsClusterUpDownUp/0 (2250 ms)
[ RUN      ] IpVersionsClientTypeDelta/CdsIntegrationTest.CdsClusterUpDownUp/1
[       OK ] IpVersionsClientTypeDelta/CdsIntegrationTest.CdsClusterUpDownUp/1 (3167 ms)
[ RUN      ] IpVersionsClientTypeDelta/CdsIntegrationTest.CdsClusterCreateDelete/0
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
[2019-07-09 17:47:23.029][42][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Aborted, suspect faulting address 0xe
[2019-07-09 17:47:23.029][42][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:69] Backtrace (use tools/stack_decode.py to get line numbers):
[2019-07-09 17:47:23.050][42][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #0: Envoy::SignalAction::sigHandler() [0x1a5d36e]
[2019-07-09 17:47:23.050][42][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #1: __restore_rt [0x7fd5524dc390]
external/bazel_tools/tools/test/test-setup.sh: line 310:    14 Aborted                 "${TEST_PATH}" "$@" 2>&1
FAIL: //test/integration:cds_integration_test (see /root/.cache/bazel/_bazel_root/8a829d5918f76371d36df361a9235db6/execroot/envoy/bazel-out/k8-dbg/testlogs/test/integration/cds_integration_test/test.log)
Target //test/integration:cds_integration_test up-to-date:
  bazel-bin/test/integration/cds_integration_test
INFO: Elapsed time: 14.577s, Critical Path: 12.96s
INFO: 2 processes: 2 linux-sandbox.
INFO: Build completed, 1 test FAILED, 2 total actions

I'm trying to make a patch that fixes this issue in a reasonable way (I'll try defensive code first); if you have any suggestions, please let me know ;)

mattklein123 added this to the 1.12.0 milestone Jul 9, 2019
mattklein123 added the help wanted (Needs help!) label Jul 9, 2019
l8huang (Contributor, Author) commented Jul 10, 2019

I'm working on fixing this issue. The basic idea is to pass a ClusterSharedPtr instead of a ClusterInfoConstSharedPtr when constructing ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry:

 ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::ClusterEntry(
-    ThreadLocalClusterManagerImpl& parent, ClusterInfoConstSharedPtr cluster,
+    ThreadLocalClusterManagerImpl& parent, ClusterSharedPtr cluster,
     const LoadBalancerFactorySharedPtr& lb_factory)

This way, the worker thread doesn't need to access active_clusters_ at all to get the ClusterSharedPtr (rough sketch below).
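
As an illustration only (this is not the actual patch, and the accessor call below is an assumption), the OriginalDst branch could then use the cluster passed in by the main thread instead of looking it up:

    case LoadBalancerType::OriginalDst: {
      ASSERT(lb_factory_ == nullptr);
      // 'cluster' is the ClusterSharedPtr handed over by the main thread, so no
      // lookup in active_clusters_ happens on the worker thread.
      lb_ = std::make_unique<OriginalDstCluster::LoadBalancer>(
          priority_set_, cluster, cluster->info()->lbOriginalDstConfig());
      break;
    }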

mattklein123 (Member) commented:

I don't think ^ is the correct fix, as the cluster is not generally thread safe, only the info. Do you have a summary of what the issue is?

jrajahalme (Contributor) commented:

I think the analysis above is correct: accessing active_clusters_ from a worker thread is wrong, as active_clusters_ is not thread safe. This originates from my initial commit that added support for OriginalDstCluster.

I checked for other cases of improper access and found that the constructor of ThreadLocalClusterManagerImpl, which is called from ClusterManagerImpl via SlotImpl::set(), also accesses active_clusters_. This seems intentional, as active_clusters_ is protected rather than private. However, the initializeCb given to SlotImpl::set() is called on each thread:

  /**
   * Set thread local data on all threads previously registered via registerThread().
   * @param initializeCb supplies the functor that will be called *on each thread*. The functor
   *                     returns the thread local object which is then stored. The storage is via
   *                     a shared_ptr. Thus, this is a flexible mechanism that can be used to share
   *                     the same data across all threads or to share different data on each thread.
   */

So the problem of improper access to active_clusters_ is a bit larger. The simplest fix would be to protect all access to active_clusters_ with an absl::Mutex, as done in source/common/upstream/thread_aware_lb_impl.cc for the members annotated with GUARDED_BY(mutex_) (see the sketch below). Maybe there is a more elegant fix out there as well?
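
A minimal sketch of that pattern follows; the class and member names are made up for illustration and are not the actual ClusterManagerImpl members:

    #include <map>
    #include <memory>
    #include <string>

    #include "absl/base/thread_annotations.h"
    #include "absl/synchronization/mutex.h"

    struct ClusterData {};  // stand-in for ClusterManagerImpl::ClusterData

    class GuardedClusterMap {
    public:
      // Main thread: add or replace a cluster entry.
      void addOrUpdate(const std::string& name, std::unique_ptr<ClusterData> data) {
        absl::MutexLock lock(&mutex_);
        clusters_[name] = std::move(data);
      }

      // Main thread: remove a cluster entry.
      void remove(const std::string& name) {
        absl::MutexLock lock(&mutex_);
        clusters_.erase(name);
      }

      // Any thread: nullptr means the cluster has already been removed. The
      // returned pointer's lifetime is still the caller's problem, which is the
      // lifecycle concern raised in the next comment.
      ClusterData* find(const std::string& name) {
        absl::MutexLock lock(&mutex_);
        auto it = clusters_.find(name);
        return it == clusters_.end() ? nullptr : it->second.get();
      }

    private:
      absl::Mutex mutex_;
      std::map<std::string, std::unique_ptr<ClusterData>> clusters_ ABSL_GUARDED_BY(mutex_);
    };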

l8huang (Contributor, Author) commented Jul 10, 2019

Thanks @jrajahalme for the summary.

It looks like protecting active_clusters_ with a mutex wouldn't help: a ClusterSharedPtr is added to and removed from active_clusters_ sequentially by the main thread, so no mutex is needed for that. The problem is that when a worker thread callback accesses active_clusters_ to get the ClusterSharedPtr, it may already have been removed by the main thread. This is more of a lifecycle problem than a mutex protection issue.

Considering ClusterSharedPtr is a shared pointer (ClusterDataPtr is a unique pointer whose owner is active_clusters_), I think it's fine to let the callback capture it and use it to create the OriginalDstCluster's load balancer, which holds the cluster via a weak pointer (sketch below). If there are any concurrency issues when using the ClusterSharedPtr in a worker thread (e.g. locking the weak pointer, modifying the cluster), the cluster implementation should take care of that itself.
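
A minimal standalone sketch of that idea (all names here are hypothetical stand-ins, not Envoy's actual types):

    #include <functional>
    #include <memory>

    struct Cluster {};  // stand-in for Envoy::Upstream::Cluster

    // Stand-in for the load balancer: it holds only a weak_ptr, so it never
    // extends the cluster's lifetime after the main thread removes it.
    struct OriginalDstLb {
      explicit OriginalDstLb(const std::shared_ptr<Cluster>& cluster) : cluster_(cluster) {}

      bool pickHost() {
        if (auto c = cluster_.lock()) {
          return true;  // cluster is still alive; use it
        }
        return false;   // cluster was removed; fail gracefully instead of crashing
      }

      std::weak_ptr<Cluster> cluster_;
    };

    // Built on the main thread: the shared_ptr is copied into the closure, so the
    // worker-side construction never reads active_clusters_ at all.
    std::function<void()> makeWorkerInitCb(std::shared_ptr<Cluster> cluster) {
      return [cluster]() {
        OriginalDstLb lb(cluster);
        lb.pickHost();
      };
    }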

jrajahalme (Contributor) commented Jul 10, 2019

STL map operations are not thread safe for concurrent reads and writes, so this is a more general concurrent-access problem. For all we know, the worker thread may crash just by reading from the map!

In more detail, a worker thread may read from a map that is in an inconsistent state, e.g. partially updated by the main thread in the middle of an insert operation, perhaps while it is reallocating to make more space.

l8huang (Contributor, Author) commented Jul 10, 2019

Sorry for the confusion. I meant the callback can capture the ClusterSharedPtr, so the worker thread doesn't need to access active_clusters_ at all.

Here is a patch for fixing this issue: l8huang@08f9c01

Let me know if this fix is ok.

l8huang pushed a commit to l8huang/envoy that referenced this issue Jul 11, 2019
If an original_dst cluster is deleted right after it is created, its creation
callback on a worker thread may not find its ClusterData in
ClusterManagerImpl::active_clusters_ and crashes with a std::out_of_range
exception, because it looks the ClusterData up via std::map::at().

This change makes the thread local cluster creation callback capture the
ClusterSharedPtr directly and use it to create the OriginalDstCluster's load
balancer, which holds the cluster via a weak pointer. This way, the worker
thread no longer needs to access ClusterManagerImpl::active_clusters_.

Testing: Add cds integration test, create/delete original_dst cluster repeatedly.
Fixes issue envoyproxy#7500

Signed-off-by: lhuang8 <[email protected]>
mattklein123 (Member) commented:

I will take a look at this in more detail tomorrow and advise on the fix.

jrajahalme (Contributor) commented:

@l8huang Your fix looks good to me, and resolves the bug you found. I left a couple of comments to remind you to clean up some testing/debugging related changes you still have in there.
However, the issue of invalid access to active_clusters_ in the constructor of ThreadLocalClusterManagerImpl still remains.

jrajahalme added a commit to jrajahalme/envoy that referenced this issue Aug 3, 2019
Use ThreadAwareLoadBalancer in OriginalDstCluster to pass
OriginalDstClusterSharedPtr to LoadBalancer.

Fixes: envoyproxy#7500
Signed-off-by: Jarno Rajahalme <[email protected]>