-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flake in IpVersionsClientType/ExtensionDiscoveryIntegrationTest.ReloadBoth/1 #26254
Comments
bazel test --config=remote //test/integration:extension_discovery_integration_test --runs_per_test=1000 does repro this one. |
This one is super-odd - runOnAllThreads is not posting to a worker thread correctly... |
The root cause is a logic bug that still allows filter configs to be destroyed on workers, likely the same reason #24574 and other v8 !IsMainThread assertions. |
Nice debug - thanks for the in-flight fix! |
Commit Message: Ensure extension config is always deleted on the main thread. Additional Description: There is an implicit data race with the following interleaving: TLS runOnAllThreads invoked to clear the extension config. TLS deleted, a post is invoked on all workers to delete its storage. TLS runOnAllThreads on workers does nothing because of step 2). TLS runOnAllThreads runs the completion callback on the main thread. TLS deletion post invoked on some worker, deleting the last instance of the extension config. The workaround is to prevent 2) from happening while runOnAllThreads is running. This is done by capturing TLS in the completion callback, and explicitly resetting it and the associated filter config in the callback. Note that we cannot rely on the main_config shared pointer destruction implicitly. It seems the destructor can still execute on a worker, so we explicitly empty the content in the PR. It's not clear to me why some worker has a copy of the function, but it's hard to track down without AnyInvocable or move-only functions. Risk Level: low Testing: run the flakey test 10000 times, all pass Docs Changes: none Release Notes: none Platform Specific Features: none Fixes: #26254
Commit Message: Ensure extension config is always deleted on the main thread. Additional Description: There is an implicit data race with the following interleaving: TLS runOnAllThreads invoked to clear the extension config. TLS deleted, a post is invoked on all workers to delete its storage. TLS runOnAllThreads on workers does nothing because of step 2). TLS runOnAllThreads runs the completion callback on the main thread. TLS deletion post invoked on some worker, deleting the last instance of the extension config. The workaround is to prevent 2) from happening while runOnAllThreads is running. This is done by capturing TLS in the completion callback, and explicitly resetting it and the associated filter config in the callback. Note that we cannot rely on the main_config shared pointer destruction implicitly. It seems the destructor can still execute on a worker, so we explicitly empty the content in the PR. It's not clear to me why some worker has a copy of the function, but it's hard to track down without AnyInvocable or move-only functions. Risk Level: low Testing: run the flakey test 10000 times, all pass Docs Changes: none Release Notes: none Platform Specific Features: none Fixes: envoyproxy#26254 Signed-off-by: River Phillips <[email protected]>
https://dev.azure.com/cncf/envoy/_build/results?buildId=131697&view=logs&j=4e2e4662-2d21-5726-b4bc-9ff5cd6a179b&t=393ef8cf-9eb2-5a8a-e529-cceb42dc347e
[2023-03-21 18:17:38.829][753][critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::ThreadLocal::InstanceImpl::SlotImpl::
SlotImpl() [0x315310d]__optional_destruct_base() [0x60d9934][2023-03-21 18:17:38.838][753][critical][backtrace] [./source/server/backtrace.h:96] #4: std::__1::__shared_ptr_emplace<>::__on_zero_shared() [0x709c8d7]
[2023-03-21 18:17:38.846][753][critical][backtrace] [./source/server/backtrace.h:96] #5: std::__1::__function::__func<>::destroy() [0x709d1f1]
[2023-03-21 18:17:38.855][753][critical][backtrace] [./source/server/backtrace.h:96] #6: std::__1::__optional_destruct_base<>::
[2023-03-21 18:17:38.863][753][critical][backtrace] [./source/server/backtrace.h:96] #7: Envoy::Filter::DynamicFilterConfigProviderImpl<>::ThreadLocalConfig::~ThreadLocalConfig() [0x60db108]
[2023-03-21 18:17:38.871][753][critical][backtrace] [./source/server/backtrace.h:96] #8: std::__1::__shared_ptr_emplace<>::__on_zero_shared() [0x60db01e]
[2023-03-21 18:17:38.880][753][critical][backtrace] [./source/server/backtrace.h:96] #9: std::__1::__function::__func<>::operator()() [0x315081f]
[2023-03-21 18:17:38.888][753][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x72fca0c]
[2023-03-21 18:17:38.897][753][critical][backtrace] [./source/server/backtrace.h:96] #11: std::__1::__function::__func<>::operator()() [0x730764d]
[2023-03-21 18:17:38.905][753][critical][backtrace] [./source/server/backtrace.h:96] #12: Envoy::Event::SchedulableCallbackImpl::SchedulableCallbackImpl()::$_0::__invoke() [0x7a28609]
[2023-03-21 18:17:38.913][753][critical][backtrace] [./source/server/backtrace.h:96] #13: event_process_active_single_queue [0x8569a63]
[2023-03-21 18:17:38.922][753][critical][backtrace] [./source/server/backtrace.h:96] #14: event_base_loop [0x8557152]
[2023-03-21 18:17:38.930][753][critical][backtrace] [./source/server/backtrace.h:96] #15: Envoy::Event::LibeventScheduler::run() [0x7a25c4e]
[2023-03-21 18:17:38.938][753][critical][backtrace] [./source/server/backtrace.h:96] #16: Envoy::Event::DispatcherImpl::run() [0x72fc611]
[2023-03-21 18:17:38.947][753][critical][backtrace] [./source/server/backtrace.h:96] #17: Envoy::Server::WorkerImpl::threadRoutine() [0x3299d52]
[2023-03-21 18:17:38.955][753][critical][backtrace] [./source/server/backtrace.h:96] #18: std::__1::__function::__func<>::operator()() [0x329ffb9]
[2023-03-21 18:17:38.964][753][critical][backtrace] [./source/server/backtrace.h:96] #19: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x8518f5c]
[2023-03-21 18:17:38.964][753][critical][backtrace] [./source/server/backtrace.h:96] #20: start_thread [0x7f148e39a609]
The text was updated successfully, but these errors were encountered: