Creating thread local cluster with LoadBalancerType::OriginalDst can't find its ClusterData in active_clusters_ and crash #7500
I'm not 100% sure who could best look into this.
I added a simple test case to test/integration_test.cc to reproduce this issue.
I'm trying to make a patch to fix this issue in a reasonable way (I would try defensive code first). If you have any suggestions, please kindly let me know ;)
I'm working on fixing this issue. The basic idea is to have the thread local cluster creation callback capture the ClusterSharedPtr directly. In this way, it doesn't need to access active_clusters_ from the worker thread.
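Roughly, a simplified sketch of the idea (not the actual Envoy code; `makeCallbackBefore`/`makeCallbackAfter` and the bare `Cluster` type are hypothetical stand-ins):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct Cluster {};
using ClusterSharedPtr = std::shared_ptr<Cluster>;

// Before: the callback posted to worker threads looks the cluster up in
// active_clusters_, which is owned and mutated by the main thread.
std::function<void()> makeCallbackBefore(std::map<std::string, ClusterSharedPtr>& active_clusters,
                                         const std::string& name) {
  return [&active_clusters, name] {
    // Throws std::out_of_range if the cluster was removed in the meantime.
    ClusterSharedPtr cluster = active_clusters.at(name);
    // ... construct the thread local ClusterEntry from `cluster` ...
  };
}

// After: capture the ClusterSharedPtr itself. The shared_ptr keeps the
// cluster alive for the callback, and active_clusters_ is never touched
// from the worker thread.
std::function<void()> makeCallbackAfter(ClusterSharedPtr cluster) {
  return [cluster] {
    // ... construct the thread local ClusterEntry from `cluster` ...
  };
}
```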
I don't think ^ is the correct fix, as the cluster is not generally thread safe; only the info is. Do you have a summary of what the issue is?
I think the analysis above is correct: the access to active_clusters_ from a worker thread is the problem. I checked for other cases of improper access and found that the constructor of ThreadLocalClusterManagerImpl::ClusterEntry is the one reading active_clusters_ from a worker thread. So the problem is improper access to active_clusters_ from worker threads.
thanks @jrajahalme for the summary. Looks like adding protection to active_clusters_ is needed. Considering ClusterSharedPtr is a shared pointer (its reference counting is thread safe), is just reading it from active_clusters_ on a worker thread really unsafe?
STL map operations are not thread safe for concurrent reads and writes, so this is a more general concurrent access problem. For all we know, the worker thread may crash just reading from the map! In more detail, a worker thread may be reading from a map that is in an inconsistent state, e.g., partially updated by the main thread in the middle of an insert operation, maybe reallocating the map to make more space.
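For illustration only (a general C++ pattern, not how Envoy ends up fixing this), sharing one map across threads would require external synchronization, e.g.:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Cluster {};
using ClusterSharedPtr = std::shared_ptr<Cluster>;

// Without external synchronization, one thread calling insert()/erase()
// while another calls at()/find() on the same std::map is a data race
// (undefined behavior), even though copying the shared_ptr value itself
// is thread safe.
class GuardedClusterMap {
public:
  void add(const std::string& name, ClusterSharedPtr c) {
    std::lock_guard<std::mutex> lock(mu_);
    clusters_[name] = std::move(c);
  }

  // Returns nullptr instead of throwing when the cluster is gone.
  ClusterSharedPtr find(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = clusters_.find(name);
    return it == clusters_.end() ? nullptr : it->second;
  }

private:
  mutable std::mutex mu_;
  std::map<std::string, ClusterSharedPtr> clusters_;
};
```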
Sorry for the confusion. I meant the callback can capture the ClusterSharedPtr. Here is a patch for fixing this issue: l8huang@08f9c01 Let me know if this fix is ok.
If an original_dst cluster gets deleted right after it is created, its creation callback on a worker thread may not find its ClusterData in ClusterManagerImpl::active_clusters_, and Envoy crashes with a std::out_of_range exception because it looks up its ClusterData in ClusterManagerImpl::active_clusters_ via std::map::at(). This change makes the thread local cluster creation callback capture the ClusterSharedPtr directly and use it to create the OriginalDstCluster, which holds it with a weak pointer. In this way, the worker thread doesn't need to access ClusterManagerImpl::active_clusters_ at all. Testing: added a CDS integration test that creates/deletes an original_dst cluster repeatedly. Fixes issue envoyproxy#7500 Signed-off-by: lhuang8 <[email protected]>
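A simplified sketch of that ownership pattern (type names are illustrative, not the actual Envoy classes): the creation callback captures the shared_ptr, while the object living on the worker thread keeps only a weak_ptr and locks it on use:

```cpp
#include <memory>

struct OriginalDstCluster {};
using OriginalDstClusterSharedPtr = std::shared_ptr<OriginalDstCluster>;

// Lives on a worker thread; does not extend the cluster's lifetime.
class WorkerLoadBalancer {
public:
  explicit WorkerLoadBalancer(const OriginalDstClusterSharedPtr& cluster)
      : cluster_(cluster) {}

  void pickHost() {
    // lock() returns nullptr if the main thread has already dropped the
    // cluster, so the worker degrades gracefully instead of crashing.
    if (OriginalDstClusterSharedPtr cluster = cluster_.lock()) {
      // ... use `cluster` ...
    }
  }

private:
  std::weak_ptr<OriginalDstCluster> cluster_;
};
```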
I will take a look at this in more detail tomorrow and advise on the fix.
@l8huang Your fix looks good to me, and resolves the bug you found. I left a couple of comments to remind you to clean up some testing/debugging related changes you still have in there.
Use ThreadAwareLoadBalancer in OriginalDstCluster to pass OriginalDstClusterSharedPtr to LoadBalancer. Fixes: envoyproxy#7500 Signed-off-by: Jarno Rajahalme <[email protected]>
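The shape of that approach, as a hedged sketch (interfaces simplified; Envoy's real ThreadAwareLoadBalancer API differs): a factory built on the main thread captures the cluster shared_ptr, and each worker builds its per-thread load balancer through the factory instead of reading active_clusters_:

```cpp
#include <memory>

struct OriginalDstCluster {};
using OriginalDstClusterSharedPtr = std::shared_ptr<OriginalDstCluster>;

struct LoadBalancer {
  virtual ~LoadBalancer() = default;
};

// Simplified stand-in for a thread-aware load balancer factory: created
// once on the main thread, holding the cluster; each worker thread then
// gets its own LoadBalancer from it, so no worker touches active_clusters_.
class LoadBalancerFactory {
public:
  explicit LoadBalancerFactory(OriginalDstClusterSharedPtr cluster)
      : cluster_(std::move(cluster)) {}

  std::unique_ptr<LoadBalancer> create() {
    return std::make_unique<WorkerLb>(cluster_);
  }

private:
  // The per-worker LB holds only a weak_ptr, as in the earlier sketch.
  struct WorkerLb : LoadBalancer {
    explicit WorkerLb(const OriginalDstClusterSharedPtr& cluster) : cluster_(cluster) {}
    std::weak_ptr<OriginalDstCluster> cluster_;
  };

  OriginalDstClusterSharedPtr cluster_;
};
```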
Title: Creating thread local cluster with LoadBalancerType::OriginalDst can't find its ClusterData in active_clusters_ and crash

Description:
In our k8s+Istio environment, Envoy sometimes crashes when constructing a ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry whose type is LoadBalancerType::OriginalDst. It tried to look up its ClusterData in active_clusters_ via std::map's at() method, but std::out_of_range was thrown because the element does not exist.
Looks like in the normal case this couldn't happen:
- CdsApiImpl::onConfigUpdate() makes to_add_repeated and to_remove_repeated mutually exclusive.
- ClusterManagerImpl::addOrUpdateCluster() makes sure active_clusters_ has the needed ClusterData.
So this might be a concurrency issue. I suspect there were 2 config updates: the first one created a cluster, and the second one removed it before ThreadLocalClusterManagerImpl::ClusterEntry got constructed on the worker threads.
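That ordering can be reproduced deterministically in miniature; a hypothetical sketch using a plain queue in place of Envoy's worker dispatcher:

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <stdexcept>
#include <string>

int main() {
  std::map<std::string, int> active_clusters;     // stand-in for active_clusters_
  std::queue<std::function<void()>> worker_queue; // stand-in for posting work to a worker

  // Config update 1: add the cluster and post ClusterEntry creation to the worker.
  active_clusters.emplace("original_dst_cluster", 1);
  worker_queue.push([&active_clusters] {
    active_clusters.at("original_dst_cluster"); // the lookup the worker performs
  });

  // Config update 2: remove the cluster before the worker runs the callback.
  active_clusters.erase("original_dst_cluster");

  // The worker finally runs the creation callback.
  try {
    worker_queue.front()();
  } catch (const std::out_of_range&) {
    std::cout << "std::out_of_range: same failure mode as the crash\n";
  }
  return 0;
}
```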
The Envoy version is:
Considering Envoy guarantees eventual consistency, I guess the solution could be adding defensive code to handle the exception gracefully instead of crashing. But from a design point of view, do you have any rule such as "worker threads should not access active_clusters_ directly"? I just want to figure out the suitable way to resolve this kind of issue ;) If any similar issue was fixed before, could you please kindly let me know the PR or issue number?
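For reference, the defensive variant would just mean switching the lookup from a throwing accessor to one that can report absence; a minimal illustration:

```cpp
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main() {
  std::map<std::string, int> active_clusters;

  // std::map::at() throws std::out_of_range for a missing key...
  try {
    active_clusters.at("missing");
  } catch (const std::out_of_range&) {
    std::cout << "at() threw std::out_of_range\n";
  }

  // ...whereas find() lets the caller handle absence gracefully.
  auto it = active_clusters.find("missing");
  if (it == active_clusters.end()) {
    std::cout << "cluster already removed; skip creating the ClusterEntry\n";
  }
  return 0;
}
```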
Repro steps:
We don't have a stable method to reproduce this issue right now; some cluster info related to the crash is:
Call Stack: