CDS: destroy cluster info on master thread #14089
Conversation
Signed-off-by: Yuchen Dai <[email protected]>
CC @mattklein123 for early comment.
```cpp
[&dispatcher](const ClusterInfoImpl* self) {
  FANCY_LOG(debug, "lambdai: schedule destroy cluster info {} on this thread", self->name());
  if (!dispatcher.tryPost([self]() {
        // TODO(lambdai): There is still a risk that the master dispatcher receives the
        // closure but never executes it during shutdown. We can either
        // 1) introduce folly::function, which supports unique_ptr captures, and destroy
        //    the cluster info by RAII, or
        // 2) run the posted callbacks on the master thread once no worker can post back.
        FANCY_LOG(debug, "lambdai: execute destroy cluster info {} on this thread. Master thread is expected.", self->name());
        delete self;
      })) {
    FANCY_LOG(debug, "lambdai: cannot post. Has the master thread exited? Executing destroy cluster info {} on this thread.", self->name());
    delete self;
  }
});
```
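As context for option 1 in the TODO, here is a minimal sketch of a move-only destroy task that owns the object via `unique_ptr`, so the cluster info is freed even if the queue is dropped before the task runs. This is illustrative only: `absl::AnyInvocable` stands in for `folly::function`, and `makeDestroyTask` plus the stand-in struct are hypothetical, not code from this PR.

```cpp
#include <iostream>
#include <memory>

#include "absl/functional/any_invocable.h"

// Stand-in for Envoy's ClusterInfoImpl, for this sketch only.
struct ClusterInfoImpl {
  ~ClusterInfoImpl() { std::cout << "cluster info destroyed\n"; }
};

// A move-only task that owns the object it will delete. If the dispatcher
// drops its pending-task queue during shutdown without executing the task,
// destroying the task object itself still frees the ClusterInfoImpl via
// unique_ptr (RAII); the destructor simply runs on whichever thread drops
// the queue.
absl::AnyInvocable<void()> makeDestroyTask(std::unique_ptr<ClusterInfoImpl> info) {
  return [info = std::move(info)]() mutable { info.reset(); };
}
```

Either way nothing leaks; the open question is only which thread the destructor runs on.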
Post the cluster info to the master thread via this deleter.
Also prevent a master-to-master post by returning false.
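A toy illustration of that contract, assuming the semantics described in these comments. This is not Envoy's DispatcherImpl; `tryPost` is the method this PR introduces, and every member name below is invented.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Toy dispatcher showing the two "return false" cases; illustrative only.
class ToyDispatcher {
public:
  explicit ToyDispatcher(std::thread::id loop_tid) : loop_tid_(loop_tid) {}

  // True when the caller is already on the dispatcher's own thread.
  bool isThreadSafe() const { return loop_tid_ == std::this_thread::get_id(); }

  // Refuses the cross-thread hop when (a) the caller is already on the
  // dispatcher thread (a master-to-master post) or (b) the loop has exited;
  // the caller then runs the closure inline rather than losing it.
  bool tryPost(std::function<void()> cb) {
    if (isThreadSafe()) {
      return false;  // master-to-master: caller should run it inline
    }
    std::lock_guard<std::mutex> lock(mu_);
    if (exited_) {
      return false;  // loop already stopped; a post would never run
    }
    queue_.push_back(std::move(cb));  // the loop thread drains this queue
    return true;
  }

  void exit() {
    std::lock_guard<std::mutex> lock(mu_);
    exited_ = true;
  }

private:
  const std::thread::id loop_tid_;  // thread running the event loop
  std::mutex mu_;
  bool exited_{false};  // set once the loop stops dispatching
  std::deque<std::function<void()>> queue_;
};
```

Returning false for the on-thread case lets the caller fall through to the inline `delete self`, the same branch the diff above uses when the post fails.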
At a high level I'm not crazy about this, but maybe it's the only way to fix this issue. Did you look at what it would take to fix the actual issue of why ClusterInfo is storing such complex information that needs to be deleted on the main thread? Can we decouple that somehow? IMO as I mentioned in the linked issue I think there is stuff in ClusterInfo that shouldn't be there?
If we do stick with this approach, I left a few comments, and this also needs a main merge. Thank you!
/wait
```cpp
FANCY_LOG(debug, "lambdai: cannot post. Has the master thread exited? Executing destroy cluster info {} on this thread.", self->name());
delete self;
```
How can this happen? All workers should shut down and join before the main thread finishes running. Even if things are cleaned up after the join, it should be possible to delete everything on the main thread. Perhaps in this case there needs to be some other cleanup/execution queue for posts that should be run even after the main thread dispatcher has exited?
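One possible shape for such a queue, purely as a sketch; `ShutdownCleanupQueue` and its methods are invented names, not an Envoy API.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical post-exit cleanup queue: closures posted after the main
// dispatcher stops looping are parked here and drained synchronously by the
// main thread as a final shutdown step, so deletions still happen on the
// main thread instead of falling back to `delete self` on a worker.
class ShutdownCleanupQueue {
public:
  void push(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mu_);
    tasks_.push_back(std::move(task));
  }

  // Called once on the main thread, after all workers have joined.
  void drain() {
    std::vector<std::function<void()>> tasks;
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks.swap(tasks_);
    }
    for (auto& t : tasks) {
      t();
    }
  }

private:
  std::mutex mu_;
  std::vector<std::function<void()>> tasks_;
};
```

Workers would push their destroy closures here when `tryPost` fails, and the main thread would drain the queue once after joining them.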
Yes, that's possible. I just want to raise the concern here that we need some extra shutdown steps.
Update: The current order is
- master stops dispatching
- TLS shutdown
- workers stop

The master must refuse the clusterInfo destroy closure before TLS shutdown (step 2); otherwise the clusterInfo may trigger other TLS ops and break TLS. Alternatively, we could run some cleanup in a master queue and disable TLS during that cleanup, as sketched below.
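A sketch of that alternative, with the ordering above made explicit. All types and function names here are placeholders; only `shutdownGlobalThreading` resembles a real Envoy `ThreadLocal` call.

```cpp
// Placeholder interfaces for this sketch only; none of these are Envoy APIs.
struct MasterDispatcher { void stopAcceptingPosts() {} };
struct TlsInstance { void shutdownGlobalThreading() {} };
struct WorkerManager { void stopAll() {} };
struct CleanupQueue { void drain() {} };  // e.g. the queue sketched earlier

// The ordering argued for above: refuse new destroy closures first, then
// shut down TLS, then join the workers, and only then run any parked
// deletions, with TLS already disabled so they cannot touch TLS slots.
void shutdownMaster(MasterDispatcher& master, TlsInstance& tls,
                    WorkerManager& workers, CleanupQueue& cleanup) {
  master.stopAcceptingPosts();    // step 1: master stops dispatching posts
  tls.shutdownGlobalThreading();  // step 2: TLS shutdown
  workers.stopAll();              // step 3: workers stop and join
  cleanup.drain();                // deferred deletions run last
}
```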
Edit: add const std::map<std::string, ProtocolOptionsConfigConstSharedPtr> extension_protocol_options_;
filter_factories and factory_context_ (the underlying transport socket) are extended by filter implementations or customized clusters. I don't have the full context, but it seems the listener guarantees these dependencies are destroyed on the master thread.
Can you time-box digging into the actual underlying problem of the transport socket sharing and whether we can decouple that somehow?
I did some homework last night and I can add my comments in #13209.
Signed-off-by: Yuchen Dai <[email protected]>
Fixing regression.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Not ready to commit. Just a strawman.
The goal is to destroy the cluster info on the master thread by posting to the master dispatcher.
See some issues:
Fix #13209