embed server never becomes ready / can't be stopped in certain start failure conditions #9533
Comments
Hmm, that channel should've been closed when starting embedded etcd failed?
Yes, I also have that case handled in a separate goroutine that starts after etcd has successfully started and become ready. I haven't verified whether etcd pushes anything there in the case above, but the reaction is about the same: call
To be extra clear, here is an 'easy' reproduction case:
Is this removing the member that was started as an embed server in step 3?
Yes, that is a pretty rare/contrived case. However, from what I have seen, it could maybe also be reproduced by injecting a failure into the snapshot transmission (e.g. something happens to the member transmitting the snapshot, or to the network in between). I could try to do this if that is of any interest to you, although it feels like it is just about a channel not being closed properly. The point is that, although a human eye can notice the start failure in the logs, and a machine could potentially infer it from a longer-than-expected start, recovering from such a situation (i.e. etcd started, but was never able to become ready, for whatever reason) is impossible without panicking / exiting the whole wrapping application, as Stop() is not able to stop the goroutines / free the ports. By the way, I just added
@Quentin-M Got it. We will try to reproduce. Thanks!
Sorry for the delay. Just FYI, I am seeing similar blocking behaviors in other sets of testing.
@gyuho Thank you so much for looking into it, I don't think there's any urgency from anyone at the moment anyways!
I'm seeing the same issue in 3.5.9. My case seems to be very similar: I have a cluster with 3 nodes, and while a node is stopped it is removed from the cluster. When I then try to start the node, the start does not return an error, but the server stops on its own. After that I can not invoke
The issue seems to be:
A fix might be to wait for Ready and Stopping / Stopped using a select, and in the stop case simply close the channel and return.
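The proposed fix can be sketched with plain channels standing in for the server's ready, stopping, and stopped notification channels. All names here are hypothetical illustrations of the select pattern, not the actual patch in the linked PR:

```go
package main

import "fmt"

// waitReady waits for the server to become ready, but also watches the
// stopping/stopped channels so a server that dies during startup does
// not leave the caller blocked forever. In the stop case it closes
// done so any other waiters are released, then returns.
func waitReady(ready, stopping, stopped, done chan struct{}) bool {
	select {
	case <-ready:
		return true
	case <-stopping:
		close(done)
		return false
	case <-stopped:
		close(done)
		return false
	}
}

func main() {
	ready := make(chan struct{})
	stopping := make(chan struct{})
	stopped := make(chan struct{})
	done := make(chan struct{})

	// Simulate the failure mode: the server stops before ever
	// signalling readiness.
	close(stopping)

	if !waitReady(ready, stopping, stopped, done) {
		fmt.Println("server stopped before becoming ready")
	}
	<-done // done was closed by waitReady, so this no longer blocks
}
```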
It looks like a valid fix. #16754. Please feel free to continue to work on top of the PR if you are interested. |
Signed-off-by: Hendrik Haddorp <[email protected]>
Hi guys,
Please find below a simple excerpt where we try to start an embed server, wait for it to be ready up to a certain point or stop it otherwise.
When we set up an embed server to join an existing cluster, it may take some time for the server to become ready, as it needs to sync data. If anything happens during the transmission of the snapshot, or if the member is removed in the meantime, the etcdserver will notice it, and rafthttp will stop the peers (see below) and idle. The time allocated for etcd to start then runs out and the context is cancelled (or we read errors from the error channel). We therefore try to Close() the server so etcd can be stopped and the ports freed, letting us retry or move on without leaving zombie etcd goroutines/servers around.
However, Close() is not too happy about it and stalls. From my debugging, this line right here is the one blocking. I am not exactly sure of the lifecycle of this channel.
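The blocking behaviour boils down to a receive on a channel that is never closed. A minimal stdlib demonstration of just the channel semantics involved, unrelated to etcd's internals:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan struct{})

	// A receive on an open, empty channel blocks until a value is
	// sent or the channel is closed.
	select {
	case <-ch:
		fmt.Println("received") // never happens here
	case <-time.After(50 * time.Millisecond):
		fmt.Println("still blocked after 50ms")
	}

	// Closing the channel releases every pending and future receive.
	close(ch)
	<-ch // returns immediately with the zero value
	fmt.Println("unblocked after close")
}
```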
I would be happy to just attempt closing the Peers / Listeners manually (as they are exported), though I haven't tried yet and I am not sure that would be enough anyway; in any case, the Metrics server is not exported and thus cannot be closed.
Do you have any suggestions?