Allow concurrent calls to agentClose #1533

Merged (1 commit) on Nov 2, 2016

Conversation

@aboch (Contributor) commented Nov 1, 2016

  • This fixes a panic in memberlist.Leave(), which occurs when it is called
    after memberlist.shutdown has already been set to true.
    It happens because of two overlapping calls to NetworkDB.clusterLeave(),
    and it is easily reproducible with two back-to-back runs of
    docker swarm init && docker swarm leave --force.
    While the first clusterLeave() is waiting for sendNodeEvent(NodeEventTypeLeave)
    to time out (5 sec), a second clusterLeave() is called. The second clusterLeave()
    ends up invoking memberlist.Leave() after the previous call has already done
    the same, that is, after memberlist.shutdown was set to true.
  • The fix is to have agentClose() acquire the agent instance and reset the
    agent pointer right away, under lock, and only then execute the
    closing/leave functions on the acquired agent instance (see the sketch
    after this list).
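
A minimal sketch of this pattern, using made-up controller/agent types rather than the actual libnetwork code: the shared agent pointer is swapped out while the lock is held, and the slow close runs afterwards, so a second concurrent caller observes nil and returns instead of closing the same agent (and its memberlist) twice.

```go
// Hypothetical, simplified types standing in for libnetwork's controller and
// agent; this is an illustration of the locking pattern, not the real code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type agent struct{ name string }

// close stands in for the slow networkDB leave/close sequence, which can block
// for several seconds while the node-leave event is being broadcast.
func (a *agent) close() {
	time.Sleep(100 * time.Millisecond)
	fmt.Println("closed agent", a.name)
}

type controller struct {
	sync.Mutex
	agent *agent
}

// agentClose grabs the current agent and resets the shared pointer while
// holding the lock, then runs the slow close outside the lock. A concurrent
// second call finds the pointer already nil and returns immediately.
func (c *controller) agentClose() {
	c.Lock()
	a := c.agent
	c.agent = nil
	c.Unlock()

	if a == nil {
		return // another caller is already closing (or has closed) the agent
	}
	a.close()
}

func main() {
	c := &controller{agent: &agent{name: "a1"}}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.agentClose() // only one of the two calls actually closes the agent
		}()
	}
	wg.Wait()
}
```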

This is exposed by running the recent docker-py integration tests on docker master, and it blocks moby/moby#26880 from getting a successful Jenkins run.

Also related to docker/for-mac#849

To reproduce

$ docker swarm init && docker swarm leave --force &&  docker swarm init && docker swarm leave --force 

Result:

INFO[0039] Stopping manager                             
INFO[0039] Manager shut down                            
INFO[0039] Listening for connections                     addr="[::]:2377" proto=tcp
INFO[0039] Listening for local connections               addr="/var/run/docker/swarm/control.sock" proto=unix
INFO[0039] 785457d99e225831 became follower at term 0   
INFO[0039] newRaft 785457d99e225831 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0] 
INFO[0039] 785457d99e225831 became follower at term 1   
INFO[0039] 785457d99e225831 is starting a new election at term 1 
INFO[0039] 785457d99e225831 became candidate at term 2  
INFO[0039] 785457d99e225831 received vote from 785457d99e225831 at term 2 
INFO[0039] 785457d99e225831 became leader at term 2     
INFO[0039] raft.node: 785457d99e225831 elected leader 785457d99e225831 at term 2 
INFO[0000] Firewalld running: false                     
WARN[0040] secrets update ignored; executor does not support secrets  module="node/agent"
INFO[0040] Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=192.168.100.177 Adv-addr=192.168.100.177 Remote-addr = 
INFO[0040] Stopping manager                             
INFO[0040] shuting down certificate renewal routine      module="node/tls" node.id=v0oq5u0c2dy61sup800ahtxvd node.role=swarm-manager
INFO[0040] Manager shut down                            
INFO[0040] No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4] 
INFO[0040] IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844] 
INFO[0000] Firewalld running: false                     
ERRO[0044] failed to send node leave: timed out broadcasting node event 
panic: leave after shutdown

goroutine 355 [running]:
panic(0x15ef4a0, 0xc4213e03a0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/memberlist.(*Memberlist).Leave(0xc4200a4540, 0x3b9aca00, 0x1, 0x15ef4a0)
	/go/src/github.com/docker/docker/vendor/src/github.com/hashicorp/memberlist/memberlist.go:578 +0x42e
github.com/docker/libnetwork/networkdb.(*NetworkDB).clusterLeave(0xc421af0f00, 0xc421887e00, 0xc94658)
	/go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/networkdb/cluster.go:212 +0x24a
github.com/docker/libnetwork/networkdb.(*NetworkDB).Close(0xc421af0f00)
	/go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/networkdb/networkdb.go:198 +0x2f
github.com/docker/libnetwork.(*controller).agentClose(0xc4204e4000)
	/go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/agent.go:345 +0x129
github.com/docker/libnetwork.(*controller).clusterAgentInit(0xc4204e4000)
	/go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/controller.go:333 +0x250
created by github.com/docker/libnetwork.(*controller).SetClusterProvider
	/go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/controller.go:243 +0xe5

Signed-off-by: Alessandro Boch <[email protected]>

@aboch (Contributor, Author) commented Nov 1, 2016

ping @mrjana

@mrjana (Contributor) commented Nov 2, 2016

LGTM
