
Probe reporter stuck #1576

Closed
2opremio opened this issue Jun 9, 2016 · 11 comments
Labels
bug Broken end user or developer functionality; not working as the developers intended it

2opremio commented Jun 9, 2016

The user (blaines at #weave-users Slack channel) is running Scope version 0.14 and the report publisher in one of his probes seems to be stuck.

Full context: https://weaveworks.slack.com/archives/weave-users/p1465510035000191

Scope probe logs: https://gist.github.com/blaines/db726bf232c411a863a5faab74f24b56
Scope probe goroutine dump: https://gist.github.com/2opremio/1fe38671faf74c59b248aba03ed40051

@2opremio 2opremio added the bug Broken end user or developer functionality; not working as the developers intended it label Jun 9, 2016
@2opremio 2opremio added this to the 0.16.0 milestone Jun 9, 2016

2opremio commented Jun 9, 2016

Related? #881


2opremio commented Jun 9, 2016

After a first look at the goroutine dump, it seems like the deadlock is happening at the multiclient's mutex (multi_client.go)


blaines commented Jun 9, 2016

Nice! Thanks for looking into it.


2opremio commented Jun 9, 2016

Yep, that's the problem: we are double-locking a mutex:

goroutine 42 [semacquire, 4132 minutes]:
sync.runtime_Semacquire(0xc82212309c)
    /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc822123090)
    /usr/local/go/src/sync/waitgroup.go:127 +0xb4
github.com/weaveworks/scope/probe/appclient.(*appClient).Stop(0xc822122fc0)
    /go/src/github.com/weaveworks/scope/probe/appclient/app_client.go:137 +0x1ac
github.com/weaveworks/scope/probe/appclient.(*multiClient).Set(0xc82032b320, 0xc821a736e0, 0x16, 0xc821a501c0, 0xe, 0xe)
    /go/src/github.com/weaveworks/scope/probe/appclient/multi_client.go:119 +0x76a

(*multiClient).Set() locks the mutex and invokes (*multiClient).Stop() which tries to lock it again.

All of 0.14, 0.15, and master are affected. The bug has been present since at least ffa955a


2opremio commented Jun 9, 2016

Having #943 would have made the diagnosis much easier


2opremio commented Jun 9, 2016

@blaines Thank you for your patience and help!


2opremio commented Jun 10, 2016

I may have spoken too soon. The multiclient invokes Stop() on each individual client, not on itself. I will look at it more carefully.

2opremio commented

My initial analysis was wrong; now I think there's something wrong with the goroutine wait when stopping a client. https://github.com/weaveworks/scope/blob/v0.14.0/probe/appclient/app_client.go#L137

goroutine 42 [semacquire, 4132 minutes]:
sync.runtime_Semacquire(0xc82212309c)
    /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc822123090)
    /usr/local/go/src/sync/waitgroup.go:127 +0xb4
github.com/weaveworks/scope/probe/appclient.(*appClient).Stop(0xc822122fc0)
    /go/src/github.com/weaveworks/scope/probe/appclient/app_client.go:137 +0x1ac
github.com/weaveworks/scope/probe/appclient.(*multiClient).Set(0xc82032b320, 0xc821a736e0, 0x16, 0xc821a501c0, 0xe, 0xe)
    /go/src/github.com/weaveworks/scope/probe/appclient/multi_client.go:119 +0x76a

Maybe the client gets stuck here https://github.com/weaveworks/scope/blob/v0.14.0/probe/appclient/app_client.go#L170 and never releases its goroutine.


2opremio commented Jun 13, 2016

This is what I believe happened:

  1. A probe HTTP client got stuck in a publish request to an app (most probably because the Spot instance where the app was running went down, see https://weaveworks.slack.com/archives/weave-users/p1465513276000257 )
  2. The multiclient rightfully tried to Stop() the client for the app on that host, for which it had acquired its mutex.
  3. The invocation of Stop() blocked forever, since it involves waiting for goroutines to end and one of them was stuck in (1). Because the multiclient took its lock in (2), no reports could be published anymore.

To solve this:

  1. We should definitely set a timeout for publish requests in app_client, since an app not responding shouldn't block the clients to the other apps. However, if this really happened due to (1), I don't understand why the Go HTTP client didn't return; it beats me.
  2. Maybe we should go as far as using net/context to cancel the latest request, or saving the latest request for cancellation with Request.Cancel.
  3. We should make sure this cannot happen with the control/pipe connections either.


2opremio commented Jun 13, 2016

After a conversation with @tomwilkie, looking at the app_client pointer (0xc822122fc0) confirms my theory: the goroutine calling Stop() and the goroutine calling publish() are accessing the same client.

Now the question is: why didn't the HTTP request in publish() return once its connection was closed by Stop()?


2opremio commented Jun 13, 2016

Now the question is: why didn't the HTTP request in publish() return once its connection was closed by Stop()?

Because the HTTP client's connection is never registered; only the pipe and control connections are :)
