Querying /health causes traefik to crash eventually #1013

Closed
zinob opened this issue Jan 3, 2017 · 5 comments

zinob commented Jan 3, 2017

What version of Traefik are you using (traefik version)?

Tested on 1.0.2 and 1.1.2.

What is your environment & configuration (arguments, toml...)?

Started using docker-compose, with a docker-compose.yml somewhat like:

version: '2'
services:
  net-traefik:
    image: traefik
    restart: "always"
    command: --web --docker --docker.domain=example.com --logLevel=DEBUG
    ports:
      - "80:80"
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /dev/null:/traefik.toml
  net-server:
    labels:
      - "traefik.backend=server"
      - "traefik.port=80"
      - "traefik.frontend.rule=PathPrefix:/rest"
  net-frontend:
    labels:
      - "traefik.backend=frontend"
      - "traefik.port=80"
      - "traefik.frontend.rule=PathPrefix:/"

What did you do?

I discovered this because Traefik stopped responding while I was using collectd's curl_json plugin to gather usage statistics from 127.0.0.1:8080/health.
In that setup it hangs after about 10 days, or roughly 14,400 queries.
It can be provoked in a few minutes on my laptop with:
yes http://127.0.0.1:8080/health | xargs curl -s > /dev/null
Strangely, it then takes roughly 300k-700k queries. No excessive RAM usage was detected, and this does not happen if I request http://127.0.0.1:8080/ or http://127.0.0.1/foo.
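
(For anyone scripting the reproduction: below is a rough Go equivalent of the shell loop, a sketch that assumes the dashboard/API is on 127.0.0.1:8080 as in the compose file above. It drains each response, counts completed requests, and aborts with the count as soon as a request times out, which makes the point where Traefik freezes easy to spot.)

// Hypothetical reproduction helper, not part of the original report: a Go
// equivalent of `yes ... | xargs curl -s`, assuming Traefik's dashboard/API
// is reachable on 127.0.0.1:8080. It stops and prints the request count as
// soon as a request exceeds the timeout, i.e. when Traefik appears to hang.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 1; ; i++ {
		resp, err := client.Get("http://127.0.0.1:8080/health")
		if err != nil {
			log.Fatalf("request %d failed or timed out (Traefik likely hung): %v", i, err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
		if i%10000 == 0 {
			fmt.Println("requests completed:", i)
		}
	}
}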

What did you expect to see?

A JSON answer.

What did you see instead?

A stuck socket and a complete failure to serve web pages.

@joonas-fi

I think I encountered this as well, just now.

Everything had been running completely fine. The very moment I opened the health page:

  • The graphs were empty and not populating.
  • Any services behind Traefik were not loading in my browser. At first I wondered if my Internet was down.
  • Nope, my Internet was not at fault: my phone started going off because my monitoring immediately started seeing timeouts.

I'm running d15a17b6347b867014c37f4bdb40664e85733d77 / cheddar.

There was nothing in the logs.

I don't think it's simply a "stuck socket", since the admin UI serving the /health endpoint is on a totally different port.

Now I am definitely not opening the "health" page in production again until this is fixed.

@joonas-fi

This seems like a regression. This was fixed in #458 but it came back.

That issue says it was fixed by setting the vendor pointer to thoas/stats@152b5d0 (Jul 26, 2016). However, current master points to thoas/stats@79b768f (Jul 18, 2016, i.e. older, with the bug). All vendor pointers are stored in /glide.lock.

Why was it downgraded?

The downgrade traces back to commit dacde21, with the message "Fix glide.yml go-units", but it changes many more dependency versions than just go-units. Was this an error, @emilevauge?

@emilevauge emilevauge added bug priority/P0 needs hot fix labels Feb 7, 2017
@emilevauge emilevauge added this to the 1.2 milestone Feb 7, 2017
@joonas-fi

I managed to replicate the error with @zinob's method. BTW, I love the clever use of yes! :)

You can also pipe it through pv to watch the response rate interactively (and thereby notice when Traefik freezes):

$ yes http://127.0.0.1:8080/health | xargs curl -s | pv > /dev/null

Once Traefik hangs, I run $ kill -3 $TRAEFIK_PID and inspect the logs:

$ cat traefik-stderr.log | grep 'goroutine ' | wc -l
745

$ cat traefik-stderr.log
... search for something interesting ...
goroutine 18 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0xc4203080c4)
        /usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0xc4203080c0)
        /usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0xc4203080c0)
        /usr/local/go/src/sync/rwmutex.go:86 +0x31
github.com/containous/traefik/vendor/github.com/thoas/stats.(*Stats).ResetResponseCounts(0xc4203080c0)
        /go/src/github.com/containous/traefik/vendor/github.com/thoas/stats/stats.go:44 +0x31
github.com/containous/traefik/vendor/github.com/thoas/stats.New.func1(0xc4203080c0)
        /go/src/github.com/containous/traefik/vendor/github.com/thoas/stats/stats.go:33 +0x2b
created by github.com/containous/traefik/vendor/github.com/thoas/stats.New
        /go/src/github.com/containous/traefik/vendor/github.com/thoas/stats/stats.go:37 +0x169

$ grep 0xc4203080cc traefik-stderr.log
sync.runtime_Semacquire(0xc4203080cc)
sync.runtime_Semacquire(0xc4203080cc)
sync.runtime_Semacquire(0xc4203080cc)
... repeat 178 lines ...

# a huge number of goroutines blocked acquiring that same semaphore, indicating a deadlock I guess:
$ grep 0xc4203080cc traefik-stderr.log | wc -l
181

Extra reading on debugging deadlocks in Go: https://lincolnloop.com/blog/lesson-learned-while-debugging-botbotme/
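
Reading the dump: the stats library's background reset goroutine (spawned by stats.New, stats.go:37) has been stuck for minutes trying to take the write lock, while ~180 other goroutines queue on the same RWMutex. Below is a minimal, self-contained Go sketch of that general reader/writer deadlock pattern. It is not the actual thoas/stats code; the nested RLock is a hypothetical stand-in for whatever keeps the readers from draining. The key property (documented on sync.RWMutex) is that a pending Lock() blocks later RLock() calls, so a goroutine that re-acquires the read lock while already holding it can wedge against the periodic resetter, and everything else then piles up behind the writer.

package main

import (
	"fmt"
	"sync"
	"time"
)

// statsLike mimics the shape visible in the stack trace: request handlers
// read the counters under a read lock, while a background goroutine
// periodically takes the write lock to reset them.
type statsLike struct {
	mu     sync.RWMutex
	counts map[int]int
}

func (s *statsLike) resetResponseCounts() {
	s.mu.Lock() // waits for all readers; meanwhile NEW readers are blocked too
	defer s.mu.Unlock()
	s.counts = map[int]int{}
}

func (s *statsLike) handleRequest() {
	s.mu.RLock()
	defer s.mu.RUnlock()
	// Hypothetical nested read lock. Harmless on its own, but if the
	// resetter queues its Lock() between the two RLock() calls, the inner
	// RLock() blocks behind the writer, the writer blocks behind the outer
	// RLock(), and nothing ever makes progress again.
	s.mu.RLock()
	defer s.mu.RUnlock()
	_ = s.counts[200]
}

func main() {
	s := &statsLike{counts: map[int]int{}}

	// Background resetter, analogous to the goroutine created by stats.New
	// in the dump.
	go func() {
		for {
			time.Sleep(time.Millisecond)
			s.resetResponseCounts()
		}
	}()

	// Hammer the "handler"; this usually wedges within seconds. In this toy
	// program the Go runtime then aborts with "all goroutines are asleep -
	// deadlock!"; in a real server other goroutines keep running, so the
	// process simply hangs, the way the kill -3 dump shows.
	for i := 1; ; i++ {
		s.handleRequest()
		if i%1000000 == 0 {
			fmt.Println("requests handled:", i)
		}
	}
}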

@emilevauge (Member)

I just created PR #1141, which fixes this issue. It will be released in the next RC :)

@emilevauge (Member)

Fixed by #1141

@traefik traefik locked and limited conversation to collaborators Sep 1, 2019