Querying /health causes traefik to crash eventually #1013

Closed
zinob opened this issue Jan 3, 2017 · 5 comments

zinob commented Jan 3, 2017

What version of Traefik are you using (traefik version)?

Tested on 1.0.2 and 1.1.2.

What is your environment & configuration (arguments, toml...)?

Started using docker-compose, with a docker-compose.yml somewhat like:

version: '2'
services:
  net-traefik:
    image: traefik
    restart: "always"
    command: --web --docker --docker.domain=example.com --logLevel=DEBUG
    ports:
      - "80:80"
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /dev/null:/traefik.toml
  net-server:
    labels:
      - "traefik.backend=server"
      - "traefik.port=80"
      - "traefik.frontend.rule=PathPrefix:/rest"
  net-frontend:
    labels:
      - "traefik.backend=frontend"
      - "traefik.port=80"
      - "traefik.frontend.rule=PathPrefix:/"

What did you do?

I discovered this because Traefik stopped responding while I was using collectd's curl_json plugin to gather usage statistics from 127.0.0.1:8080/health.
In that setup it hangs after about 10 days, or roughly 14,400 queries.
It can be provoked in a few minutes on my laptop with:
yes http://127.0.0.1:8080/health | xargs curl -s > /dev/null
Strangely, it then takes roughly 300k-700k queries. No excessive RAM usage was detected, and this does not happen if I request http://127.0.0.1:8080/ or http://127.0.0.1/foo.
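
(For anyone scripting the reproduction: below is a rough Go equivalent of the shell loop, a sketch that assumes the dashboard/API is on 127.0.0.1:8080 as in the compose file above. It drains each response, counts completed requests, and aborts with the count as soon as a request times out, which makes the point where Traefik freezes easy to spot.)

// Hypothetical reproduction helper, not part of the original report: a Go
// equivalent of `yes ... | xargs curl -s`, assuming Traefik's dashboard/API
// is reachable on 127.0.0.1:8080. It stops and prints the request count as
// soon as a request exceeds the timeout, i.e. when Traefik appears to hang.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 1; ; i++ {
		resp, err := client.Get("http://127.0.0.1:8080/health")
		if err != nil {
			log.Fatalf("request %d failed or timed out (Traefik likely hung): %v", i, err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
		if i%10000 == 0 {
			fmt.Println("requests completed:", i)
		}
	}
}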

What did you expect to see?

A JSON answer.

What did you see instead?

A stuck socket and a complete failure to serve web pages.

@joonas-fi

I think I encountered this as well, just now.

Everything had been running completely fine. The very moment I opened the health page:

  • The graphs were empty and not populating.
  • Any services behind Traefik were not loading in my browser. At first I wondered if my Internet was down.
  • Nope, my Internet was not at fault: my phone started going off because my monitoring immediately started seeing timeouts.

I'm running d15a17b6347b867014c37f4bdb40664e85733d77 / cheddar.

There was nothing in the logs.

I don't think it's simply a "stuck socket", since the admin UI serving the /health endpoint is on a totally different port.

Now I am definitely not opening the "health" page in production again until this is fixed.

@joonas-fi

This seems like a regression. This was fixed in #458 but it came back.

That issue says it was fixed by setting the vendor pointer to thoas/stats@152b5d0 (Jul 26, 2016). However, current master points to thoas/stats@79b768f (Jul 18, 2016, i.e. older, with the bug). All vendor pointers are stored in /glide.lock.

Why was it downgraded?

The downgrade traces back to commit dacde21, with the message "Fix glide.yml go-units", but it changes many more dependency versions than just go-units. Was this an error, @emilevauge?

@emilevauge emilevauge added bug priority/P0 needs hot fix labels Feb 7, 2017
@emilevauge emilevauge added this to the 1.2 milestone Feb 7, 2017
@joonas-fi

I managed to replicate the error with @zinob's method. BTW, I love the clever use of yes! :)

You can also pipe it through pv to watch the response rate interactively (and thereby notice when Traefik freezes):

$ yes http://127.0.0.1:8080/health | xargs curl -s | pv > /dev/null

Once Traefik hangs, I run $ kill -3 $TRAEFIK_PID and inspect the logs:

$ cat traefik-stderr.log | grep 'goroutine ' | wc -l
745

$ cat traefik-stderr.log
... search for something interesting ...
goroutine 18 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0xc4203080c4)
        /usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0xc4203080c0)
        /usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0xc4203080c0)
        /usr/local/go/src/sync/rwmutex.go:86 +0x31
github.com/containous/traefik/vendor/github.com/thoas/stats.(*Stats).ResetResponseCounts(0xc4203080c0)
        /go/src/github.com/containous/traefik/vendor/github.com/thoas/stats/stats.go:44 +0x31
github.com/containous/traefik/vendor/github.com/thoas/stats.New.func1(0xc4203080c0)
        /go/src/github.com/containous/traefik/vendor/github.com/thoas/stats/stats.go:33 +0x2b
created by github.com/containous/traefik/vendor/github.com/thoas/stats.New
        /go/src/github.com/containous/traefik/vendor/github.com/thoas/stats/stats.go:37 +0x169

$ grep 0xc4203080cc traefik-stderr.log
sync.runtime_Semacquire(0xc4203080cc)
sync.runtime_Semacquire(0xc4203080cc)
sync.runtime_Semacquire(0xc4203080cc)
... repeat 178 lines ...

# a huge number of goroutines blocked acquiring that same semaphore, indicating a deadlock I guess:
$ grep 0xc4203080cc traefik-stderr.log | wc -l
181

Extra reading on debugging deadlocks in Go: https://lincolnloop.com/blog/lesson-learned-while-debugging-botbotme/
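
Reading the dump: the stats library's background reset goroutine (spawned by stats.New, stats.go:37) has been stuck for minutes trying to take the write lock, while ~180 other goroutines queue on the same RWMutex. Below is a minimal, self-contained Go sketch of that general reader/writer deadlock pattern. It is not the actual thoas/stats code; the nested RLock is a hypothetical stand-in for whatever keeps the readers from draining. The key property (documented on sync.RWMutex) is that a pending Lock() blocks later RLock() calls, so a goroutine that re-acquires the read lock while already holding it can wedge against the periodic resetter, and everything else then piles up behind the writer.

package main

import (
	"fmt"
	"sync"
	"time"
)

// statsLike mimics the shape visible in the stack trace: request handlers
// read the counters under a read lock, while a background goroutine
// periodically takes the write lock to reset them.
type statsLike struct {
	mu     sync.RWMutex
	counts map[int]int
}

func (s *statsLike) resetResponseCounts() {
	s.mu.Lock() // waits for all readers; meanwhile NEW readers are blocked too
	defer s.mu.Unlock()
	s.counts = map[int]int{}
}

func (s *statsLike) handleRequest() {
	s.mu.RLock()
	defer s.mu.RUnlock()
	// Hypothetical nested read lock. Harmless on its own, but if the
	// resetter queues its Lock() between the two RLock() calls, the inner
	// RLock() blocks behind the writer, the writer blocks behind the outer
	// RLock(), and nothing ever makes progress again.
	s.mu.RLock()
	defer s.mu.RUnlock()
	_ = s.counts[200]
}

func main() {
	s := &statsLike{counts: map[int]int{}}

	// Background resetter, analogous to the goroutine created by stats.New
	// in the dump.
	go func() {
		for {
			time.Sleep(time.Millisecond)
			s.resetResponseCounts()
		}
	}()

	// Hammer the "handler"; this usually wedges within seconds. In this toy
	// program the Go runtime then aborts with "all goroutines are asleep -
	// deadlock!"; in a real server other goroutines keep running, so the
	// process simply hangs, the way the kill -3 dump shows.
	for i := 1; ; i++ {
		s.handleRequest()
		if i%1000000 == 0 {
			fmt.Println("requests handled:", i)
		}
	}
}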

@emilevauge (Member)

I just created PR #1141, which fixes this issue. It will be released in the next RC :)

@emilevauge (Member)

Fixed by #1141

@traefik traefik locked and limited conversation to collaborators Sep 1, 2019