BUG REPORT

Seen in the logs on a Kubernetes cluster. Sometimes the process would restart a few times and then work; sometimes it would restart endlessly.
Scope version 1.11.6 (1.11.4 and 1.11.3 showed the same symptom).
I believe this is a race condition caused by running getInitialState() on a background thread: if the eBPF tracker is shut down before that call has finished, the probe crashes.
These nodes have a very large number of TCP connections: conntrack -L reports 245,155 on one of them, of which only around 3,000 are in TIME_WAIT.
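A minimal, self-contained sketch of how I think the race plays out (the type and method names below only approximate the real code in probe/endpoint; this is not the actual scope implementation):

```go
package main

import (
	"sync"
	"time"
)

type ebpfTracker struct {
	ready bool
}

func (t *ebpfTracker) feedInitialConnections() {
	// Dereferencing the receiver panics if t == nil, which is what the
	// stack trace below shows (feedInitialConnections called on 0x0).
	_ = t.ready
}

type connectionTracker struct {
	mu      sync.Mutex
	tracker *ebpfTracker
}

// ReportConnections-style path: kick off the initial connection walk on a
// background goroutine while the main loop keeps running.
func (c *connectionTracker) reportConnections() {
	go func() {
		time.Sleep(10 * time.Millisecond)  // simulate a slow getInitialState()
		c.tracker.feedInitialConnections() // c.tracker may already be nil here
	}()
}

// stopTracker models "ebpf tracker died, gently falling back to proc scanning".
func (c *connectionTracker) stopTracker() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tracker = nil
}

func main() {
	c := &connectionTracker{tracker: &ebpfTracker{}}
	c.reportConnections()             // background goroutine starts
	c.stopTracker()                   // tracker torn down before the goroutine runs
	time.Sleep(50 * time.Millisecond) // goroutine now panics with a nil dereference
}
```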
Logs:
2019-10-13T19:27:25.821787845Z time="2019-10-13T19:27:25Z" level=info msg="publishing to: https://<redacted>@cloud.weave.works."
2019-10-13T19:27:25.821833615Z <probe> INFO: 2019/10/13 19:27:25.821238 Basic authentication disabled
2019-10-13T19:27:25.849864019Z <probe> INFO: 2019/10/13 19:27:25.849501 command line args: --mode=probe --probe-only=true --probe.docker=true --probe.docker.bridge=docker0 --probe.kubernetes.role=host https://<elided>@cloud.weave.works.
2019-10-13T19:27:25.849895079Z <probe> INFO: 2019/10/13 19:27:25.849589 probe starting, version 1.11.6, ID 4240307090b3c701
2019-10-13T19:27:25.856663104Z <probe> WARN: 2019/10/13 19:27:25.856137 Cannot resolve 'scope.weave.local.': dial tcp 172.17.0.1:53: connect: connection refused
2019-10-13T19:27:26.177284533Z <probe> INFO: 2019/10/13 19:27:26.176898 Control connection to cloud.weave.works. starting
2019-10-13T19:27:26.563771966Z <probe> WARN: 2019/10/13 19:27:26.563445 Error collecting weave status, backing off 10s: Get http://127.0.0.1:6784/report: dial tcp 127.0.0.1:6784: connect: connection refused. If you are not running Weave Net, you may wish to suppress this warning by launching scope with the `--weave=false` option.
2019-10-13T19:27:26.604484396Z <probe> ERRO: 2019/10/13 19:27:26.604152 tcp tracer received event with timestamp 12652644202032895 even though the last timestamp was 12652644205786052. Stopping the eBPF tracker.
2019-10-13T19:27:26.604712068Z <probe> ERRO: 2019/10/13 19:27:26.604464 tcp tracer received event with timestamp 12652644202138598 even though the last timestamp was 12652644205786052. Stopping the eBPF tracker.
2019-10-13T19:27:26.624841063Z <probe> INFO: 2019/10/13 19:27:26.624417 Publish loop for cloud.weave.works. starting
2019-10-13T19:27:26.715718105Z <probe> WARN: 2019/10/13 19:27:26.714537 Dropping report to cloud.weave.works.
2019-10-13T19:27:26.758087095Z <probe> WARN: 2019/10/13 19:27:26.757677 Dropping report to cloud.weave.works.
[...]
2019-10-13T19:27:27.629609655Z <probe> WARN: 2019/10/13 19:27:27.629252 Dropping report to cloud.weave.works.
2019-10-13T19:27:27.661782679Z <probe> WARN: 2019/10/13 19:27:27.659335 ebpf tracker died, restarting it
2019-10-13T19:27:27.662414461Z <probe> WARN: 2019/10/13 19:27:27.661235 Dropping report to cloud.weave.works.
[...]
2019-10-13T19:27:28.205549324Z <probe> WARN: 2019/10/13 19:27:28.205110 Dropping report to cloud.weave.works.
2019-10-13T19:27:28.659833456Z <probe> WARN: 2019/10/13 19:27:28.659448 Endpoint reporter took longer than 1s
2019-10-13T19:27:28.820249185Z <probe> WARN: 2019/10/13 19:27:28.819872 Endpoint reporter took 1.160715041s (longer than 1s)
2019-10-13T19:27:28.878843652Z <probe> ERRO: 2019/10/13 19:27:28.875629 tcp tracer received event with timestamp 12652646202038856 even though the last timestamp was 12652646204876031. Stopping the eBPF tracker.
2019-10-13T19:27:28.878955243Z <probe> ERRO: 2019/10/13 19:27:28.875708 tcp tracer received event with timestamp 12652646202110612 even though the last timestamp was 12652646204876031. Stopping the eBPF tracker.
2019-10-13T19:27:28.963800677Z <probe> WARN: 2019/10/13 19:27:28.961378 ebpf tracker died again, gently falling back to proc scanning
2019-10-13T19:27:31.067514362Z panic: runtime error: invalid memory address or nil pointer dereference
2019-10-13T19:27:31.067560271Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x15a49fa]
2019-10-13T19:27:31.067568748Z
2019-10-13T19:27:31.067573672Z goroutine 1545 [running]:
2019-10-13T19:27:31.067579267Z github.com/weaveworks/scope/probe/endpoint.(*EbpfTracker).feedInitialConnections(0x0, 0x22d8fc0, 0xc42104ca80, 0xc425108600, 0xc428593f48, 0x0, 0x0, 0xc420b7f8e0, 0x18)
2019-10-13T19:27:31.067583045Z /go/src/github.com/weaveworks/scope/probe/endpoint/ebpf.go:324 +0x3a
2019-10-13T19:27:31.067586942Z github.com/weaveworks/scope/probe/endpoint.(*connectionTracker).getInitialState(0xc420628cc0)
2019-10-13T19:27:31.067590264Z /go/src/github.com/weaveworks/scope/probe/endpoint/connection_tracker.go:186 +0x2bc
2019-10-13T19:27:31.06759352Z created by github.com/weaveworks/scope/probe/endpoint.(*connectionTracker).ReportConnections
2019-10-13T19:27:31.067596559Z /go/src/github.com/weaveworks/scope/probe/endpoint/connection_tracker.go:99 +0x383