
Ingester panics if partition is reassigned #1200

Closed
lw346 opened this issue Nov 19, 2018 · 3 comments · Fixed by #1485

Comments


lw346 commented Nov 19, 2018

Requirement - what kind of business use case are you trying to solve?

The ingester should not fall over when a partition is reassigned to another consumer.

Problem - what in Jaeger blocks you from solving the requirement?

There is an existing jaeger-spans topic (using the protobuf encoder) with 6 partitions across 6 Kafka nodes.

When running with multiple ingesters (v1.8.0), each ingester in turn will crash with logs that look like the following:

{"level":"info","ts":1542640177.7420416,"caller":"cassandra/factory.go:92","msg":"Cassandra archive storage configuration is empty, skipping"}
{"level":"info","ts":1542640177.7505202,"caller":"ingester/main.go:100","msg":"Registering metrics handler with HTTP server","route":"/metrics"}
{"level":"info","ts":1542640177.7506182,"caller":"ingester/main.go:106","msg":"Starting HTTP server","http-port":14271}
{"level":"info","ts":1542640177.7506611,"caller":"healthcheck/handler.go:133","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1542640177.7506845,"caller":"consumer/consumer.go:76","msg":"Starting main loop"}
{"level":"info","ts":1542640221.4679706,"caller":"consumer/consumer.go:154","msg":"Starting error handler","partition":4}
{"level":"info","ts":1542640221.4680126,"caller":"consumer/consumer.go:154","msg":"Starting error handler","partition":5}
{"level":"info","ts":1542640221.4679477,"caller":"consumer/consumer.go:105","msg":"Starting message handler","partition":5}
{"level":"info","ts":1542640221.4680393,"caller":"consumer/consumer.go:105","msg":"Starting message handler","partition":4}
{"level":"info","ts":1542640221.4846787,"caller":"consumer/processor_factory.go:62","msg":"Creating new processors","partition":5}
{"level":"info","ts":1542640221.4894123,"caller":"consumer/processor_factory.go:62","msg":"Creating new processors","partition":4}
{"level":"info","ts":1542640227.298487,"caller":"consumer/consumer.go:163","msg":"Finished handling errors","partition":5}
{"level":"info","ts":1542640227.3106964,"caller":"consumer/consumer.go:163","msg":"Finished handling errors","partition":4}
{"level":"info","ts":1542640227.3175905,"caller":"consumer/consumer.go:123","msg":"Message channel closed. ","partition":4}
{"level":"info","ts":1542640227.3269305,"caller":"consumer/consumer.go:123","msg":"Message channel closed. ","partition":5}
{"level":"info","ts":1542640227.9139948,"caller":"processor/parallel_processor.go:78","msg":"Completed shutdown of processor goroutines"}
{"level":"info","ts":1542640227.9140894,"caller":"consumer/consumer.go:147","msg":"Closing partition consumer","partition":5}
{"level":"info","ts":1542640227.9141328,"caller":"consumer/consumer.go:150","msg":"Closed partition consumer","partition":5}
panic: counter cannot decrease in value

goroutine 895 [running]:
github.com/jaegertracing/jaeger/vendor/github.com/prometheus/client_golang/prometheus.(*counter).Add(0xc000790600, 0xbff0000000000000)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/prometheus/client_golang/prometheus/counter.go:71 +0xa3
github.com/jaegertracing/jaeger/vendor/github.com/uber/jaeger-lib/metrics/prometheus.(*counter).Inc(0xc0006b42a0, 0xffffffffffffffff)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/uber/jaeger-lib/metrics/prometheus/factory.go:183 +0x46
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).handleMessages(0xc0004c4300, 0xf08c60, 0xc00054e630)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:124 +0x893
created by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).Start.func1
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:87 +0xbd

This results in ingesters never being able to handle any spans, as each ingester only runs for a short time before crashing and being restarted.

Proposal - what do you suggest to solve the problem or improve the existing situation?

n/a

Any open questions to address

n/a

@yurishkuro
Member

Looks like we need to change the counter that tracks actively used partitions to a gauge. And we might want to emit it more often than just once when starting to consume a partition, as otherwise a dashboard query may not find any samples from the time series.

cc @vprithvi
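
A rough sketch of that idea, assuming jaeger-lib's `metrics.Gauge` interface (which only exposes `Update(int64)`); the `partitionTracker` type and its method names are illustrative, not Jaeger's actual code:

```
package consumer

import (
	"sync/atomic"
	"time"

	"github.com/uber/jaeger-lib/metrics"
)

// partitionTracker reports how many partitions this ingester currently holds
// via a gauge, which, unlike a counter, is allowed to decrease.
type partitionTracker struct {
	held  int64
	gauge metrics.Gauge
}

func (t *partitionTracker) claim()   { t.gauge.Update(atomic.AddInt64(&t.held, 1)) }
func (t *partitionTracker) release() { t.gauge.Update(atomic.AddInt64(&t.held, -1)) }

// emit republishes the current value periodically so the time series has recent
// samples even when no rebalance has happened for a while.
func (t *partitionTracker) emit(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			t.gauge.Update(atomic.LoadInt64(&t.held))
		case <-stop:
			return
		}
	}
}
```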

@yurishkuro
Member

I ran into a similar issue in #1253; a temporary workaround is to use the expvar metrics backend.

bobrik added a commit to bobrik/jaeger that referenced this issue Apr 23, 2019
Counters cannot be decremented in Prometheus:

```
panic: counter cannot decrease in value

goroutine 895 [running]:
github.com/jaegertracing/jaeger/vendor/github.com/prometheus/client_golang/prometheus.(*counter).Add(0xc000790600, 0xbff0000000000000)
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/prometheus/client_golang/prometheus/counter.go:71 +0xa3
github.com/jaegertracing/jaeger/vendor/github.com/uber/jaeger-lib/metrics/prometheus.(*counter).Inc(0xc0006b42a0, 0xffffffffffffffff)
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/uber/jaeger-lib/metrics/prometheus/factory.go:183 +0x46
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).handleMessages(0xc0004c4300, 0xf08c60, 0xc00054e630)
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:124 +0x893
created by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).Start.func1
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:87 +0xbd
```

Gauges can be decremented, even though we have to keep an extra variable
around to track the count. In the Prometheus Go library itself that is not
necessary, as the Gauge type provides `Inc` and `Dec`, but Jaeger's wrapper
does not expose those.

Fixes jaegertracing#1200.

Signed-off-by: Ivan Babrou <[email protected]>
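
For contrast, a small sketch of that last point: the native client_golang `Gauge` exposes `Inc`/`Dec` directly, while the jaeger-lib wrapper does not, hence the extra variable; the metric name below is illustrative only:

```
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// The native client_golang gauge can move in both directions directly.
	partitionsHeld := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "jaeger_ingester_partitions_held", // illustrative name only
		Help: "Number of Kafka partitions currently held by this ingester.",
	})
	prometheus.MustRegister(partitionsHeld)

	partitionsHeld.Inc() // partition assigned
	partitionsHeld.Dec() // partition revoked

	// jaeger-lib's metrics.Gauge only exposes Update(value int64), so code built
	// on the wrapper has to track the current count itself and push it on change.
}
```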
@bobrik
Contributor

bobrik commented Apr 23, 2019

#1485 addresses the issue.

yurishkuro pushed a commit that referenced this issue Apr 23, 2019
* Switch from counter to a gauge for partitions held

Counters cannot be decremented in Prometheus:

```
panic: counter cannot decrease in value

goroutine 895 [running]:
github.com/jaegertracing/jaeger/vendor/github.com/prometheus/client_golang/prometheus.(*counter).Add(0xc000790600, 0xbff0000000000000)
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/prometheus/client_golang/prometheus/counter.go:71 +0xa3
github.com/jaegertracing/jaeger/vendor/github.com/uber/jaeger-lib/metrics/prometheus.(*counter).Inc(0xc0006b42a0, 0xffffffffffffffff)
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/uber/jaeger-lib/metrics/prometheus/factory.go:183 +0x46
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).handleMessages(0xc0004c4300, 0xf08c60, 0xc00054e630)
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:124 +0x893
created by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).Start.func1
    /home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:87 +0xbd
```

Gauges can be decremented, even though we have to keep an extra variable
around to track the count. In the Prometheus Go library itself that is not
necessary, as the Gauge type provides `Inc` and `Dec`, but Jaeger's wrapper
does not expose those.

Fixes #1200.

Signed-off-by: Ivan Babrou <[email protected]>

* Protect partitionsHeld in consumer by lock

Signed-off-by: Ivan Babrou <[email protected]>
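
A minimal sketch of the lock-protected gauge described in the second commit bullet, again assuming jaeger-lib's `metrics.Gauge`; the struct and method names are hypothetical, not the actual diff:

```
package consumer

import (
	"sync"

	"github.com/uber/jaeger-lib/metrics"
)

// consumerMetrics stands in for the fields added to the Consumer; only the
// locking pattern matters here.
type consumerMetrics struct {
	partitionsHeldLock  sync.Mutex
	partitionsHeld      int64
	partitionsHeldGauge metrics.Gauge
}

// addPartition and removePartition mutate the shared count under the lock so
// concurrent partition handlers cannot race on it, then publish the new value.
func (c *consumerMetrics) addPartition() {
	c.partitionsHeldLock.Lock()
	defer c.partitionsHeldLock.Unlock()
	c.partitionsHeld++
	c.partitionsHeldGauge.Update(c.partitionsHeld)
}

func (c *consumerMetrics) removePartition() {
	c.partitionsHeldLock.Lock()
	defer c.partitionsHeldLock.Unlock()
	c.partitionsHeld--
	c.partitionsHeldGauge.Update(c.partitionsHeld)
}
```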