Collector crashing with invalid memory address or nil pointer dereference on collector exposing a GRPC endpoint behind an AWS ALB #9296
Comments
I also have issues with 0.92.0 behind an AWS ALB, which worked fine in all previous versions I've run (including 0.91.0); mine just fails the health checks and I don't get why. But I do seem to get the same error message as you when I restart the service (via systemd).
I'm running this on al2023 (but I have seen the same issue on amzn2 using x86 as well).
uname -a
config:
Thanks for reporting! Might be a regression in the OTel Go gRPC instrumentation. Looking into it. cc @open-telemetry/go-instrumentation-approvers
Some findings:
The diff between v1.59.0 and v1.60.0 has these two PRs which are potentially related to the instrumentation code: grpc/grpc-go/pull/6716 and grpc/grpc-go/pull/6750. 6716 looks more suspicious.
Referenced commit ("…ndleRPC"): To help prevent crashes such as open-telemetry/opentelemetry-collector/issues/9296, check if the context is nil before proceeding. This is inspired by: https://github.com/grpc/grpc-go/blob/ddd377f19841eae70862559c854d957d61b3b692/stats/opencensus/opencensus.go#L204-L207
I don't know why the crash happens, but open-telemetry/opentelemetry-go-contrib/pull/4825 would prevent it, and it follows existing practice in grpc-go, so I think this would at least make the issue less serious.
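For readers following along, here is a minimal sketch of the kind of guard described above. It is not the actual otelgrpc code; rpcContextKey and rpcContext are hypothetical stand-ins for the package's internal per-RPC state, and the event handling is reduced to two cases:

```go
package statsguard

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc/stats"
)

// rpcContextKey and rpcContext are hypothetical stand-ins for otelgrpc's
// internal per-RPC key/value pair; they are not the real types.
type rpcContextKey struct{}

type rpcContext struct {
	messagesReceived int64
	messagesSent     int64
}

// tagRPC is what a stats handler's TagRPC typically does: attach per-RPC
// state to the context so later events can find it.
func tagRPC(ctx context.Context) context.Context {
	return context.WithValue(ctx, rpcContextKey{}, &rpcContext{})
}

// handleRPC shows the guard: if the event's context never went through
// tagRPC (or lost the value), bail out instead of dereferencing nil.
func handleRPC(ctx context.Context, rs stats.RPCStats) {
	gctx, _ := ctx.Value(rpcContextKey{}).(*rpcContext)
	if gctx == nil {
		// Shouldn't happen in the normal request flow, but don't crash.
		return
	}

	switch rs.(type) {
	case *stats.InPayload:
		atomic.AddInt64(&gctx.messagesReceived, 1)
	case *stats.OutPayload:
		atomic.AddInt64(&gctx.messagesSent, 1)
	}
}
```

The point is only that a stats event arriving on an untagged context should become a no-op rather than a nil dereference.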
@tcoville-coveo @neeej can you confirm that restarting the collector w/ the following feature gate prevents the issue from occurring:
I'm wondering if this issue is caused by the change to using otel for internal metrics, but it might not be.
@codeboten I tried what you suggested but I didn't notice any difference; the container is still crashing with the same error. FYI I deploy via the operator, so I added the following config in the CRD:
which results in the pod args generated as:
I also get the same error on restarts with that setting.
Filed grpc/grpc-go/issues/6928
Referenced PR #4825 ("…eRPC receives a gRPCContext"): [instrumentation/google.golang.org/grpc/otelgrpc] Add safeguard to HandleRPC. To help prevent crashes such as open-telemetry/opentelemetry-collector/issues/9296, check if the context is nil before proceeding. This is inspired by: https://github.com/grpc/grpc-go/blob/ddd377f19841eae70862559c854d957d61b3b692/stats/opencensus/opencensus.go#L204-L207 (Co-authored-by: Anthony Mirabella <[email protected]>)
Referenced commit/PR: Potentially fix #9296 (Signed-off-by: Alex Boten <[email protected]>)
I think we should not close this issue until grpc/grpc-go#6928 is resolved, or we should file another one. It sounds potentially pretty impactful. Maybe even worth putting a release blocker label on it. @codeboten WDYT?
Hello, I replied in the issue in gRPC-Go, but is there a way to reproduce this? I wrote a test locally and triaged this repo's stats handler code and it all seems to be in good order (my test had no problem accessing context values in the server-side trailers case from WriteStatus). Here to help; hopefully we can get to the bottom of this.
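As a possible starting point for a reproduction, here is a sketch of an isolation test that exercises the condition the guard in contrib PR #4825 defends against: HandleRPC being invoked with a context that never passed through TagRPC. Whether this matches the ALB scenario is an assumption, and the test name and the choice of a stats.End event are illustrative only:

```go
package otelgrpcrepro

import (
	"context"
	"testing"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc/stats"
)

// Feed the server-side stats handler an End event whose context was never
// tagged by TagRPC. Before the nil-value guard this code path could
// dereference a nil internal pointer; with the guard it should be a no-op.
func TestHandleRPCWithUntaggedContext(t *testing.T) {
	h := otelgrpc.NewServerHandler()
	now := time.Now()
	h.HandleRPC(context.Background(), &stats.End{
		BeginTime: now,
		EndTime:   now,
	})
}
```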
@dmitryax I think it's fine to keep it open until it's been confirmed to be fixed.
@tcoville-coveo @neeej the latest dev build of contrib is available; would you be able to test it out and see if the original crash occurs? No worries if not, we just haven't been able to repro on our end. The docker tag is:
@codeboten YES! the issue is fixed with the
Thanks to you both for providing us with feedback and detailed information about this as well :) While the issue is fixed at the Collector level, it would be great to also fix it in grpc-go. It would be extremely helpful if you could help us reproduce this locally @tcoville-coveo @neeej. To this end, I think the following would be helpful:
Anything else you can think of that would help is also welcome :)
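One idea that might help with a local reproduction, if the ALB probe theory is right: mimic the probe as an HTTP/2 POST with a gRPC content type to the health-check path, over plaintext (the ALB terminates TLS and speaks plaintext to the target). The sketch below is a guess at that traffic shape, not a faithful ALB emulation; localhost:4317 and the / path are assumptions taken from this thread:

```go
// probe.go: a rough local stand-in for an ALB-style gRPC health probe.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	// h2c client: HTTP/2 over a plaintext TCP connection, similar to how
	// the ALB talks to a plaintext gRPC target after terminating TLS.
	client := &http.Client{
		Transport: &http2.Transport{
			AllowHTTP: true,
			DialTLS: func(network, addr string, _ *tls.Config) (net.Conn, error) {
				return net.Dial(network, addr)
			},
		},
	}

	// Probe "/" with a gRPC content type but no real gRPC method,
	// similar to an ALB gRPC health check configured for path "/".
	req, err := http.NewRequest(http.MethodPost, "http://localhost:4317/", http.NoBody)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/grpc")

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	// For a non-existent method, grpc-go typically answers with a
	// trailers-only response; the grpc-status may show up in the headers.
	fmt.Println("HTTP status:", resp.Status)
	fmt.Println("grpc-status:", resp.Header.Get("Grpc-Status"))
}
```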
I think this issue can be marked as solved. I would still want to keep the discussion going to help the folks at grpc-go and prevent further similar issues.
I have an AWS ALB configured which has a listener on HTTPS:4317 (and a certificate attached to it). The target group has a health check configured to check for valid gRPC codes 0-99. I use a form of this as my AWS CF template, if that might help:
I looked a bit further at my logs, and it does crash the same way as it does for @tcoville-coveo. So I stripped my config down to this, and I still see the same issue:
I also tried installing the otelcol rpm instead, used the above config, and that also gave me the same result. I tried running otelcol-contrib with each of the options below, one at a time, and got the same result then as well:
Then I installed https://github.com/rmedvedev/grpcdump and got the below output from it:
When compared to the output I get from an instance running 0.91.0 (and with the config from my first post in this issue), it looks very different:
Let me know if I should test something more.
I confirm that the issue is fixed for me on 0.93.0 🎉
I am going to close this as done; further discussion can happen over at grpc/grpc-go#6928. Thanks again everybody :)
Describe the bug
Collector is crashing continuously a few seconds after starting up, after upgrading to 0.92.0.
Steps to reproduce
I have several collectors running with the same version, but only one is having the issue. The only difference I can see is that it is used as a gateway behind an AWS ALB. The ALB health check in gRPC polls on the / path of the receiver, and the failures seem to correlate with the probe frequency. I also noted that if I remove the ingress the error stops (along with the traffic, so I'm not sure this proves anything...).
I also tried downgrading the relay collector to 0.91.0 and sending the traces "as-is" to a second collector running on 0.92.0 with the same config but without ingress, and I don't get the error, so it would tend to confirm the ALB probe theory...
What did you expect to see?
The collector should at least survive the error and log a meaningful message
What did you see instead?
What version did you use?
otel-operator: 0.91.0, collector image: 0.92.0
What config did you use?
collector:
ingress:
Environment
otel-contrib image
Additional context
I did not manage to repro locally on my machine; the issue only seems to happen on the cluster.