-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trace Pub/Sub publishing #1
base: main
Are you sure you want to change the base?
Conversation
The Pub/Sub client publishes messages by placing them onto a client buffer and having a bundler (working asynchronously) pull them from the buffer and submit them in batch to GCP. This means there's a disconnect from our app publishing messages and the process that publishes them. As we're seeing timeouts around publishing we want to connect the app publish to the bundler publish. We achieve this by: 1. Adding tracing around flow control so waiting on a topic semaphore is exposed in our traces 2. Creates spans that get linked between the application publish and the bundler publish that allow inspecting the gRPC operation that actually submits things to Pub/Sub ^[3]: open-telemetry/opentelemetry-go#5032 This isn't as nice as it could be because OTEL does not yet support adding links to spans after they've been created (see comment for more detail). We will be able to clean this up once that feature lands (see [3]).
Worth noting:
In an example where an endpoint publishes several Pub/Sub messages we see: Endpoint (POST /actions)Bundler (inside the Pub/Sub client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand here, before we attempt publishing, we have to create the entire span relationship graph for the individual messages to their parent publish span, and then attempt publishing?
Yeah sadly that is the case. I'm hoping we can clean this up when the AddLinks support lands in the Go OTEL client but for now links are immutable beyond span creation which sucks a bit. If we don't have a link from the parent to the child (links are stored on nodes, so we need to add the link to the parent) then our normal debugging flow won't work. That's why we need the additional span. Worth linking to the official feature branch for OTEL in the Pub/Sub client which I'm hoping we can move over to when it lands: googleapis#4665 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I'm the maintainer of cloud.google.com/go/pubsub, just wanted to take a look since you linked to the main otel tracing issue and was curious how your team planned to trace messages. Good news is that your implementation is fairly similar to what we plan to add, just left a few comments for context of what we've considered.
@@ -873,6 +878,83 @@ func (t *Topic) publishMessageBundle(ctx context.Context, bms []*bundledMessage) | |||
if err != nil { | |||
log.Printf("pubsub: cannot create context with tag in publishMessageBundle: %v", err) | |||
} | |||
|
|||
// We have to do some weird things to instrument this batcher to get around | |||
// OpenTelemetry limitations, namely that you can't add span links after you've created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically changing at some point soon with AddLinks
now being stable, go being tracked here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I found this after looking around for some time. Very glad it will be changing soon, if a bit too late to help us in this case.
// Dummy bundle span which is primarily for linking from the publish traces into our | ||
// context, the batcher. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we considered doing as well, but decided to implement it differently since we figure users might be confused by the presence of a dummy span. We're opting to wait until the AddLinks
API is added, so that bidirectional linking will happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a correct decision on your part. For us, the primary reason we're motivated to make this change is to allow easy linking from the code that publishes a message into the bundler (as we were seeing timeouts and other failures) so without a span we can add to the parent, that linking is much harder.
We use Cloud Trace and Grafana Tempo. It seems CT definitely doesn't do a lookup for spans which link to the one you are currently viewing as a parent, and I'm unsure Grafana does either. So a dummy span to maintain the link seemed prudent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again though: when AddLinks lands, we can mutate the parent span with a new link and everything should be fine. This worked when we used OpenCensus as it allowed adding links after span creation, alas!
The Pub/Sub client publishes messages by placing them onto a client buffer and having a bundler (working asynchronously) pull them from the buffer and submit them in batch to GCP.
This means there's a disconnect from our app publishing messages and the process that publishes them. As we're seeing timeouts around publishing we want to connect the app publish to the bundler publish.
We achieve this by:
^[3]: open-telemetry/opentelemetry-go#5032
This isn't as nice as it could be because OTEL does not yet support adding links to spans after they've been created (see comment for more detail). We will be able to clean this up once that feature lands (see [3]).