Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace Pub/Sub publishing #1

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

lawrencejones
Copy link
Member

The Pub/Sub client publishes messages by placing them onto a client buffer and having a bundler (working asynchronously) pull them from the buffer and submit them in batch to GCP.

This means there's a disconnect from our app publishing messages and the process that publishes them. As we're seeing timeouts around publishing we want to connect the app publish to the bundler publish.

We achieve this by:

  1. Adding tracing around flow control so waiting on a topic semaphore is exposed in our traces
  2. Creates spans that get linked between the application publish and the bundler publish that allow inspecting the gRPC operation that actually submits things to Pub/Sub

^[3]: open-telemetry/opentelemetry-go#5032

This isn't as nice as it could be because OTEL does not yet support adding links to spans after they've been created (see comment for more detail). We will be able to clean this up once that feature lands (see [3]).

The Pub/Sub client publishes messages by placing them onto a client
buffer and having a bundler (working asynchronously) pull them from the
buffer and submit them in batch to GCP.

This means there's a disconnect from our app publishing messages and the
process that publishes them. As we're seeing timeouts around publishing
we want to connect the app publish to the bundler publish.

We achieve this by:

1. Adding tracing around flow control so waiting on a topic semaphore is
   exposed in our traces
2. Creates spans that get linked between the application publish and the
   bundler publish that allow inspecting the gRPC operation that
   actually submits things to Pub/Sub

^[3]: open-telemetry/opentelemetry-go#5032

This isn't as nice as it could be because OTEL does not yet support
adding links to spans after they've been created (see comment for more
detail). We will be able to clean this up once that feature lands (see
[3]).
@lawrencejones
Copy link
Member Author

lawrencejones commented Mar 20, 2024

Worth noting:

  • We instrument flow control so we can see if our topic is blocking
  • The endpoint has a span that links to the bundler
  • Bundler is spanned such that it links to the publisher

In an example where an endpoint publishes several Pub/Sub messages we see:

Endpoint (POST /actions)

Screenshot 2024-03-20 at 20 15 22
Screenshot 2024-03-20 at 20 15 45

Bundler (inside the Pub/Sub client)

Screenshot 2024-03-20 at 20 15 55

Copy link

@rorymalcolm rorymalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand here, before we attempt publishing, we have to create the entire span relationship graph for the individual messages to their parent publish span, and then attempt publishing?

@lawrencejones
Copy link
Member Author

we have to create the entire span relationship graph

Yeah sadly that is the case. I'm hoping we can clean this up when the AddLinks support lands in the Go OTEL client but for now links are immutable beyond span creation which sucks a bit.

If we don't have a link from the parent to the child (links are stored on nodes, so we need to add the link to the parent) then our normal debugging flow won't work. That's why we need the additional span.

Worth linking to the official feature branch for OTEL in the Pub/Sub client which I'm hoping we can move over to when it lands: googleapis#4665

Copy link

@hongalex hongalex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I'm the maintainer of cloud.google.com/go/pubsub, just wanted to take a look since you linked to the main otel tracing issue and was curious how your team planned to trace messages. Good news is that your implementation is fairly similar to what we plan to add, just left a few comments for context of what we've considered.

@@ -873,6 +878,83 @@ func (t *Topic) publishMessageBundle(ctx context.Context, bms []*bundledMessage)
if err != nil {
log.Printf("pubsub: cannot create context with tag in publishMessageBundle: %v", err)
}

// We have to do some weird things to instrument this batcher to get around
// OpenTelemetry limitations, namely that you can't add span links after you've created

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically changing at some point soon with AddLinks now being stable, go being tracked here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I found this after looking around for some time. Very glad it will be changing soon, if a bit too late to help us in this case.

Comment on lines +943 to +944
// Dummy bundle span which is primarily for linking from the publish traces into our
// context, the batcher.
Copy link

@hongalex hongalex Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we considered doing as well, but decided to implement it differently since we figure users might be confused by the presence of a dummy span. We're opting to wait until the AddLinks API is added, so that bidirectional linking will happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a correct decision on your part. For us, the primary reason we're motivated to make this change is to allow easy linking from the code that publishes a message into the bundler (as we were seeing timeouts and other failures) so without a span we can add to the parent, that linking is much harder.

We use Cloud Trace and Grafana Tempo. It seems CT definitely doesn't do a lookup for spans which link to the one you are currently viewing as a parent, and I'm unsure Grafana does either. So a dummy span to maintain the link seemed prudent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again though: when AddLinks lands, we can mutate the parent span with a new link and everything should be fine. This worked when we used OpenCensus as it allowed adding links after span creation, alas!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants