
kvserver: include Raft application on leaseholder in request trace #72738

Merged: 3 commits into cockroachdb:master from tracing.raft-application, Nov 18, 2021

Conversation

andreimatei (Contributor)

Before this patch, the decoding and application of a Raft command was
not included in the recording of the request that generated the
respective command, even in the case where consensus is synchronous with
respect to the request (i.e. non-AsyncConsensus requests). This was
because, although we plumb tracing information below Raft, the span
under which Raft processing was occurring was Fork()ed from the parent
span (i.e. the request evaluation's span). The reasons why that was so are
not good:

  1. forking (as opposed to creating a regular child) was introduced in
    #39425 (storage: create new tracing span for each replicatedCmd during
    application). I'm not sure whether there was a particular reason for this
    decision. Perhaps there was fear at the time about the child
    outliving the parent - although I don't think that it does because,
    in the case of async consensus, we fork a parent which I think will
    outlive the child:
    https://github.com/cockroachdb/cockroach/blame/13669f9c9bd92a4c3b0378a558d7735f122c4e72/pkg/kv/kvserver/replica_raft.go#L157

    Regardless of the exact details of the lifetimes, with time it has become
    clear that it's appropriate to create a child when it's expected that
    it will usually not outlive the parent even if, on occasion, it can
    outlive it.

  2. forked spans used to be included in the parent's recording until #59815
    (tracing: use manual collection in ForkCtxSpan).
    This explains why we regressed here - Raft application used to be
    included in recordings properly.

This patch makes the application span a regular child, so the Raft
application span is included in the request trace.

Touches #70864. That issue asks for a new tracing feature but I think
what motivated it is the application info missing from query traces.
This patch is sufficient for that.

Release note (general change): Tracing transaction commits now includes
details about replication.
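
For illustration, a minimal sketch of the shape of the change, assuming the
pkg/util/tracing helpers of this era (ForkSpan/ChildSpan); the function name
is hypothetical and this is not the verbatim diff:

    package kvserver

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/util/tracing"
    )

    // applicationSpan derives the span under which a replicated command is
    // decoded and applied from the proposal's (traced) context.
    func applicationSpan(ctx context.Context) (context.Context, *tracing.Span) {
        // Before: a forked span. Since #59815, forked spans are no longer
        // part of the parent's recording, so application vanished from
        // request traces.
        //
        //   return tracing.ForkSpan(ctx, "raft application")

        // After: a regular child span, included in the parent's recording and
        // therefore in the trace of the request that proposed the command.
        return tracing.ChildSpan(ctx, "raft application")
    }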

@cockroach-teamcity (Member)

This change is Reviewable

@erikgrinaker (Contributor)

Nice, thanks!

When I run a single-node cluster I get a populated span (I added the processing command event myself):

     2.105ms      0.109ms                        === operation:raft application _verbose:1 node:1 range:44/1:/{Table/52-Max} store:1
     5.015ms      2.910ms                        event:kv/kvserver/replica_application_state_machine.go:442 [n=1,s=1,r=44/1:/{Table/52-Max}] processing command "\x0e\x19\xa4\f\xb8\xf0\xbe\x05": raftIndex=15 maxLeaseIndex=2 closedts=1636972833.981371611,0
     5.037ms      0.022ms                        event:kv/kvserver/replica_application_state_machine.go:464 [n=1,s=1,r=44/1:/{Table/52-Max}] applying command
     5.168ms      0.131ms                        event:kv/kvserver/replica_application_state_machine.go:1084 [n=1,s=1,r=44/1:/{Table/52-Max}] LocalResult (reply: (err: <nil>), *roachpb.ConditionalPutResponse, *roachpb.EndTxnResponse, #encountered intents: 0, #acquired locks: 0, #resolved locks: 1#updated txns: 1 #end txns: 0, GossipFirstRange:false MaybeGossipSystemConfig:false MaybeGossipSystemConfigIfHaveFailure:false MaybeAddToSplitQueue:false MaybeGossipNodeLiveness:<nil>

However, when running a 3-node cluster I only get the span operation itself, and none of the contained events, regardless of which node I'm connected to (including leaseholder):

     7.692ms      5.532ms                        === operation:raft application _unfinished:1 _verbose:1 node:3 range:44/2:/{Table/52-Max} store:3

I'll very occasionally get another event here, but usually not. Also notice the _unfinished:1 bit.

Regardless, this still shows how long was spent on replication, which is great. It'd be nice if we could also include the actual Raft replication trace at some point (including network propagation to followers and fsync latency).

@andreimatei force-pushed the tracing.raft-application branch 2 times, most recently from c1986ab to d996767 on November 15, 2021 17:04
@andreimatei (Contributor, Author) left a comment

However, when running a 3-node cluster I only get the span operation itself, and none of the contained events, regardless of which node I'm connected to (including leaseholder):

(FWIW, in case that was surprising, the node you're connected to is not supposed to matter.)
I think you've discovered that Raft application is frequently async wrt the client acknowledgement (i.e. the end of the BatchRequest RPC). My commit message deserved to call that out, and now it does.
So, in these cases, command application races with the RPC finish, which collects the trace. That's why you might see fewer or more messages, depending on timing.

I've taken the opportunity to include a log message hinting at async application, which I think would have helped. It required some plumbing; PTAL at the new prepended commit.
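
To illustrate the idea (a sketch only, assuming the util/log and
kvserver/apply package APIs; the function name and exact message are
hypothetical, the real change is in the prepended commit):

    package kvserver

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply"
        "github.com/cockroachdb/cockroach/pkg/util/log"
    )

    // ackSuccessEarly acknowledges a command's success before application,
    // leaving a breadcrumb in the trace; without such an event, a recording
    // that loses the race with async application looks mysteriously empty
    // below Raft.
    func ackSuccessEarly(ctx context.Context, cmd apply.CheckedCommand) error {
        // ctx is the command's captured context which, for local proposals,
        // derives from the proposal's (traced) context.
        log.Event(ctx, "acking command success before application")
        return cmd.AckSuccess(ctx)
    }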

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @irfansharif, @nvanbenschoten, and @tbg)

@tbg removed their request for review, November 16, 2021 12:18
@erikgrinaker (Contributor) left a comment

Yeah, suspected as much -- thanks for clarifying. The additional trace event is helpful.

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)

@irfansharif removed their request for review, November 16, 2021 15:48

Before this patch, CheckedCommand.AckSuccess() was called with a Raft
worker context. That wasn't great because each command captures a
better context to use - one that derives from the proposal's ctx in the
case of local proposals. This patch switches to using that by exposing
the captured context through the Command interface. Taking advantage of
the new ctx, we also log a message now about early acks, as it seems
like a notable hint to see in a trace.

This patch also cleans up most existing uses of that captured context to
use the new interface method; before, various code paths were type
asserting the implementation of the Command, and getting the internal
context that way. This patch moves the responsibility of deciding what
context to use upwards, to callers.

Release note: None
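
A rough sketch of the interface change described above (the shape is assumed,
not the exact diff):

    package apply

    import "context"

    // Command is (part of) the interface for a command queued for
    // application; methods other than the new Ctx accessor are elided.
    type Command interface {
        // Ctx returns the context captured by the command. For local
        // proposals this derives from the proposal's traced context; for
        // other commands it may be a background context.
        Ctx() context.Context
    }

    // Callers now decide which context to use, instead of type-asserting
    // the Command implementation and digging the context out.
    func contextForProcessing(workerCtx context.Context, cmd Command) context.Context {
        if ctx := cmd.Ctx(); ctx != nil {
            return ctx // prefer the command's captured (possibly traced) context
        }
        return workerCtx // hypothetical fallback; the real code may differ
    }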
Before this patch, the decoding and application of a Raft command was
not included in the recording of the request that generated the
respective command, even in the case where consensus is synchronous with
respect to the request (i.e. non-AsyncConsensus requests). This was
because, although we plumb tracing information below Raft, the span
under which Raft processing was occurring was Fork()ed from the parent
span (i.e. the request evaluation's span). The reasons why that was so are
not good:

1) forking (as opposed to creating a regular child) was introduced in
   cockroachdb#39425. I'm not sure whether there was a particular reason for this
   decision. Perhaps there was fear at the time about the child
   outliving the parent - although generally it doesn't because,
   in the case of async consensus, we fork a parent which I think will
   outlive the child:
   https://github.com/cockroachdb/cockroach/blame/13669f9c9bd92a4c3b0378a558d7735f122c4e72/pkg/kv/kvserver/replica_raft.go#L157

   In case of sync consensus requests, it's possible for the Raft application
   span to outlive the evaluation span in cases when the application part
   (as opposed to the consensus part) can be done async (see
   CanAckBeforeApplication). Still, regardless of the exact details of the
   lifetimes, with time it has become clear that it's appropriate to create
   a child when it's expected that it will usually not outlive the parent
   even if, on occasion, it can outlive it.

2) forked spans used to be included in the parent's recording until cockroachdb#59815.
   This explains why we regressed here - Raft application used to be
   included in recordings properly.

This patch makes the application span a regular child, so the Raft
application span is included in the request trace.

Touches cockroachdb#70864. That issue asks for a new tracing feature but I think
what motivated it is the application info missing from query traces.
This patch is sufficient for that.

Release note (general change): Tracing transaction commits now includes
details about replication.

Raft application was forking a span on followers, instead of the more
usual creation of child spans. This was done probably because the parent
span is usually a long-running Raft worker span, and adding children to
such long-running spans is generally not a great idea. But these days
there are better ways of dealing with long-running tasks - they can have
Sterile spans, meaning that the parent declares that it doesn't want any
children. And indeed the task in question uses such a Sterile span:
https://github.com/cockroachdb/cockroach/blob/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvserver/scheduler.go#L217

So, the forking is not needed and this patch changes it to a regular
child relationship.

This has a couple of benefits:
- the parent was not always a long-running task. I believe it can also
  be a snapshot application, in which case we want the span in question
  to be a regular child (e.g. because tracing a snapshot application
  should also include this span). This shows that the decision about
  what kind of relationship a child should have to its parent is not
  one the child itself should make.
- less contention on the active spans registry. Forking a span
  internally creates a "local root" span that needs to be added to the
  registry directly, instead of being added indirectly through its
  parent.

Release note: None
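
For illustration, a minimal sketch of the sterile-parent arrangement described
above, assuming the util/tracing API (WithSterile is the option used by the
scheduler at the link; the function is hypothetical): under a sterile parent,
asking for a child quietly yields a new root, so the code below Raft can
unconditionally create a regular child.

    package kvserver

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/util/tracing"
    )

    func processUnderWorker(tr *tracing.Tracer) {
        // Long-running Raft worker span: sterile, so it declares that it
        // wants no children and they don't pile up under it in the registry.
        workerSpan := tr.StartSpan("raft scheduler", tracing.WithSterile())
        defer workerSpan.Finish()
        ctx := tracing.ContextWithSpan(context.Background(), workerSpan)

        // Unconditionally create a regular child. Under the sterile worker
        // span this becomes a new root; under a traced snapshot-application
        // span it becomes a true child included in that trace.
        childCtx, childSpan := tracing.ChildSpan(ctx, "raft application")
        defer childSpan.Finish()
        _ = childCtx // decode and apply the command under childCtx
    }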
@andreimatei force-pushed the tracing.raft-application branch from d996767 to f2bc7d9 on November 18, 2021 00:58
@andreimatei (Contributor, Author) left a comment

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)

craig bot commented Nov 18, 2021

Build succeeded.

craig bot merged commit 4d82c5f into cockroachdb:master on Nov 18, 2021
@andreimatei deleted the tracing.raft-application branch on January 22, 2022 22:05