-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backupccl: add some span recordings to the backup resumer #65725
Conversation
Just wanted to get some eyes on this general approach before I went any further. Primarily, if there are any better ideas of how we would like to organize trace recording protobufs. Also welcoming any high-level pointers on if we want to be more structured about where we want to add more recording. I've just used my own discretion at the moment to identify interesting phases in the resumer. |
Are we trying to be too structured? For example:
We already have a way of tracing string values, don't we? Just thinking out loud: what are we getting for having |
We do, but if we only rely on structured recordings then we don't need the span to be verbose. Only verbose spans accumulate free form logs, and knz had recommended sticking to lightweight structured protos to avoid perf impacts. re: whether we should have these kinds of recordings at all, I believe they would serve as visual cues when pulling a jobs trace to help narrow down the "phase" in which a job is stuck? |
I meant types.StringValue -- it's a proto, and it has a "Value" in it. What do we get when using bulk specific proto messages? |
oh yes! That I'm 100% onboard with. I just made it its own proto for the time being incase I figured we needed more bulk-specific fields in it. Will change it to types.StringValue if that is not the case. |
pkg/ccl/backupccl/backup.proto
Outdated
message Spans { | ||
repeated roachpb.Span spans = 1 [(gogoproto.nullable) = false]; | ||
} | ||
map<int32, Spans> node_to_spans = 1 [(gogoproto.nullable) = false]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm curious about is how large we can/want these protos to be? Something like node_to_spans
would grow O(ranges). Is this okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah crossed my mind while I was adding this proto. I don't have a very good intuition about what serializing/deserializing an O(ranges) proto costs us. I believe we will need to serialize/deserialize if we pull the trace from a node other than the coordinator node, as well as when we use the existing payloads_for_span
generator -
payload, err := protoreflect.MessageToJSON(item, true /* emitDefaults */) |
I thought it would be valuable to know which span is being executed on where, but maybe we can get away with adding the span to the trace at the ExportRequest level, when the node actually serves the export request?
6759808
to
b19b02f
Compare
The first 4 commits are from the refactor in #65576, so the last two commits are where the changes have been made. I'm considering adding a separate PR with trace recordings for the external storage implementations. |
pkg/ccl/backupccl/backup.proto
Outdated
string retryable_error = 5; | ||
} | ||
message Response { | ||
util.hlc.Timestamp resp_received_time = 1 [(gogoproto.nullable) = false]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe duration would be more useful here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to duration since that seems more useful than a timestamp.
b19b02f
to
ac94438
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The events LGTM; same comments from the other tracing PR also apply to this (re: redaction and possible truncation of the error messages)
pkg/ccl/backupccl/backup.proto
Outdated
@@ -170,3 +170,15 @@ message RestoreProgress { | |||
int64 progressIdx = 2; | |||
roachpb.Span dataSpan = 3 [(gogoproto.nullable) = false]; | |||
} | |||
|
|||
message BackupPlanningTraceEvent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps BackupProcessorPlanningTraceEvent
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
aa94417
to
0620495
Compare
Yup, moved the helper to a utils file in tracing, and passed errors through that. |
(not sure if you wanted to have another look at this) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with naming nit
pkg/util/tracing/utils.go
Outdated
|
||
// RedactAndTruncateErrorForTracing redacts the error and truncates the string | ||
// representation of the error to a fixed length. | ||
func RedactAndTruncateErrorForTracing(err error) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we just call this RedactAndTruncateError
since it's already in the tracing
package (so callers will already see tracing.RedactAndTruncateError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
This change adds trace recordings to noteworthy places during backup exection. Release note: None
0620495
to
94d9723
Compare
Bazel CI is red because of a macOS cross-build failure. When I trigger a run manually it passes but doesn't seem to reflect here. Going to merge since I don't believe it is related to this PR, will bring it up with dev inf. bors r=pbardea |
Build succeeded: |
This change adds some initial span recordings to the backup
resumer that might be interesting when inspecting an executing
backup. The processor and KV level requests will be dealt
with in subsequent commits.
Informs: #64992
Release note: None