Consolidate subsystem spans so they are all children of the leaf-activated root span #6458
Conversation
- Add block db insertion span
- Add missing argument
- Add candidate-hash and traceID to check-and-import-approval span
node/network/availability-distribution/src/pov_requester/mod.rs
node/network/availability-distribution/src/requester/fetch_task/mod.rs
```rust
// Also spawn or bump tasks for candidates in ancestry in the same session.
for hash in std::iter::once(leaf).chain(ancestors_in_session) {
	let span = span
```
I don't think we need this kind of verbosity.
If we don't include a span here, how do you suggest we pass a span to add_cores() in the loop? Without passing a child span to add_cores(), there seems to be no way to trace the execution of FetchTask as a child span of the leaf-activated span.
I see what you mean. In that case, the span name sounds misleading.
I see that add_cores also creates an additional span. Maybe we can create the span there and just pass it a ref to the request-chunks span we created above.
Will take a look now :)
Agreed, the naming was off. I renamed this span and the one above as request-chunks-new-head and request-chunks-ancestor accordingly.
Regarding add_cores also creating an additional span:
- The add_cores call is invoked per ancestor hash. If I understood it correctly, we go [SESSION_ANCESTRY_LEN] back when requesting chunks, so we would need to associate a request-chunk-ancestor span with each ancestor.
- Inside add_cores, there is a loop over all the occupied cores, and for some of the occupied cores the entries are vacant. If an entry is vacant, a new FetchTaskConfig is created, and eventually that FetchTask will run and create the run-fetch-chunk-task span (and its children).
This way we have a relationship like this:
```
request-chunk-new-head
|
|__ request-chunk-ancestor (ancestor_0)
|   |
|   |__ check-fetch-candidate (occupied_core_0)
|   |__ check-fetch-candidate (occupied_core_1)
|   |__ fetch-task-config (when entry for occupied_core is vacant)
|   |   |__ run-fetch-chunk-task
|   |       |__ ++
|   |__ check-fetch-candidate (occupied_core_N)
|
|__ request-chunk-ancestor (ancestor_1)
    |__ check-fetch-candidate (occupied_core_0)
    |__ check-fetch-candidate (occupied_core_1)
    |__ check-fetch-candidate (occupied_core_N)
```
In this case, I think it makes the most sense to keep the additional span for the added context that it belongs to a particular ancestor.
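The nesting described above can be sketched with a toy Span type. This is hypothetical illustration only: the real jaeger span API differs, and `root`, `child`, and this `add_cores` signature are stand-ins chosen for the sketch, not the actual polkadot functions.

```rust
// Toy span that records its ancestry (root name down to its own name).
#[derive(Clone, Debug, PartialEq)]
struct Span {
    path: Vec<String>,
}

impl Span {
    fn root(name: &str) -> Self {
        Span { path: vec![name.to_string()] }
    }
    // Child spans extend the parent's path, mirroring parent/child nesting.
    fn child(&self, name: &str) -> Self {
        let mut path = self.path.clone();
        path.push(name.to_string());
        Span { path }
    }
}

// Mirrors the described add_cores behaviour: one child span per occupied
// core, all parented to the per-ancestor span passed in.
fn add_cores(parent: &Span, occupied_cores: &[&str]) -> Vec<Span> {
    occupied_cores
        .iter()
        .map(|core| parent.child(&format!("check-fetch-candidate ({core})")))
        .collect()
}

fn main() {
    let head = Span::root("request-chunks-new-head");
    for ancestor in ["ancestor_0", "ancestor_1"] {
        let ancestor_span = head.child(&format!("request-chunks-ancestor ({ancestor})"));
        for span in add_cores(&ancestor_span, &["occupied_core_0", "occupied_core_1"]) {
            println!("{}", span.path.join(" -> "));
        }
    }
}
```

Passing the per-ancestor span into add_cores is what ties every check-fetch-candidate span back to the leaf-activated root rather than leaving them as orphans.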
Got it! Thanks, it will be interesting to see anything related to what is slowing down PoV recovery.
Definitely!
Nice! Thanks @bredamatt . One last suggestion, but otherwise LGTM
```diff
@@ -281,6 +287,11 @@ impl RunningTask {
 				continue
 			},
 		};
+		drop(_chunk_fetch_span);
```
Why do you need to drop manually? Not just here but wondering in general.
By default, spans are "dropped" when they go out of scope. However, a span is not truly dropped at that point (hence the quotes). Under the hood, there are various ways a "dropped" span can be sent from a tracing agent to a tracing backend, like Grafana Tempo for example. Therefore, you may sometimes want to drop a span manually to control how and when spans are sent for further processing.
In general though, creating spans is a bit of an art, since it depends on how long one wants a particular span to be in scope, based on some higher-level understanding of how the codebase works. In this case, the drop is done manually because the _chunk_fetch_span was created with the intention of capturing only the duration of the request sent to get a chunk, and nothing else. Further calls are handled with other spans accordingly, and the outer span (the parent of the _chunk_fetch_span) still concludes when it has itself gone out of scope / been "dropped". In other words, manually "dropping" gives some more flexibility and precision with regards to the end of a span.
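The effect of a manual drop can be shown with a minimal sketch. The `Span` type here is a hypothetical stand-in for the real jaeger span (which, on drop, would hand the finished span off to the tracing agent); this toy version just records the order in which spans finish.

```rust
use std::cell::RefCell;

thread_local! {
    // Records the order in which spans finish (i.e. are dropped).
    static FINISHED: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

// Hypothetical stand-in for a jaeger span: in the real crate, Drop is
// where the finished span would be handed off for export.
struct Span {
    name: &'static str,
}

impl Span {
    fn new(name: &'static str) -> Self {
        Span { name }
    }
}

impl Drop for Span {
    fn drop(&mut self) {
        FINISHED.with(|f| f.borrow_mut().push(self.name));
    }
}

fn run_task() {
    let _outer = Span::new("run-fetch-chunk-task");

    let chunk_fetch = Span::new("chunk-fetch");
    // ... issue the chunk request and await the response here ...

    // Manual drop ends `chunk-fetch` precisely here, so it measures only
    // the request itself; `_outer` keeps running until the end of scope.
    drop(chunk_fetch);

    // ... validation, storing, and other work covered by `_outer` only ...
} // `_outer` is dropped (finished) here

fn main() {
    run_task();
    let order = FINISHED.with(|f| f.borrow().clone());
    println!("span finish order: {:?}", order);
}
```

Without the manual drop, both spans would end at the same point and the inner one would wrongly include the post-request work in its duration.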
@bredamatt Documenting that in a comment seems like it would be good! In general, manual drops should be accompanied by some justification IMO.
> By default spans are "dropped" when they go out of scope. However, it is not truly dropped (hence the ""). Under the hood, there are various ways a "dropped" span can be sent to a tracing backend from a tracing agent, like Grafana Tempo for example. Therefore, sometimes, you may want to drop a span to control how and when spans are being sent for further processing.
If I get this right, would it be a good idea to wrap the drop in something like send_span_to_XXX(span_thing: YYY) { drop(span_thing); } and then follow @mrcnski's comment about documentation?
Thanks for the explanation though!
@alexgparity yes, you are right at a high level. This is somewhat how it works under the hood, but obviously with more detail, as it depends on which tracing stack is used and how it is configured from a performance-tuning perspective: https://www.jaegertracing.io/docs/1.23/performance-tuning/
In our case, there is an external crate which handles how spans are passed to Grafana Tempo: the mick-jaeger crate, a wrapper around jaeger. Changing mick-jaeger is not really in scope for this PR, since we were already using mick-jaeger throughout all subsystems.
Looks nice from the quick glance. I haven't looked at the generated spans, but we can adjust them later if something looks off.
bot merge
* master: (28 commits) Remove years from copyright notes (#7034) Onchain scraper in `dispute-coordinator` will scrape `SCRAPED_FINALIZED_BLOCKS_COUNT` blocks before finality (#7013) PVF: Minor refactor in workers code (#7012) Expose WASM bulk memory extension in execution environment parameters (#7008) Co #13699: Remove old calls (#7003) Companion for paritytech/substrate#13811 (#6998) PR review rules, include all rs files except weights (#6990) Substrate companion: Remove deprecated batch verification (#6999) Added `origin` to config for `universal_origin` benchmark (#6986) Cache `SessionInfo` on new activated leaf in `dispute-distribution` (#6993) Update Substrate to fix Substrate companions (#6994) Consolidate subsystem spans so they are all children of the leaf-activated root span (#6458) Avoid redundant clone. (#6989) bump zombienet version (#6985) avoid triggering unwanted room_id for the release notifs (#6984) Add crowdloan to SafeCallFilter (#6903) Drop timers for new requests of active participations (#6974) Use `SIGTERM` instead of `SIGKILL` on PVF worker version mismatch (#6981) Tighter bound on asset types teleported so that weight is cheaper (#6980) staking miner: less aggresive submissions (#6978) ...
See: #6044
Consolidates all spans so they are children of an active leaf span (except for those spans which, by design, cannot be).
This allows us to trace the protocol execution from when a new leaf is activated until it is finalized.
A few changes had to be made to allow for this. This PR touches upon the following:
The current jaeger crate can create a new jaeger::Span with a traceID based on the candidate hash. However, the candidate hash is not available in every subsystem. Therefore, some spans have previously been orphan spans (i.e. spans without a leaf-activated parent). Some changes have been made so that a traceID tag can be created during child span creation, given a candidate hash. This allows consolidation under a shared parent span (the leaf-activated span).
The downside to tracing every subsystem under the leaf-activated span, rather than as separate traces (the root span is the start of a trace, hence every orphan span has a unique traceID), is that spans for individual subsystems inherit the traceID of the leaf-activated span. The tag was introduced to give a holistic view of how a candidate flows through the protocol across all subsystems, whilst still allowing us to search for spans using the traceID in logs.
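The idea of a candidate-derived traceID tag can be sketched as follows. This is a hypothetical illustration, not the actual polkadot-node-jaeger API: `trace_id_from_candidate` and the `add_string_tag` call mentioned in the comment are assumed names, and taking the first 16 bytes of the hash is an assumption made for the sketch.

```rust
// Hypothetical: derive a deterministic 128-bit traceID from a candidate
// hash, so a child span of the leaf-activated span can carry it as a
// searchable tag even though its real traceID is inherited from the parent.
fn trace_id_from_candidate(candidate_hash: &[u8; 32]) -> u128 {
    // Assumption: use the first 16 bytes of the candidate hash as the id.
    let mut bytes = [0u8; 16];
    bytes.copy_from_slice(&candidate_hash[..16]);
    u128::from_be_bytes(bytes)
}

fn main() {
    let candidate_hash = [0xab_u8; 32]; // placeholder candidate hash
    let trace_id = trace_id_from_candidate(&candidate_hash);
    // The tag would then be attached at child-span creation time, e.g.
    // something like span.add_string_tag("traceID", &trace_id.to_string()).
    println!("traceID tag: {:032x}", trace_id);
}
```

Because the derivation is deterministic, any subsystem that knows the candidate hash produces the same tag value, which is what makes the tag searchable across subsystems.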
Here is a screenshot of some of the more relevant spans from availability-distribution:
As can be seen, the fetch-task-config span has a traceID tag.
Similarly, for approval-voting it is possible to see the traceID on a processed wakeup:
More details on the approval-voting span, and how the traceID is used throughout spans can be seen here: