Consolidate subsystem spans so they are all children of the leaf-activated root span #6458
Conversation
- Add block db insertion span
- Add missing argument
- Add candidate-hash and traceID to check-and-import-approval span
node/network/availability-distribution/src/pov_requester/mod.rs
node/network/availability-distribution/src/requester/fetch_task/mod.rs
```rust
// Also spawn or bump tasks for candidates in ancestry in the same session.
for hash in std::iter::once(leaf).chain(ancestors_in_session) {
	let span = span
```
I don't think we need this kind of verbosity.
If we don't include a span here, how do you suggest we pass a span to add_cores() in the loop? Without passing a child span to add_cores(), there seems to be no way to trace the execution of FetchTask as a child span of the leaf-activated span.
I see what you mean. In that case, the span name sounds misleading.
I see that add_cores also creates an additional span. Maybe we can create the span there and just pass it a ref to the request-chunks span we created above.
Will take a look now :)
Agreed, the naming was off. I renamed this span and the one above as request-chunks-new-head and request-chunks-ancestor accordingly.
Regarding add_cores also creating an additional span:
- The add_cores call is invoked per ancestor hash. If I understood it correctly, we go [SESSION_ANCESTRY_LEN] back when requesting chunks, so we would need to associate a request-chunk-ancestor span with each ancestor.
- Inside add_cores, there is a loop over all the occupied cores, and for some of the occupied cores the entries are vacant. If an entry is vacant, a new FetchTaskConfig is created, and eventually that FetchTask will run and create the run-fetch-chunk-task span (and its children).
This way we have a relationship like this:
```
request-chunk-new-head
|
|__ request-chunk-ancestor (ancestor_0)
|   |
|   |__ check-fetch-candidate (occupied_core_0)
|   |__ check-fetch-candidate (occupied_core_1)
|   |__ fetch-task-config (when entry for occupied_core is vacant)
|   |   |__ run-fetch-chunk-task
|   |       |__ ++
|   |__ check-fetch-candidate (occupied_core_N)
|
|__ request-chunk-ancestor (ancestor_1)
    |__ check-fetch-candidate (occupied_core_0)
    |__ check-fetch-candidate (occupied_core_1)
    |__ check-fetch-candidate (occupied_core_N)
```
In this case, I think it makes the most sense to keep the additional span for the added context that it belongs to a particular ancestor.
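The nesting described above can be sketched with a toy Span type. This is hypothetical illustration only: the real jaeger span API differs, and `root`, `child`, and this `add_cores` signature are stand-ins chosen for the sketch, not the actual polkadot functions.

```rust
// Toy span that records its ancestry (root name down to its own name).
#[derive(Clone, Debug, PartialEq)]
struct Span {
    path: Vec<String>,
}

impl Span {
    fn root(name: &str) -> Self {
        Span { path: vec![name.to_string()] }
    }
    // Child spans extend the parent's path, mirroring parent/child nesting.
    fn child(&self, name: &str) -> Self {
        let mut path = self.path.clone();
        path.push(name.to_string());
        Span { path }
    }
}

// Mirrors the described add_cores behaviour: one child span per occupied
// core, all parented to the per-ancestor span passed in.
fn add_cores(parent: &Span, occupied_cores: &[&str]) -> Vec<Span> {
    occupied_cores
        .iter()
        .map(|core| parent.child(&format!("check-fetch-candidate ({core})")))
        .collect()
}

fn main() {
    let head = Span::root("request-chunks-new-head");
    for ancestor in ["ancestor_0", "ancestor_1"] {
        let ancestor_span = head.child(&format!("request-chunks-ancestor ({ancestor})"));
        for span in add_cores(&ancestor_span, &["occupied_core_0", "occupied_core_1"]) {
            println!("{}", span.path.join(" -> "));
        }
    }
}
```

Passing the per-ancestor span into add_cores is what ties every check-fetch-candidate span back to the leaf-activated root rather than leaving them as orphans.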
Got it! Thanks, it will be interesting to see anything related to what is slowing down PoV recovery.
Definitely!
Nice! Thanks @bredamatt . One last suggestion, but otherwise LGTM
```diff
@@ -281,6 +287,11 @@ impl RunningTask {
 				continue
 			},
 		};
+		drop(_chunk_fetch_span);
```
Why do you need to drop manually? Not just here but wondering in general.
By default, spans are "dropped" when they go out of scope. However, a span is not truly dropped at that point (hence the quotes). Under the hood, there are various ways a "dropped" span can be sent from a tracing agent to a tracing backend, like Grafana Tempo for example. Therefore, you may sometimes want to drop a span manually to control how and when spans are sent for further processing.
In general though, creating spans is a bit of an art, since it depends on how long one wants a particular span to be in scope, based on some higher-level understanding of how the codebase works. In this case, the drop is done manually because the _chunk_fetch_span was created with the intention of capturing only the duration of the request sent to get a chunk, and nothing else. Further calls are handled with other spans accordingly, and the outer span (the parent of the _chunk_fetch_span) still concludes when it has itself gone out of scope / been "dropped". In other words, manually "dropping" gives some more flexibility and precision with regards to the end of a span.
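The effect of a manual drop can be shown with a minimal sketch. The `Span` type here is a hypothetical stand-in for the real jaeger span (which, on drop, would hand the finished span off to the tracing agent); this toy version just records the order in which spans finish.

```rust
use std::cell::RefCell;

thread_local! {
    // Records the order in which spans finish (i.e. are dropped).
    static FINISHED: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

// Hypothetical stand-in for a jaeger span: in the real crate, Drop is
// where the finished span would be handed off for export.
struct Span {
    name: &'static str,
}

impl Span {
    fn new(name: &'static str) -> Self {
        Span { name }
    }
}

impl Drop for Span {
    fn drop(&mut self) {
        FINISHED.with(|f| f.borrow_mut().push(self.name));
    }
}

fn run_task() {
    let _outer = Span::new("run-fetch-chunk-task");

    let chunk_fetch = Span::new("chunk-fetch");
    // ... issue the chunk request and await the response here ...

    // Manual drop ends `chunk-fetch` precisely here, so it measures only
    // the request itself; `_outer` keeps running until the end of scope.
    drop(chunk_fetch);

    // ... validation, storing, and other work covered by `_outer` only ...
} // `_outer` is dropped (finished) here

fn main() {
    run_task();
    let order = FINISHED.with(|f| f.borrow().clone());
    println!("span finish order: {:?}", order);
}
```

Without the manual drop, both spans would end at the same point and the inner one would wrongly include the post-request work in its duration.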
@bredamatt Documenting that in a comment seems like it would be good! In general, manual drops should be accompanied by some justification IMO.
> By default spans are "dropped" when they go out of scope. However, it is not truly dropped (hence the ""). Under the hood, there are various ways a "dropped" span can be sent to a tracing backend from a tracing agent, like Grafana Tempo for example. Therefore, sometimes, you may want to drop a span to control how and when spans are being sent for further processing.
If I get this right, would it be a good idea to wrap the drop in something like send_span_to_XXX(span_thing: YYY) { drop(span_thing); } and then follow @mrcnski's comment about documentation?
Thanks for the explanation though!
@alexgparity yes, you are right at a high level. This is somewhat how it works under the hood, but obviously with more detail, as it depends on which tracing stack is used and how it is configured from a performance-tuning perspective: https://www.jaegertracing.io/docs/1.23/performance-tuning/
In our case, there is an external crate which handles how spans are passed to Grafana Tempo: the mick-jaeger crate, a wrapper around jaeger. Changing mick-jaeger is not really in scope for this PR, since we were already using mick-jaeger throughout all subsystems.
Looks nice from the quick glance. I haven't looked at the generated spans, but we can adjust them later if something looks off.
bot merge
* master: (28 commits) Remove years from copyright notes (#7034) Onchain scraper in `dispute-coordinator` will scrape `SCRAPED_FINALIZED_BLOCKS_COUNT` blocks before finality (#7013) PVF: Minor refactor in workers code (#7012) Expose WASM bulk memory extension in execution environment parameters (#7008) Co #13699: Remove old calls (#7003) Companion for paritytech/substrate#13811 (#6998) PR review rules, include all rs files except weights (#6990) Substrate companion: Remove deprecated batch verification (#6999) Added `origin` to config for `universal_origin` benchmark (#6986) Cache `SessionInfo` on new activated leaf in `dispute-distribution` (#6993) Update Substrate to fix Substrate companions (#6994) Consolidate subsystem spans so they are all children of the leaf-activated root span (#6458) Avoid redundant clone. (#6989) bump zombienet version (#6985) avoid triggering unwanted room_id for the release notifs (#6984) Add crowdloan to SafeCallFilter (#6903) Drop timers for new requests of active participations (#6974) Use `SIGTERM` instead of `SIGKILL` on PVF worker version mismatch (#6981) Tighter bound on asset types teleported so that weight is cheaper (#6980) staking miner: less aggresive submissions (#6978) ...
See: #6044
Consolidates all spans so they are children of an active leaf span (except for those spans which, by design, cannot be).
This allows us to trace the protocol execution from when a new leaf is activated until it is finalized.
A few changes had to be made to allow for this. This PR touches upon the following:
The current jaeger crate can create a new jaeger::Span with a traceID based on the candidate hash. However, the candidate hash is not available in every subsystem. Therefore, some spans have previously been orphan spans (i.e. spans without a leaf-activated parent). Some changes have been made so that a traceID tag can be created during child span creation, given a candidate hash. This allows consolidation under a shared parent span (the leaf-activated span).
The downside to tracing every subsystem under the leaf-activated span, rather than as separate traces (the root span is the start of a trace, hence every orphan span has a unique traceID), is that spans for individual subsystems inherit the traceID of the leaf-activated span. The tag was introduced to give a holistic view of how a candidate flows through the protocol across all subsystems, whilst still allowing us to search for spans using the traceID in logs.
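The idea of a candidate-derived traceID tag can be sketched as follows. This is a hypothetical illustration, not the actual polkadot-node-jaeger API: `trace_id_from_candidate` and the `add_string_tag` call mentioned in the comment are assumed names, and taking the first 16 bytes of the hash is an assumption made for the sketch.

```rust
// Hypothetical: derive a deterministic 128-bit traceID from a candidate
// hash, so a child span of the leaf-activated span can carry it as a
// searchable tag even though its real traceID is inherited from the parent.
fn trace_id_from_candidate(candidate_hash: &[u8; 32]) -> u128 {
    // Assumption: use the first 16 bytes of the candidate hash as the id.
    let mut bytes = [0u8; 16];
    bytes.copy_from_slice(&candidate_hash[..16]);
    u128::from_be_bytes(bytes)
}

fn main() {
    let candidate_hash = [0xab_u8; 32]; // placeholder candidate hash
    let trace_id = trace_id_from_candidate(&candidate_hash);
    // The tag would then be attached at child-span creation time, e.g.
    // something like span.add_string_tag("traceID", &trace_id.to_string()).
    println!("traceID tag: {:032x}", trace_id);
}
```

Because the derivation is deterministic, any subsystem that knows the candidate hash produces the same tag value, which is what makes the tag searchable across subsystems.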
Here is a screenshot of some of the more relevant spans from availability-distribution:
As can be seen, the fetch-task-config span has a traceID tag.
Similarly, for approval-voting it is possible to see the traceID on a processed wakeup:
More details on the approval-voting span, and how the traceID is used throughout spans can be seen here: