-
Notifications
You must be signed in to change notification settings - Fork 488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
walreceiver: lifecycle traced events without any context #3330
Labels
Comments
@koivunej was this addressed by wal_reciever PR that you merged? |
#4090 was not supposed to address this issue, unless done accidentally. |
Agreed, it was not supposed to be closed. |
koivunej
added a commit
that referenced
this issue
Jun 5, 2023
walreceiver logs are a bit hard to understand because of partial span usage, extra messages, ignored errors popping up as huge stacktraces. Fixes #3330 (by spans, also demote info -> debug). - arrange walreceivers spans into a hiearchy: - `wal_connection_manager{tenant_id, timeline_id}` -> `connection{node_id}` -> `poller` - unifies the error reporting inside `wal_receiver`: - All ok errors are now `walreceiver connection handling ended: {e:#}` - All unknown errors are still stacktraceful task_mgr reported errors with context `walreceiver connection handling failure` - Remove `connect` special casing, was: `DB connection stream finished` for ok errors - Remove `done replicating` special casing, was `Replication stream finished` for ok errors - lowered log levels for (non-exhaustive list): - `WAL receiver manager started, connecting to broker` (at startup) - `WAL receiver shutdown requested, shutting down` (at shutdown) - `Connection manager loop ended, shutting down` (at shutdown) - `sender is dropped while join handle is still alive` (at lucky shutdown, see #2885) - `timeline entered terminal state {:?}, stopping wal connection manager loop` (at shutdown) - `connected!` (at startup) - `Walreceiver db connection closed` (at disconnects?, was without span) - `Connection cancelled` (at shutdown, was without span) - `observed timeline state change, new state is {new_state:?}` (never after Timeline::activate was made infallible) - changed: - `Timeline dropped state updates sender, stopping wal connection manager loop` - was out of date; sender is not dropped but `Broken | Stopping` state transition - also made `debug!` - `Timeline dropped state updates sender before becoming active, stopping wal connection manager loop` - was out of date: sender is again not dropped but `Broken | Stopping` state transition - also made `debug!` - log fixes: - stop double reporting panics via JoinError
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Noted while watching the production logs on
zenith-1-ps-2
, because of an ongoing cleanup we now find randomly spaced plaintimestamp INFO Connection cancelled
which cannot be directly correlated to any timeline fromneon/pageserver/src/walreceiver/walreceiver_connection.rs
Line 121 in 16baa91
Right above previous callsite is similar
timestamp INFO Walreceiver db connection closed
, which also appears in logs:neon/pageserver/src/walreceiver/walreceiver_connection.rs
Line 113 in 16baa91
Not saying these cannot be useful, but they need more context, but perhaps their overall usefulness should be considered.
Looking around however, in the same scope of
handle_walreceiver_connection
we have access totimeline
andwal_source_connconf
, so we should be able to add context by using#[instrument(...)]
and propagating that at to thetask_mgr::spawn
'd future which will run bothconnection
and be the context for the two lifecycle events.It might be this is on already someone's list so this issue can be closed.
Cc: #3218
The text was updated successfully, but these errors were encountered: