Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

walreceiver: lifecycle traced events without any context #3330

Closed
koivunej opened this issue Jan 13, 2023 · 3 comments · Fixed by #4402
Closed

walreceiver: lifecycle traced events without any context #3330

koivunej opened this issue Jan 13, 2023 · 3 comments · Fixed by #4402
Assignees
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver

Comments

@koivunej
Copy link
Member

Noted while watching the production logs on zenith-1-ps-2, because of an ongoing cleanup we now find randomly spaced plain timestamp INFO Connection cancelled which cannot be directly correlated to any timeline from

_ = connection_cancellation.cancelled() => info!("Connection cancelled"),

Right above previous callsite is similar timestamp INFO Walreceiver db connection closed, which also appears in logs:

Ok(()) => info!("Walreceiver db connection closed"),

Not saying these cannot be useful, but they need more context, but perhaps their overall usefulness should be considered.

Looking around however, in the same scope of handle_walreceiver_connection we have access to timeline and wal_source_connconf, so we should be able to add context by using #[instrument(...)] and propagating that at to the task_mgr::spawn'd future which will run both connection and be the context for the two lifecycle events.

It might be this is on already someone's list so this issue can be closed.

Cc: #3218

@SomeoneToIgnore SomeoneToIgnore self-assigned this Jan 16, 2023
@koivunej koivunej added a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver labels Feb 20, 2023
@shanyp
Copy link
Contributor

shanyp commented May 8, 2023

@koivunej was this addressed by wal_reciever PR that you merged?

@SomeoneToIgnore
Copy link
Contributor

#4090 was not supposed to address this issue, unless done accidentally.

@koivunej
Copy link
Member Author

koivunej commented May 8, 2023

Agreed, it was not supposed to be closed.

koivunej added a commit that referenced this issue Jun 5, 2023
walreceiver logs are a bit hard to understand because of partial span
usage, extra messages, ignored errors popping up as huge stacktraces.

Fixes #3330 (by spans, also demote info -> debug).

- arrange walreceivers spans into a hiearchy: 
    - `wal_connection_manager{tenant_id, timeline_id}` ->
      `connection{node_id}` -> `poller`
- unifies the error reporting inside `wal_receiver`:
- All ok errors are now `walreceiver connection handling ended: {e:#}`
- All unknown errors are still stacktraceful task_mgr reported errors
  with context `walreceiver connection handling failure`
- Remove `connect` special casing, was: `DB connection stream finished`
  for ok errors
- Remove `done replicating` special casing, was `Replication stream
  finished` for ok errors
- lowered log levels for (non-exhaustive list):
    - `WAL receiver manager started, connecting to broker` (at startup)
    - `WAL receiver shutdown requested, shutting down` (at shutdown)
    - `Connection manager loop ended, shutting down` (at shutdown)
    - `sender is dropped while join handle is still alive` (at lucky
      shutdown, see #2885)
    - `timeline entered terminal state {:?}, stopping wal connection manager
      loop` (at shutdown)
    - `connected!` (at startup)
- `Walreceiver db connection closed` (at disconnects?, was without span)
    - `Connection cancelled` (at shutdown, was without span)
- `observed timeline state change, new state is {new_state:?}` (never
  after Timeline::activate was made infallible)
- changed:
    - `Timeline dropped state updates sender, stopping wal connection
      manager loop`
    - was out of date; sender is not dropped but `Broken | Stopping` state
      transition
        - also made `debug!`
    - `Timeline dropped state updates sender before becoming active,
      stopping wal connection manager loop`
    - was out of date: sender is again not dropped but `Broken | Stopping`
      state transition
        - also made `debug!`
- log fixes:
    - stop double reporting panics via JoinError
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants