
fix: don't report SIGTERM & friends to Sentry #587

Merged
merged 6 commits into from
Sep 5, 2024

Conversation

bajtos
Member

@bajtos bajtos commented Sep 2, 2024

When Zinniad is killed via SIGTERM/SIGKILL/SIGINT, don't report the error to Sentry.

These signals are raised when somebody terminates the process from the outside. That is not a problem in Zinnia; there is nothing we can do about it.

Before:

```
ERROR Spark crashed via signal SIGTERM
Reporting the problem to Sentry for inspection by the Station team.
Zinnia main loop ended
Starting zinniad with config CliArgs { /* ... */ }
```

After:

```
ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
Starting zinniad with config CliArgs { /* ... */ }
```
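For illustration, the filtering described above could be sketched like this. This is a hypothetical sketch, not the actual patch; `TERMINATION_SIGNALS` and `shouldReportToSentry` are invented names:

```javascript
// Hypothetical sketch: when the child process was killed by an external
// termination signal, treat it as a regular exit instead of reporting
// the error to Sentry.
const TERMINATION_SIGNALS = ['SIGTERM', 'SIGKILL', 'SIGINT']

function shouldReportToSentry (err) {
  // Execa-style errors carry the signal that killed the process
  const signal = err.signal ?? err.signalCode
  return !TERMINATION_SIGNALS.includes(signal)
}
```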

The Sentry issue:


Signed-off-by: Miroslav Bajtoš <[email protected]>
@bajtos bajtos requested a review from juliangruber September 2, 2024 11:41
@bajtos bajtos requested a review from juliangruber September 2, 2024 13:46
@bajtos
Member Author

bajtos commented Sep 2, 2024

@juliangruber ready for the (final?🤞🏻) review

```
// Don't report this error to Sentry and don't print the stack trace to stderr,
// treat this as a regular exit (successful completion of the process).
// (Note that this event has been already logged via `onActivity()` call above.)
return
```
Member
Have you verified that Core properly shuts down here?

Member Author

It does not shut down; it restarts Zinnia.

```
ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
Starting zinniad with config CliArgs { /* ... */ }
```

I'll double check this is still true.

Or did you mean something else?

Member

Ah right! Yeah, it would be great if you could double-check this behavior.

Member Author

Test scenario:

  1. Run `npm test` in one terminal tab
  2. After a bit, execute `killall zinniad` in the second tab

Console output from Station Core - the version from the main branch (e895f0b):

[03/09/2024, 17:07:21] ERROR Spark crashed via signal SIGTERM
Reporting the problem to Sentry for inspection by the Station team.
Zinnia main loop errored ExecaError: Command was killed with SIGTERM (Termination): /Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js

[2024-09-03T15:07:17Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:07:17Z INFO  lassie] Starting Lassie Daemon
[2024-09-03T15:07:17Z INFO  lassie] Lassie Daemon is listening on port 50076
[2024-09-03T15:07:17Z INFO  zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js
[2024-09-03T15:07:17Z INFO  module:spark/main] Checking the current SPARK round...
[2024-09-03T15:07:17Z INFO  module:spark/main] Fetching round details at location /rounds/meridian/0x8460766edc62b525fc1fa4d628fc79229dc73031/14429
[2024-09-03T15:07:17Z INFO  module:spark/main] Current SPARK round: { roundId: "19853", startEpoch: "4235611" }
[2024-09-03T15:07:17Z INFO  module:spark/main]   15 max tasks per round
[2024-09-03T15:07:17Z INFO  module:spark/main]   1000 retrieval tasks
[2024-09-03T15:07:17Z INFO  module:spark/main]   randomness: a21aac13c9d47f6e86c33b2ff1cd969bdea81006269b025868744e3b7f7cb13e
[2024-09-03T15:07:17Z INFO  module:spark/main] {
      retrieval: {
        cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",
        minerId: "f0122215",
        clients: [ "f02098873" ]
      }
    }
[2024-09-03T15:07:17Z INFO  module:spark/main] Calling Filecoin JSON-RPC to get PeerId of miner f0122215
[2024-09-03T15:07:18Z INFO  module:spark/main] Found peer id: 12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe
[2024-09-03T15:07:18Z INFO  module:spark/main] Querying IPNI to find retrieval providers for bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg
[2024-09-03T15:07:18Z ERROR module:spark/main] IPNI query failed, HTTP response: 404
[2024-09-03T15:07:18Z INFO  module:spark/main] Submitting measurement...
[2024-09-03T15:07:18Z INFO  module:spark/main] {
      sparkVersion: "1.14.2",
      zinniaVersion: "0.20.2",
      cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",
      minerId: "f0122215",
      clients: [ "f02098873" ],
      timeout: false,
      startAt: 2024-09-03T15:07:17.766Z,
      firstByteAt: null,
      endAt: null,
      carTooLarge: false,
      byteLength: 0,
      carChecksum: null,
      statusCode: null,
      providerId: "12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe",
      indexerResult: "ERROR_404",
      participantAddress: "0x000000000000000000000000000000000000dEaD",
      stationId: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac"
    }
[2024-09-03T15:07:19Z INFO  module:spark/main] Measurement submitted (id: 7414385119)
[2024-09-03T15:07:19Z INFO  module:spark/main] Sleeping for 78 seconds before starting the next task...

{"type":"activity:started","module":null}
{"type":"jobs-completed","total":176,"modules":{"spark/main":176}}
{"type":"activity:info","module":"spark/main","message":"SPARK started reporting retrievals"}
{"type":"jobs-completed","total":177,"modules":{"spark/main":177}}
    at getFinalError (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/return/final-error.js:6:9)
    at makeError (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/return/result.js:108:16)
    at getAsyncResult (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/methods/main-async.js:167:4)
    at handlePromise (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/methods/main-async.js:150:17)
    at async file:///Users/bajtos/src/filstation/core/lib/zinnia.js:165:9
    at async catchChildProcessExit (file:///Users/bajtos/src/filstation/core/lib/zinnia.js:176:5)
    at async Promise.all (index 1)
    at async Module.run (file:///Users/bajtos/src/filstation/core/lib/zinnia.js:301:5)
    at async Promise.all (index 0)
    at async station (file:///Users/bajtos/src/filstation/core/commands/station.js:113:3) {
  shortMessage: 'Command was killed with SIGTERM (Termination): /Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js',
  command: '/Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js',
  escapedCommand: '/Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js',
  cwd: '/Users/bajtos/src/filstation/core/.cache/sources',
  durationMs: 4815.020833,
  failed: true,
  timedOut: false,
  isCanceled: false,
  isGracefullyCanceled: false,
  isTerminated: true,
  isMaxBuffer: false,
  isForcefullyTerminated: false,
  signal: 'SIGTERM',
  signalDescription: 'Termination',
  stdout: '{"type":"activity:started","module":null}\n' +
    '{"type":"jobs-completed","total":176,"modules":{"spark/main":176}}\n' +
    '{"type":"activity:info","module":"spark/main","message":"SPARK started reporting retrievals"}\n' +
    '{"type":"jobs-completed","total":177,"modules":{"spark/main":177}}',
  stderr: '[2024-09-03T15:07:17Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }\n' +
    '[2024-09-03T15:07:17Z INFO  lassie] Starting Lassie Daemon\n' +
    '[2024-09-03T15:07:17Z INFO  lassie] Lassie Daemon is listening on port 50076\n' +
    '[2024-09-03T15:07:17Z INFO  zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main] Checking the current SPARK round...\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main] Fetching round details at location /rounds/meridian/0x8460766edc62b525fc1fa4d628fc79229dc73031/14429\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main] Current SPARK round: { roundId: "19853", startEpoch: "4235611" }\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main]   15 max tasks per round\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main]   1000 retrieval tasks\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main]   randomness: a21aac13c9d47f6e86c33b2ff1cd969bdea81006269b025868744e3b7f7cb13e\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main] {\n' +
    '      retrieval: {\n' +
    '        cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
    '        minerId: "f0122215",\n' +
    '        clients: [ "f02098873" ]\n' +
    '      }\n' +
    '    }\n' +
    '[2024-09-03T15:07:17Z INFO  module:spark/main] Calling Filecoin JSON-RPC to get PeerId of miner f0122215\n' +
    '[2024-09-03T15:07:18Z INFO  module:spark/main] Found peer id: 12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe\n' +
    '[2024-09-03T15:07:18Z INFO  module:spark/main] Querying IPNI to find retrieval providers for bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg\n' +
    '[2024-09-03T15:07:18Z ERROR module:spark/main] IPNI query failed, HTTP response: 404\n' +
    '[2024-09-03T15:07:18Z INFO  module:spark/main] Submitting measurement...\n' +
    '[2024-09-03T15:07:18Z INFO  module:spark/main] {\n' +
    '      sparkVersion: "1.14.2",\n' +
    '      zinniaVersion: "0.20.2",\n' +
    '      cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
    '      minerId: "f0122215",\n' +
    '      clients: [ "f02098873" ],\n' +
    '      timeout: false,\n' +
    '      startAt: 2024-09-03T15:07:17.766Z,\n' +
    '      firstByteAt: null,\n' +
    '      endAt: null,\n' +
    '      carTooLarge: false,\n' +
    '      byteLength: 0,\n' +
    '      carChecksum: null,\n' +
    '      statusCode: null,\n' +
    '      providerId: "12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe",\n' +
    '      indexerResult: "ERROR_404",\n' +
    '      participantAddress: "0x000000000000000000000000000000000000dEaD",\n' +
    '      stationId: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac"\n' +
    '    }\n' +
    '[2024-09-03T15:07:19Z INFO  module:spark/main] Measurement submitted (id: 7414385119)\n' +
    '[2024-09-03T15:07:19Z INFO  module:spark/main] Sleeping for 78 seconds before starting the next task...',
  stdio: [
    undefined,
    '{"type":"activity:started","module":null}\n' +
      '{"type":"jobs-completed","total":176,"modules":{"spark/main":176}}\n' +
      '{"type":"activity:info","module":"spark/main","message":"SPARK started reporting retrievals"}\n' +
      '{"type":"jobs-completed","total":177,"modules":{"spark/main":177}}',
    '[2024-09-03T15:07:17Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }\n' +
      '[2024-09-03T15:07:17Z INFO  lassie] Starting Lassie Daemon\n' +
      '[2024-09-03T15:07:17Z INFO  lassie] Lassie Daemon is listening on port 50076\n' +
      '[2024-09-03T15:07:17Z INFO  zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main] Checking the current SPARK round...\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main] Fetching round details at location /rounds/meridian/0x8460766edc62b525fc1fa4d628fc79229dc73031/14429\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main] Current SPARK round: { roundId: "19853", startEpoch: "4235611" }\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main]   15 max tasks per round\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main]   1000 retrieval tasks\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main]   randomness: a21aac13c9d47f6e86c33b2ff1cd969bdea81006269b025868744e3b7f7cb13e\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main] {\n' +
      '      retrieval: {\n' +
      '        cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
      '        minerId: "f0122215",\n' +
      '        clients: [ "f02098873" ]\n' +
      '      }\n' +
      '    }\n' +
      '[2024-09-03T15:07:17Z INFO  module:spark/main] Calling Filecoin JSON-RPC to get PeerId of miner f0122215\n' +
      '[2024-09-03T15:07:18Z INFO  module:spark/main] Found peer id: 12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe\n' +
      '[2024-09-03T15:07:18Z INFO  module:spark/main] Querying IPNI to find retrieval providers for bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg\n' +
      '[2024-09-03T15:07:18Z ERROR module:spark/main] IPNI query failed, HTTP response: 404\n' +
      '[2024-09-03T15:07:18Z INFO  module:spark/main] Submitting measurement...\n' +
      '[2024-09-03T15:07:18Z INFO  module:spark/main] {\n' +
      '      sparkVersion: "1.14.2",\n' +
      '      zinniaVersion: "0.20.2",\n' +
      '      cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
      '      minerId: "f0122215",\n' +
      '      clients: [ "f02098873" ],\n' +
      '      timeout: false,\n' +
      '      startAt: 2024-09-03T15:07:17.766Z,\n' +
      '      firstByteAt: null,\n' +
      '      endAt: null,\n' +
      '      carTooLarge: false,\n' +
      '      byteLength: 0,\n' +
      '      carChecksum: null,\n' +
      '      statusCode: null,\n' +
      '      providerId: "12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe",\n' +
      '      indexerResult: "ERROR_404",\n' +
      '      participantAddress: "0x000000000000000000000000000000000000dEaD",\n' +
      '      stationId: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac"\n' +
      '    }\n' +
      '[2024-09-03T15:07:19Z INFO  module:spark/main] Measurement submitted (id: 7414385119)\n' +
      '[2024-09-03T15:07:19Z INFO  module:spark/main] Sleeping for 78 seconds before starting the next task...'
  ],
  ipcOutput: [],
  pipedFrom: [],
  moduleName: 'spark',
  exitReason: 'via signal SIGTERM'
}
[2024-09-03T15:07:21Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:07:21Z INFO  lassie] Starting Lassie Daemon
[2024-09-03T15:07:21Z INFO  lassie] Lassie Daemon is listening on port 50085
[03/09/2024, 17:07:21] INFO  Spark started

Console output from Station Core - the version from this pull request (5df27aa):

```
[03/09/2024, 17:09:38] ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
[2024-09-03T15:09:38Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:09:38Z INFO  lassie] Starting Lassie Daemon
[2024-09-03T15:09:38Z INFO  lassie] Lassie Daemon is listening on port 50229
[2024-09-03T15:09:38Z INFO  zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js
[03/09/2024, 17:09:38] INFO  Spark started
```

Member Author

In both cases, Zinnia restarts.

When I run `npm start` and then `kill <pid>`, Station Core exits immediately without logged messages. I think that's because Station Core does not implement custom handling of SIGTERM/SIGINT signals.

The remaining question is what happens when the auto-updater kicks in and asks Zinnia to exit the process so that we can install the new version. I think this should be fine because we restart the main loop regardless of whether the child process exited or crashed.

https://github.com/filecoin-station/core/blob/e895f0be37f9fe8c0967b9d4048cd20c260e8419/lib/zinnia.js#L300-L321
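The restart behaviour discussed above can be sketched roughly like this. This is a hypothetical simplification of the linked loop, not the actual `lib/zinnia.js` code; `startZinnia` is a stand-in for launching the zinniad child process:

```javascript
// Hypothetical sketch: the outer loop restarts the child regardless of
// whether it exited cleanly or crashed, which is why an auto-updater
// induced exit still leads to a fresh zinniad process.
async function runMainLoop (startZinnia, { maxRestarts = Infinity } = {}) {
  for (let i = 0; i < maxRestarts; i++) {
    try {
      // resolves or rejects when the child process exits for any reason
      await startZinnia()
    } catch (err) {
      console.error('Zinnia main loop errored', err.message)
    }
    console.error('Zinnia main loop ended')
  }
}
```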

@juliangruber do you know how I can test this?

I tried the following change:

```diff
  try {
    await Promise.all([
+     timers.setTimeout(5_000).then(() => { console.log('FORCED RESTART'); controller.abort() }),
      runUpdateSourceFilesLoop({
        controller,
        signal,
        onActivity,
        moduleVersionsDir,
        moduleSourcesDir
      }),
      catchChildProcessExit({ childProcesses, onActivity, controller })
    ])
    console.error('Zinnia main loop ended')
```

Console log from running this version:

```
FORCED RESTART
[03/09/2024, 17:16:17] ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
[2024-09-03T15:16:17Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:16:17Z INFO  lassie] Starting Lassie Daemon
[2024-09-03T15:16:17Z INFO  lassie] Lassie Daemon is listening on port 50683
[2024-09-03T15:16:17Z INFO  zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js
[03/09/2024, 17:16:17] INFO  Spark started
```

Based on the above, I think my change is safe to land.

I also think it's Station Core that kills Zinnia using SIGTERM, but then, in the "exit/crash handler", it does not realise it was us 🤷🏻

Member Author

Another possible improvement:

```diff
    const tasks = childProcesses.map(p => (async () => {
      try {
        await p
        onActivity({ type: 'info', message: `${capitalize(module)} exited` })
      } catch (err) {
+       if ((/** @type {AbortController} */(controller)).signal.aborted) {
+         // it was us who killed the process, ignore the error
+         return
+       }
        // When the child process crashes, attach the module name & the exit reason to the error object
        const exitReason = p.exitCode
          ? `with exit code ${p.exitCode}`
          : p.signalCode ? `via signal ${p.signalCode}` : undefined
        throw Object.assign(err, { moduleName: p.moduleName, exitReason, signalCode: p.signalCode })
      }
    })())
```

What to consider: we trigger a Zinnia restart when the module does not log any message for 5 minutes. That restart is also triggered via the abort signal.
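That inactivity watchdog could be sketched like this. This is hypothetical; `createActivityWatchdog` is not the actual Station Core API:

```javascript
// Hypothetical sketch of the inactivity watchdog: if no activity is
// reported within the timeout, abort the controller to force a restart.
// Note this uses the same abort signal as a deliberate shutdown, which is
// why "aborted" alone cannot distinguish the two cases.
function createActivityWatchdog (controller, timeoutMs = 5 * 60_000) {
  let timer = setTimeout(() => controller.abort(), timeoutMs)
  return function onActivity () {
    // each logged message pushes the restart deadline further out
    clearTimeout(timer)
    timer = setTimeout(() => controller.abort(), timeoutMs)
  }
}
```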

I think this part of the codebase is becoming too difficult to reason about. Unfortunately, I don't have the bandwidth to refactor it for more clarity.

Therefore, I propose to land my pull request as-is to prevent Station Core from creating Sentry issues when the auto-updater kills the module processes and defer any further improvements until there is more pressing need for them.


@bajtos bajtos enabled auto-merge (squash) September 5, 2024 14:10
@bajtos bajtos merged commit 8d940b4 into main Sep 5, 2024
18 checks passed
@bajtos bajtos deleted the ignore-zinnia-sigterm-exit branch September 5, 2024 14:32