fix: don't report SIGTERM & friends to Sentry #587
Conversation
When Zinniad is killed via SIGTERM/SIGKILL/SIGINT, don't report the error to Sentry. These signal codes are triggered when somebody terminates the process from outside. It's not a problem in Zinnia, there is nothing we can do about this.

Before:

```
ERROR Spark crashed via signal SIGTERM
Reporting the problem to Sentry for inspection by the Station team.
Zinnia main loop ended
Starting zinniad with config CliArgs { /* ... */ }
```

After:

```
ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
Starting zinniad with config CliArgs { /* ... */ }
```

Signed-off-by: Miroslav Bajtoš <[email protected]>
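The essence of the change can be sketched as follows. This is a minimal illustration, not the actual Station Core implementation; the helper name and shape are hypothetical:

```javascript
// Sketch of the idea behind this fix (hypothetical helper, not the actual
// Station Core code): decide whether a child-process exit should be
// reported to Sentry based on how the process ended.
const TERMINATION_SIGNALS = ['SIGTERM', 'SIGKILL', 'SIGINT']

function shouldReportToSentry ({ exitCode, signalCode }) {
  // Terminations triggered by an outside signal are not Zinnia's fault;
  // treat them as regular exits, not crashes.
  if (signalCode && TERMINATION_SIGNALS.includes(signalCode)) return false
  // Any non-zero exit code still indicates a genuine crash worth reporting.
  return exitCode !== 0 && exitCode !== null
}
```

A crash with a non-zero exit code keeps getting reported; only signal-driven terminations are treated as regular exits.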
@juliangruber ready for the (final? 🤞🏻) review
```js
// Don't report this error to Sentry and don't print the stack trace to stderr,
// treat this as a regular exit (successful completion of the process).
// (Note that this event has already been logged via the `onActivity()` call above.)
return
```
Have you verified that Core properly shuts down here?
It does not shut down, it restarts Zinnia.
```
ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
Starting zinniad with config CliArgs { /* ... */ }
```
I'll double check this is still true.
Or did you mean something else?
Ah right! Yeah, it would be great if you could double-check this behavior.
Test scenario:

- Run `npm test` in one terminal tab
- After a bit, execute `killall zinniad` in the second tab
Console output from Station Core, the version from the `main` branch (e895f0b):
[03/09/2024, 17:07:21] ERROR Spark crashed via signal SIGTERM
Reporting the problem to Sentry for inspection by the Station team.
Zinnia main loop errored ExecaError: Command was killed with SIGTERM (Termination): /Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js
[2024-09-03T15:07:17Z INFO zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:07:17Z INFO lassie] Starting Lassie Daemon
[2024-09-03T15:07:17Z INFO lassie] Lassie Daemon is listening on port 50076
[2024-09-03T15:07:17Z INFO zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js
[2024-09-03T15:07:17Z INFO module:spark/main] Checking the current SPARK round...
[2024-09-03T15:07:17Z INFO module:spark/main] Fetching round details at location /rounds/meridian/0x8460766edc62b525fc1fa4d628fc79229dc73031/14429
[2024-09-03T15:07:17Z INFO module:spark/main] Current SPARK round: { roundId: "19853", startEpoch: "4235611" }
[2024-09-03T15:07:17Z INFO module:spark/main] 15 max tasks per round
[2024-09-03T15:07:17Z INFO module:spark/main] 1000 retrieval tasks
[2024-09-03T15:07:17Z INFO module:spark/main] randomness: a21aac13c9d47f6e86c33b2ff1cd969bdea81006269b025868744e3b7f7cb13e
[2024-09-03T15:07:17Z INFO module:spark/main] {
retrieval: {
cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",
minerId: "f0122215",
clients: [ "f02098873" ]
}
}
[2024-09-03T15:07:17Z INFO module:spark/main] Calling Filecoin JSON-RPC to get PeerId of miner f0122215
[2024-09-03T15:07:18Z INFO module:spark/main] Found peer id: 12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe
[2024-09-03T15:07:18Z INFO module:spark/main] Querying IPNI to find retrieval providers for bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg
[2024-09-03T15:07:18Z ERROR module:spark/main] IPNI query failed, HTTP response: 404
[2024-09-03T15:07:18Z INFO module:spark/main] Submitting measurement...
[2024-09-03T15:07:18Z INFO module:spark/main] {
sparkVersion: "1.14.2",
zinniaVersion: "0.20.2",
cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",
minerId: "f0122215",
clients: [ "f02098873" ],
timeout: false,
startAt: 2024-09-03T15:07:17.766Z,
firstByteAt: null,
endAt: null,
carTooLarge: false,
byteLength: 0,
carChecksum: null,
statusCode: null,
providerId: "12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe",
indexerResult: "ERROR_404",
participantAddress: "0x000000000000000000000000000000000000dEaD",
stationId: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac"
}
[2024-09-03T15:07:19Z INFO module:spark/main] Measurement submitted (id: 7414385119)
[2024-09-03T15:07:19Z INFO module:spark/main] Sleeping for 78 seconds before starting the next task...
{"type":"activity:started","module":null}
{"type":"jobs-completed","total":176,"modules":{"spark/main":176}}
{"type":"activity:info","module":"spark/main","message":"SPARK started reporting retrievals"}
{"type":"jobs-completed","total":177,"modules":{"spark/main":177}}
at getFinalError (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/return/final-error.js:6:9)
at makeError (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/return/result.js:108:16)
at getAsyncResult (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/methods/main-async.js:167:4)
at handlePromise (file:///Users/bajtos/src/filstation/core/node_modules/execa/lib/methods/main-async.js:150:17)
at async file:///Users/bajtos/src/filstation/core/lib/zinnia.js:165:9
at async catchChildProcessExit (file:///Users/bajtos/src/filstation/core/lib/zinnia.js:176:5)
at async Promise.all (index 1)
at async Module.run (file:///Users/bajtos/src/filstation/core/lib/zinnia.js:301:5)
at async Promise.all (index 0)
at async station (file:///Users/bajtos/src/filstation/core/commands/station.js:113:3) {
shortMessage: 'Command was killed with SIGTERM (Termination): /Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js',
command: '/Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js',
escapedCommand: '/Users/bajtos/src/filstation/core/modules/zinnia/zinniad spark/main.js',
cwd: '/Users/bajtos/src/filstation/core/.cache/sources',
durationMs: 4815.020833,
failed: true,
timedOut: false,
isCanceled: false,
isGracefullyCanceled: false,
isTerminated: true,
isMaxBuffer: false,
isForcefullyTerminated: false,
signal: 'SIGTERM',
signalDescription: 'Termination',
stdout: '{"type":"activity:started","module":null}\n' +
'{"type":"jobs-completed","total":176,"modules":{"spark/main":176}}\n' +
'{"type":"activity:info","module":"spark/main","message":"SPARK started reporting retrievals"}\n' +
'{"type":"jobs-completed","total":177,"modules":{"spark/main":177}}',
stderr: '[2024-09-03T15:07:17Z INFO zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }\n' +
'[2024-09-03T15:07:17Z INFO lassie] Starting Lassie Daemon\n' +
'[2024-09-03T15:07:17Z INFO lassie] Lassie Daemon is listening on port 50076\n' +
'[2024-09-03T15:07:17Z INFO zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Checking the current SPARK round...\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Fetching round details at location /rounds/meridian/0x8460766edc62b525fc1fa4d628fc79229dc73031/14429\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Current SPARK round: { roundId: "19853", startEpoch: "4235611" }\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] 15 max tasks per round\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] 1000 retrieval tasks\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] randomness: a21aac13c9d47f6e86c33b2ff1cd969bdea81006269b025868744e3b7f7cb13e\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] {\n' +
' retrieval: {\n' +
' cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
' minerId: "f0122215",\n' +
' clients: [ "f02098873" ]\n' +
' }\n' +
' }\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Calling Filecoin JSON-RPC to get PeerId of miner f0122215\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] Found peer id: 12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] Querying IPNI to find retrieval providers for bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg\n' +
'[2024-09-03T15:07:18Z ERROR module:spark/main] IPNI query failed, HTTP response: 404\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] Submitting measurement...\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] {\n' +
' sparkVersion: "1.14.2",\n' +
' zinniaVersion: "0.20.2",\n' +
' cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
' minerId: "f0122215",\n' +
' clients: [ "f02098873" ],\n' +
' timeout: false,\n' +
' startAt: 2024-09-03T15:07:17.766Z,\n' +
' firstByteAt: null,\n' +
' endAt: null,\n' +
' carTooLarge: false,\n' +
' byteLength: 0,\n' +
' carChecksum: null,\n' +
' statusCode: null,\n' +
' providerId: "12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe",\n' +
' indexerResult: "ERROR_404",\n' +
' participantAddress: "0x000000000000000000000000000000000000dEaD",\n' +
' stationId: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac"\n' +
' }\n' +
'[2024-09-03T15:07:19Z INFO module:spark/main] Measurement submitted (id: 7414385119)\n' +
'[2024-09-03T15:07:19Z INFO module:spark/main] Sleeping for 78 seconds before starting the next task...',
stdio: [
undefined,
'{"type":"activity:started","module":null}\n' +
'{"type":"jobs-completed","total":176,"modules":{"spark/main":176}}\n' +
'{"type":"activity:info","module":"spark/main","message":"SPARK started reporting retrievals"}\n' +
'{"type":"jobs-completed","total":177,"modules":{"spark/main":177}}',
'[2024-09-03T15:07:17Z INFO zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }\n' +
'[2024-09-03T15:07:17Z INFO lassie] Starting Lassie Daemon\n' +
'[2024-09-03T15:07:17Z INFO lassie] Lassie Daemon is listening on port 50076\n' +
'[2024-09-03T15:07:17Z INFO zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Checking the current SPARK round...\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Fetching round details at location /rounds/meridian/0x8460766edc62b525fc1fa4d628fc79229dc73031/14429\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Current SPARK round: { roundId: "19853", startEpoch: "4235611" }\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] 15 max tasks per round\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] 1000 retrieval tasks\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] randomness: a21aac13c9d47f6e86c33b2ff1cd969bdea81006269b025868744e3b7f7cb13e\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] {\n' +
' retrieval: {\n' +
' cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
' minerId: "f0122215",\n' +
' clients: [ "f02098873" ]\n' +
' }\n' +
' }\n' +
'[2024-09-03T15:07:17Z INFO module:spark/main] Calling Filecoin JSON-RPC to get PeerId of miner f0122215\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] Found peer id: 12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] Querying IPNI to find retrieval providers for bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg\n' +
'[2024-09-03T15:07:18Z ERROR module:spark/main] IPNI query failed, HTTP response: 404\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] Submitting measurement...\n' +
'[2024-09-03T15:07:18Z INFO module:spark/main] {\n' +
' sparkVersion: "1.14.2",\n' +
' zinniaVersion: "0.20.2",\n' +
' cid: "bafykbzacecwbt7xlwmmkyt3invwneklrb557ua2rnrxnuzjazyxasz2mabyjg",\n' +
' minerId: "f0122215",\n' +
' clients: [ "f02098873" ],\n' +
' timeout: false,\n' +
' startAt: 2024-09-03T15:07:17.766Z,\n' +
' firstByteAt: null,\n' +
' endAt: null,\n' +
' carTooLarge: false,\n' +
' byteLength: 0,\n' +
' carChecksum: null,\n' +
' statusCode: null,\n' +
' providerId: "12D3KooWPE6aMwnbvPdQkYeWQytGNz1LJZTU7jrySBhEazwA6aYe",\n' +
' indexerResult: "ERROR_404",\n' +
' participantAddress: "0x000000000000000000000000000000000000dEaD",\n' +
' stationId: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac"\n' +
' }\n' +
'[2024-09-03T15:07:19Z INFO module:spark/main] Measurement submitted (id: 7414385119)\n' +
'[2024-09-03T15:07:19Z INFO module:spark/main] Sleeping for 78 seconds before starting the next task...'
],
ipcOutput: [],
pipedFrom: [],
moduleName: 'spark',
exitReason: 'via signal SIGTERM'
}
[2024-09-03T15:07:21Z INFO zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:07:21Z INFO lassie] Starting Lassie Daemon
[2024-09-03T15:07:21Z INFO lassie] Lassie Daemon is listening on port 50085
[03/09/2024, 17:07:21] INFO Spark started
Console output from Station Core, the version from this pull request (5df27aa):
[03/09/2024, 17:09:38] ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
[2024-09-03T15:09:38Z INFO zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:09:38Z INFO lassie] Starting Lassie Daemon
[2024-09-03T15:09:38Z INFO lassie] Lassie Daemon is listening on port 50229
[2024-09-03T15:09:38Z INFO zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js
[03/09/2024, 17:09:38] INFO Spark started
In both cases, Zinnia restarts.
When I run `npm start` and then `kill <pid>`, Station Core exits immediately without logging any messages. I think that's because Station Core does not implement custom handling of SIGTERM/SIGINT signals.
The remaining question is what happens when the auto-updater kicks in and asks Zinnia to exit the process so that we can install the new version. I think this should be fine because we restart the main loop regardless of whether the child process exited or crashed.
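The restart behaviour I'm relying on can be sketched like this (hypothetical names, not the actual Station Core code): one run of the main loop is awaited, and whether it resolves (clean exit) or rejects (crash), the loop starts it again.

```javascript
// Minimal sketch of the restart behaviour discussed above (hypothetical
// names, not the actual Station Core code). One `runOnce()` call stands
// for a single Zinnia main-loop iteration; the loop restarts it regardless
// of whether it resolved or rejected.
async function runMainLoop (runOnce, { maxRestarts = Infinity } = {}) {
  let restarts = 0
  while (restarts < maxRestarts) {
    try {
      await runOnce()
      console.error('Zinnia main loop ended')
    } catch (err) {
      console.error('Zinnia main loop errored', err.message)
    }
    restarts++
  }
  return restarts
}
```

The `maxRestarts` bound exists only to make the sketch terminate; the real loop runs for the lifetime of the process.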
@juliangruber do you know how I can test this?
I tried the following change:
```diff
   try {
     await Promise.all([
+      timers.setTimeout(5_000).then(() => { console.log('FORCED RESTART'); controller.abort() }),
       runUpdateSourceFilesLoop({
         controller,
         signal,
         onActivity,
         moduleVersionsDir,
         moduleSourcesDir
       }),
       catchChildProcessExit({ childProcesses, onActivity, controller })
     ])
     console.error('Zinnia main loop ended')
```
Console log from running this version:
```
FORCED RESTART
[03/09/2024, 17:16:17] ERROR Spark crashed via signal SIGTERM
Zinnia main loop ended
[2024-09-03T15:16:17Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "0x000000000000000000000000000000000000dEaD", station_id: "302a300506032b65700321005c92a0dff7f4fad282bd8a2f111fd0e3f6e9f13a02315a0b8f02f874b8ffcdac", state_root: ".state/modules/zinnia", cache_root: ".cache/modules/zinnia", files: ["spark/main.js"] }
[2024-09-03T15:16:17Z INFO  lassie] Starting Lassie Daemon
[2024-09-03T15:16:17Z INFO  lassie] Lassie Daemon is listening on port 50683
[2024-09-03T15:16:17Z INFO  zinniad] Starting module file:///Users/bajtos/src/filstation/core/.cache/sources/spark/main.js
[03/09/2024, 17:16:17] INFO  Spark started
```
Based on the above, I think my change is safe to land.
I also think it's Station Core itself that kills Zinnia using SIGTERM, but the "exit/crash handler" then does not realise it was us 🤷🏻
Another possible improvement:
```diff
 const tasks = childProcesses.map(p => (async () => {
   try {
     await p
     onActivity({ type: 'info', message: `${capitalize(module)} exited` })
   } catch (err) {
+    if ((/** @type {AbortController} */(controller)).signal.aborted) {
+      // it was us who killed the process, ignore the error
+      return
+    }
     // When the child process crashes, attach the module name & the exit reason to the error object
     const exitReason = p.exitCode
       ? `with exit code ${p.exitCode}`
       : p.signalCode ? `via signal ${p.signalCode}` : undefined
     throw Object.assign(err, { moduleName: p.moduleName, exitReason, signalCode: p.signalCode })
   }
 })())
```
What to consider: We trigger a Zinnia restart when the module does not log any message in 5 minutes. The restart is triggered via the abort signal.
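That watchdog interaction can be sketched as follows (names are illustrative, not the actual Station Core code): if the module produces no activity for the timeout period, the shared controller is aborted, which kills the child processes and makes their exit look like an outside SIGTERM.

```javascript
// Hypothetical sketch of the inactivity watchdog described above (not the
// actual Station Core code): if the module produces no activity for
// `timeoutMs`, abort the controller to force a Zinnia restart.
function createActivityWatchdog (controller, timeoutMs = 5 * 60_000) {
  let timer = setTimeout(() => controller.abort(), timeoutMs)
  return {
    // Call this on every message logged by the module to reset the timer.
    onActivity () {
      clearTimeout(timer)
      timer = setTimeout(() => controller.abort(), timeoutMs)
    },
    // Cancel the watchdog, e.g. on clean shutdown.
    stop () {
      clearTimeout(timer)
    }
  }
}
```

This is exactly why the `controller.signal.aborted` check in the snippet above matters: a restart triggered by our own watchdog should not be reported as a crash.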
I think this part of the codebase is becoming too difficult to reason about. Unfortunately, I don't have the bandwidth to refactor it for more clarity.
Therefore, I propose to land my pull request as-is to prevent Station Core from creating Sentry issues when the auto-updater kills the module processes, and to defer any further improvements until there is a more pressing need for them.