Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: SystemD will not restart a crashed collector daemon #23

Merged
merged 1 commit into from
Jul 6, 2018

Conversation

Dombo
Copy link
Contributor

@Dombo Dombo commented Jul 6, 2018

Irony.

@lfittl
Copy link
Member

lfittl commented Jul 6, 2018

@Dombo Good catch, thanks Dom!

@lfittl lfittl merged commit 1d9ceea into pganalyze:master Jul 6, 2018
@Dombo
Copy link
Contributor Author

Dombo commented Jul 6, 2018

sudo journalctl -xe -u pganalyze-collector.service

Jun 24 13:36:03 ip-10-0-21-2 pganalyze-collector[9101]: I [tanda-apac] Submitted compact activity snapshot successfully
Jun 24 13:36:34 ip-10-0-21-2 pganalyze-collector[9101]: E [tanda-apac] Error uploading to S3: Post https://pganalyze-snapshots-production.s3.amazonaws.com: net/http: TLS handshake timeout
Jun 24 13:36:38 ip-10-0-21-2 pganalyze-collector[9101]: E [tanda-apac] Could not collect activity for server: failed to upload/send activity snapshot: Post https://pganalyze-snapshots-production.s3.amazonaws.com: net/http: TLS handshake ti
Jun 24 13:36:55 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Main process exited, code=killed, status=9/KILL
Jun 24 13:36:55 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Unit entered failed state.
Jun 24 13:36:55 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Failed with result 'signal'.

After a manual restart.

Jul 06 00:11:42 ip-10-0-21-2 systemd[1]: Stopped Statistics collector for the pganalyze database monitoring service.
-- Subject: Unit pganalyze-collector.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit pganalyze-collector.service has finished shutting down.
Jul 06 00:11:42 ip-10-0-21-2 systemd[1]: Started Statistics collector for the pganalyze database monitoring service.
-- Subject: Unit pganalyze-collector.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit pganalyze-collector.service has finished starting up.
-- 

@Dombo
Copy link
Contributor Author

Dombo commented Oct 4, 2018

We're back!

Sep 27 14:02:17 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Main process exited, code=killed, status=9/KILL
Sep 27 14:02:17 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Unit entered failed state.
Sep 27 14:02:17 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Failed with result 'signal'.
Oct 04 03:44:55 ip-10-0-21-2 systemd[1]: [/lib/systemd/system/pganalyze-collector.service:13] Memory limit '256MB' invalid. Ignoring.
Oct 04 03:44:55 ip-10-0-21-2 systemd[1]: [/lib/systemd/system/pganalyze-collector.service:14] Failed to parse service restart specifier, ignoring: Always
Oct 04 03:44:55 ip-10-0-21-2 systemd[1]: Stopped Statistics collector for the pganalyze database monitoring service.

I think this should be updated to reflect:

MemoryLimit=256M
Restart=always

The capital was my bad - the MB should also be fixed though.

I think there are some unhandled exceptions in the app code.

@lfittl
Copy link
Member

lfittl commented Oct 4, 2018

Good catch - want to make a quick PR to fix that service file?

Are you seeing any backtrace in the logs, or just that it gets killed?

@Dombo
Copy link
Contributor Author

Dombo commented Oct 15, 2018

I had a dig through the logs - looks like system memory killer got the daemon.

The fixed memorylimit directive should stop this from occurring in theory.

@Dombo
Copy link
Contributor Author

Dombo commented Oct 16, 2018

Oct 15 04:21:11 ip-10-0-21-2 pganalyze-collector[17658]: I [tanda-apac] Submitted compact activity snapshot successfully
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: fatal error: runtime: out of memory
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime stack:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.throw(0xc60377, 0x16)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/panic.go:619 +0x81
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.sysMap(0xc4444a0000, 0x3c30000, 0x0, 0x14397d8)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/mem_linux.go:216 +0x20a
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.(*mheap).sysAlloc(0x1420300, 0x3c30000, 0x8dc5b7)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/malloc.go:470 +0xd4
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.(*mheap).grow(0x1420300, 0x1e14, 0x0)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/mheap.go:907 +0x60
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.(*mheap).allocSpanLocked(0x1420300, 0x1e14, 0x14397e8, 0x7f93e016f678)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/mheap.go:820 +0x301
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.(*mheap).alloc_m(0x1420300, 0x1e14, 0x101, 0x0)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/mheap.go:686 +0x118
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.(*mheap).alloc.func1()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/mheap.go:753 +0x4d
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.(*mheap).alloc(0x1420300, 0x1e14, 0x7f93da000101, 0x42a0e0)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/mheap.go:752 +0x8a
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.largeAlloc(0x3c28000, 0x450100, 0xc44146e000)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/malloc.go:826 +0x94
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.mallocgc.func1()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/malloc.go:721 +0x46
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.systemstack(0x7f9300000000)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/asm_amd64.s:409 +0x79
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.mstart()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/proc.go:1170
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: goroutine 687 [running]:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.systemstack_switch()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/asm_amd64.s:363 fp=0xc42164f748 sp=0xc42164f740 pc=0x45cce0
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.mallocgc(0x3c28000, 0x0, 0xc424382000, 0xc438760000)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/malloc.go:720 +0x8a2 fp=0xc42164f7e8 sp=0xc42164f748 pc=0x416dd2
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.growslice(0xb00600, 0xc44146e000, 0x3020000, 0x3020000, 0x3028000, 0x8000, 0x2680000, 0x3020000)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/slice.go:172 +0x21d fp=0xc42164f850 sp=0xc42164f7e8 pc=0x446dbd
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: github.com/pganalyze/collector/output.(*bytesReadWriteSeeker).Write(0xc421a74000, 0xc424382000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/output/upload_logfile_util.go:114 +0xf7 fp=0xc42164f8b8 sp=0xc42164f850 pc=0x8dc5b7
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: io.copyBuffer(0xd4baa0, 0xc421a74000, 0xd4bc80, 0xc429952000, 0xc424382000, 0x8000, 0x8000, 0xc42017c060, 0xd4bc80, 0xc429952000)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/io/io.go:402 +0x240 fp=0xc42164f928 sp=0xc42164f8b8 pc=0x476ed0
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: io.Copy(0xd4baa0, 0xc421a74000, 0xd4bc80, 0xc429952000, 0xc429952000, 0x0, 0x0)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/io/io.go:362 +0x5a fp=0xc42164f988 sp=0xc42164f928 pc=0x476b5a
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: github.com/pganalyze/collector/output.EncryptAndUploadLogfiles(0xc4203183c0, 0x32, 0xc420190150, 0xc4200f4d20, 0xe0, 0xc4202b25f0, 0x4b, 0xc421ba2090, 0x2c, 0xc420358b80, ...)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/output/upload_logfile.go:84 +0xbf2 fp=0xc421650108 sp=0xc42164f988 pc=0x8db572
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: github.com/pganalyze/collector/output.UploadAndSendLogs(0xc42023c9a1, 0x10, 0xc62853, 0x19, 0x0, 0x0, 0x0, 0x0, 0x10001, 0x0, ...)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/output/compact_logs.go:12 +0x48b fp=0xc421650840 sp=0xc421650108 pc=0x8d708b
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: github.com/pganalyze/collector/runner.downloadLogsForServer(0xc42023c9a1, 0x10, 0xc62853, 0x19, 0x0, 0x0, 0x0, 0x0, 0x10001, 0x0, ...)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/runner/logs.go:31 +0x395 fp=0xc421650f70 sp=0xc421650840 pc=0xa06805
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: github.com/pganalyze/collector/runner.DownloadLogsFromAllServers(0xc420584000, 0x1, 0x1, 0x101010101010101, 0x1, 0xc5d853, 0x13, 0x100, 0x0, 0x0, ...)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/runner/logs.go:84 +0x1ac fp=0xc421651de8 sp=0xc421650f70 pc=0xa0702c
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: main.run.func3()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/main.go:179 +0xba fp=0xc421651e88 sp=0xc421651de8 pc=0xa152ba
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: github.com/pganalyze/collector/scheduler.Group.Schedule.func1(0xc4201ec0e0, 0xc4205b5800, 0xc64775, 0x1b, 0xc4205ab560, 0xc4213f3560)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/scheduler/scheduler.go:26 +0x41 fp=0xc421651fb0 sp=0xc421651e88 pc=0xa10e41
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.goexit()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc421651fb8 sp=0xc421651fb0 pc=0x45f851
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: created by github.com/pganalyze/collector/scheduler.Group.Schedule
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/scheduler/scheduler.go:16 +0x93
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: goroutine 1 [chan receive, 15873 minutes]:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: main.main()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/main.go:386 +0x104c
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: goroutine 5 [syscall, 15873 minutes]:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: os/signal.signal_recv(0xd50080)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/sigqueue.go:139 +0xa6
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: os/signal.loop()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: created by os/signal.init.0
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/os/signal/signal_unix.go:28 +0x41
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: goroutine 3575056 [select]:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: net/http.(*persistConn).writeLoop(0xc429930900)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/net/http/transport.go:1822 +0x14b
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: created by net/http.(*Transport).dialConn
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/net/http/transport.go:1238 +0x97f
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: goroutine 10 [select, 15876 minutes, locked to thread]:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.gopark(0xcd6d50, 0x0, 0xc54fe3, 0x6, 0x18, 0x1)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/proc.go:291 +0x11a
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.selectgo(0xc42003af50, 0xc4205c01e0)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/select.go:392 +0xe50
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.ensureSigM.func1()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/signal_unix.go:549 +0x1f4
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: runtime.goexit()
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: goroutine 686 [select, 1 minutes]:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: github.com/pganalyze/collector/scheduler.Group.Schedule.func1(0xc4201eddc0, 0xc4205b5800, 0xc654d0, 0x1c, 0xc4205ab4d0, 0xc4213f3500)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/scheduler/scheduler.go:22 +0x268
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: created by github.com/pganalyze/collector/scheduler.Group.Schedule
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /go/src/github.com/pganalyze/collector/scheduler/scheduler.go:16 +0x93
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: goroutine 3576465 [IO wait]:
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: internal/poll.runtime_pollWait(0x7f93e2a8dbc0, 0x72, 0xc421916760)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/runtime/netpoll.go:173 +0x57
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: internal/poll.(*pollDesc).wait(0xc420400918, 0x72, 0xffffffffffffff00, 0xd4d460, 0x13e5a38)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: internal/poll.(*pollDesc).waitRead(0xc420400918, 0xc4219b4000, 0x1000, 0x1000)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: internal/poll.(*FD).Read(0xc420400900, 0xc4219b4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]:         /usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
Oct 15 04:21:17 ip-10-0-21-2 pganalyze-collector[17658]: net.(*netFD).Read(0xc420400900, 0xc4219b4000, 0x1000, 0x1000, 0x20, 0x28, 0xc4219168a8)
Oct 15 04:21:17 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 15 04:21:17 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Unit entered failed state.
Oct 15 04:21:17 ip-10-0-21-2 systemd[1]: pganalyze-collector.service: Failed with result 'exit-code'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants