Long running migrations can cause systemd to terminate daemon on start #7269

Stebalien · 2020-05-04T06:06:24Z

Version information:

0.5.0

Description:

Now that go-ipfs supports systemd's "notification" system, we need to tell systemd to repeatedly extend the startup timeout while performing repo migrations. Otherwise, systemd may kill the daemon, thinking it timed out on startup.

We can do this by repeatedly sending EXTEND_TIMEOUT_USEC=... to systemd's notification socket using github.com/coreos/go-systemd/v22/daemon. See cmd/ipfs/daemon_linux.go for how we interact with systemd's notification service.


RubenKelevra · 2020-05-04T11:52:11Z

Can we notify systemd the same way while we open the database on startup (which is basically the reason for the slow startup documented here), and on shutdown when we, for example, run database compaction jobs, flush data to disk, or clean up sockets?

Stebalien · 2020-05-05T00:12:37Z

Yep. We can wrap all of https://github.com/ipfs/go-ipfs/blob/538bff085abb18a16fbf14386358c00c7078c222/cmd/ipfs/daemon.go#L257-L290.

Stebalien · 2020-05-05T00:15:04Z

However, we don't do database compactions on stop (explicitly disabled for this reason) so we shouldn't delay there. If IPFS gets stuck on shutdown, we should just die and cleanup when we next restart.

RubenKelevra · 2020-05-05T08:36:55Z

I've seen ipfs hit the 1 minute 30 second limit dozens of times on different machines, without badger db. My guess is that it's a TCP cleanup issue.

I think it's still nicer to shut everything down cleanly and keep notifying systemd as long as we make progress.

Maybe add a hard limit, after which we stop notifying on shutdown, to make sure we don't hang indefinitely.

Stebalien · 2020-05-05T22:00:38Z

Odd. We shouldn't be spending any time cleaning up TCP connections or anything like that, unless we have a bug somewhere. Can you reproduce this?

RubenKelevra · 2020-05-07T07:31:25Z

> Odd. We shouldn't be spending any time cleaning up TCP connections or anything like that, unless we have a bug somewhere. Can you reproduce this?

Well, I converted all my nodes to badgerds with the 0.5.0 release.

I need to convert one back to see how it can be reproduced.

RubenKelevra · 2021-03-29T14:27:56Z

@Stebalien

> I've seen ipfs hit the 1 minute 30 second limit dozens of times on different machines, without badger db. My guess is that it's a TCP cleanup issue.

I can report this bug seems to be gone. Not sure when, but I haven't noticed it at all on 0.8.

Apart from this, wouldn't it make sense for systemd to send go-ipfs a SIGABRT instead of a SIGKILL by default if this happens? That would give the user a stack trace to share with us.

Stebalien · 2021-03-31T00:53:12Z

I agree. Want to file a PR?

Stebalien · 2021-03-31T23:20:14Z

Ok, I'm actually just going to disable the startup timeout. It's not helping.

Stebalien added the kind/bug, help wanted, P3, and exp/intermediate labels May 4, 2020

Stebalien mentioned this issue May 4, 2020

systemd: add helptext #7265

Merged

Stebalien mentioned this issue May 6, 2020

skip traversing raw blocks while garbage collecting #7272

Closed

Stebalien mentioned this issue May 22, 2020

IPFS startup with Badger Datastore is hitting the systemd-timeout #7273

Open

Stebalien added the topic/daemon + init, topic/repo, and status/ready labels May 22, 2020

RubenKelevra mentioned this issue Jun 13, 2020

ipfs migration without starting the daemon #7471

Closed

RubenKelevra mentioned this issue Mar 29, 2021

Very long startup with badgerds #8034

Closed

gammazero mentioned this issue Apr 5, 2021

fix: set systemd startup timeout to infinity #8040

Merged

Stebalien closed this as completed in #8040 Apr 22, 2021
