Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reread config and restart all services on SIGHUP #1949

Closed
roman-khimov opened this issue May 4, 2021 · 3 comments · Fixed by #2612
Closed

Reread config and restart all services on SIGHUP #1949

roman-khimov opened this issue May 4, 2021 · 3 comments · Fixed by #2612
Milestone

Comments

@roman-khimov
Copy link
Member

It's more tricky than #1892, because most of the configuration just can't be changed (all of protocol-level configuration and many application-level things), so we need to carefully read new configuration, check that nothing critical has changed and then shutdown/startup all services as configured. See #1945 also.

@roman-khimov
Copy link
Member Author

We've got the following services:

  • RPC
    Has a lot of configuration options including TLS keys/certificates.
  • Pprof
    Can be enabled/disabled and configured for some address/port.
  • Prometheus metrics
    Can be enabled/disabled and configured for some address/port.
  • Oracle
    Has a lot of service-specific options, including a list of other oracle nodes.
  • Notary
    Can be enabled/disabled, configured for some address/port, has a wallet configuration.
  • State validation
    Can be enabled/disabled, configured for some address/port, has a wallet configuration.
  • Consensus
    Unfortunately, it's an implicit service in configuration at the moment (in that it doesn't have a section of its own). It only has a wallet configuration (which enables it at the same time).

In general we might want to do something with them either because the configuration has changed or because we suspect that some runtime operational problem can be fixed by dropping the service state. The behavior at the moment is to reread the config and restart RPC service on SIGHUP. Originally we wanted to restart all of the other services as well, but:

  • restarting RPC can affect its clients. They must handle this appropriately, but still we may want to avoid restarting it in some cases
  • restarting all of the services at once doesn't differ much from a complete node restart
  • restarting the node shouldn't be a problem in most of the cases (so changing nothing at all is an option too)
  • most of the services don't have a lot of things to change in their configuration sections, so they're not likely to require restarts
  • consensus restart is dangerous in general, so it should be handled separately from other services (it's also the one where restarts are not likely to be driven by configuration changes)
  • theoretically there are other things that can be controlled externally in a node (flushing transaction/notary pools, dropping P2P connections)
  • we've only got HUP, USR1 and USR2 if we're to follow our current control model which is not a lot

What I'd probably do is:

  • reread config and restart RPC, Pprof, Prometheus (client-oriented services) on HUP
  • reread config and restart Oracle/Notary/SV (network-oriented services) on USR1
  • reread config and restart Consensus (special one) on USR2

But this uses all of the "normal" signals and maybe is an overkill. And there is always an option of full node restart which can solve any configuration/operation issue.

@anatoly-bogatyrev, @realloc, thoughts, opinions?

@roman-khimov roman-khimov modified the milestones: v0.99.3, v0.99.2 Jul 11, 2022
@realloc
Copy link

realloc commented Jul 18, 2022

From the current field usage experience, we need a way to restart Consensus without disturbing RPC and WS connections giving first priority to reads/testinvoke calls in the following cases:

  • Stuck consensus service
  • Increase number of CNs
  • CN replacement

We also need a way to restart RPC without disturbing other services on TLS certificates update.

The proposed scheme with HUP/USR1/USR2 signals is good for a start.

In future it would be also nice to have this wrapped into neo-go CLI. Like neo-go service <svc> [restart|status] or have the full control API like we do in neofs-node.

@roman-khimov
Copy link
Member Author

Like neo-go service [restart|status] or have the full control API like we do in neofs-node.

That requires either some RPC extensions (with all of the associated access restriction problems), or a separate control channel (like Unix socket) which is something that we'd like to avoid as much as possible.

roman-khimov added a commit that referenced this issue Jul 27, 2022
@roman-khimov roman-khimov modified the milestones: v0.99.2, v0.99.1 Jul 27, 2022
roman-khimov added a commit that referenced this issue Jul 27, 2022
@roman-khimov roman-khimov modified the milestones: v0.99.1, v0.99.2 Jul 28, 2022
roman-khimov added a commit that referenced this issue Jul 28, 2022
roman-khimov added a commit that referenced this issue Aug 2, 2022
Fix #1949. Also drop wallet from the ServerConfig since it's not used in any
meaningful way after this change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants