No great pattern for doing a graceful shutdown with Apollo Server integration packages #5074
Comments
Yes, if you're using the batteries-included `apollo-server` package you already get this; if you're using one of the integration packages, the only difference in v2.22 is that we're more strict about not letting you run operations during or after the server has stopped. I think the main "bug" here is that the default signal handling behavior doesn't give you much of a hook for doing the right thing yourself.
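For illustration, here is a minimal sketch of handling the signals yourself; it is an assumption-laden example, not from this comment. It uses the `apollo-server-express` integration and the `stopOnTerminationSignals: false` constructor option (mentioned later in this thread) to opt out of the built-in handling, then drains the Node HTTP server before calling `server.stop()`:

```js
const express = require('express');
const http = require('http');
const { ApolloServer } = require('apollo-server-express');

async function startApolloServer(typeDefs, resolvers) {
  const app = express();
  const httpServer = http.createServer(app);
  const server = new ApolloServer({
    typeDefs,
    resolvers,
    // Opt out of the default SIGINT/SIGTERM handling so we control the ordering ourselves.
    stopOnTerminationSignals: false,
  });
  await server.start();
  server.applyMiddleware({ app });
  await new Promise((resolve) => httpServer.listen({ port: 4000 }, resolve));

  process.once('SIGTERM', async () => {
    // Stop accepting new connections and wait for in-flight requests to finish.
    // (Idle keep-alive connections are not force-closed here; a grace period
    // with forced socket destruction may still be needed in practice.)
    await new Promise((resolve) => httpServer.close(resolve));
    // Only then run Apollo Server's own shutdown hooks (final usage report, etc.).
    await server.stop();
    process.exit(0);
  });
}
```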
I think I want to add a method to ApolloServer for this.
Maybe we just need a "pre-stop handler" (and have a phase before "stopping" that's like "stopping" but where operations still run). This will also give an appropriate place to shut down subscription servers.
I think we should fork `stoppable`.
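(For context, a rough sketch of the kind of connection draining `stoppable` provides, assuming a plain Node `http.Server` and a fixed grace period; this is illustrative, not the actual library code:)

```js
// Track each socket's in-flight request count so we can close idle sockets
// immediately when draining starts and force-destroy stragglers later.
function makeDrainable(server, gracePeriodMs = 10000) {
  const pendingBySocket = new Map();
  let draining = false;

  server.on('connection', (socket) => {
    pendingBySocket.set(socket, 0);
    socket.on('close', () => pendingBySocket.delete(socket));
  });

  server.on('request', (req, res) => {
    pendingBySocket.set(req.socket, (pendingBySocket.get(req.socket) || 0) + 1);
    res.on('finish', () => {
      const pending = (pendingBySocket.get(req.socket) || 1) - 1;
      pendingBySocket.set(req.socket, pending);
      // If we're draining and this socket just went idle, close it now.
      if (draining && pending === 0) req.socket.end();
    });
  });

  return async function drain() {
    draining = true;
    // Stop accepting new connections; the callback fires once all sockets close.
    const closed = new Promise((resolve) => server.close(resolve));
    // Close sockets that are already idle.
    for (const [socket, pending] of pendingBySocket) {
      if (pending === 0) socket.end();
    }
    // After the grace period, force-destroy whatever is still open.
    const timer = setTimeout(() => {
      for (const socket of pendingBySocket.keys()) socket.destroy();
    }, gracePeriodMs);
    await closed;
    clearTimeout(timer);
  };
}
```

The returned drain function could then be called from a signal handler before `server.stop()`.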
OK, I think this is my full proposal.
Re docs, we'll need to change from:

```js
async function startApolloServer(typeDefs, resolvers) {
  const server = new ApolloServer({ typeDefs, resolvers });
  await server.start();
  const app = express();
  server.applyMiddleware({ app });
  await new Promise(resolve => app.listen({ port: 4000 }, resolve));
  console.log(`🚀 Server ready at http://localhost:4000${server.graphqlPath}`);
}
```

to

```js
async function startApolloServer(typeDefs, resolvers) {
  const app = express();
  const httpServer = http.createServer(app);
  const server = new ApolloServer({
    typeDefs,
    resolvers,
    plugins: [ApolloServerPluginDrainServer(httpServer)],
  });
  await server.start();
  server.applyMiddleware({ app });
  await new Promise(resolve => httpServer.listen({ port: 4000 }, resolve));
  console.log(`🚀 Server ready at http://localhost:4000${server.graphqlPath}`);
}
```
(starting on the inline-and-improve-stoppable project at #5498)
Sorry to chime in here, but I came here from the docs (https://www.apollographql.com/docs/apollo-server/migration/). I am in the process of migrating Apollo Server from 2.16 to 3.0.2, with the SIGINT workaround.

My dev machine (Mac M1) started giving the following error:

Removing the shutdown code makes everything work again. Is this related? Do we really need the above code?
You're welcome to not shut down your server cleanly if you find that works for you. We're planning to work soon on fixing this issue to give a way to put the shutdown at an appropriate time in the server lifecycle.
I would like to throw this out there just so that it's acknowledged: right now it looks like … I haven't looked into the code, so maybe I'm missing something, but it seems like that's the case, and if so it would be good to keep in mind that in Kubernetes the health check should stay active until the pod is ready to be removed. At least, that's what I intuit, as once a pod enters the `Terminating` state …

Something else could be going on; I'll follow up after I figure out what's up with this.

Edit: Upon further examination, the issue was caused by Istio needing a pod annotation of:

```yaml
# https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#ProxyConfig
proxy.istio.io/config: |
  terminationDrainDuration: {{ $terminationGracePeriodSeconds }}s
```
@glasser just as a heads up, it looks like this issue is now affecting us, or at least we've begun to observe/address it recently.
@kevin-lindsay-1 hmm, are you combining liveness and readiness probes here? Readiness is about routing, liveness is about "should restart". And do liveness probes actually continue to be relevant once you're already shutting down?
@glasser my liveness and readiness probes both use …

A pod will also stop when the process exits; however, I'm not sure if a successful …

Once you're in the …
(Full commit message to come.) Fixes #5074.
Previously, the batteries-included `apollo-server` package had a special override of `stop()` which drains the HTTP server before letting the actual Apollo Server `stop()` machinery begin. This meant that `apollo-server` followed this nice shutdown lifecycle:

- Stop listening for new connections
- Close all idle connections and start closing connections as they go idle
- Wait a grace period for all connections to close and force-close any remaining ones
- Transition ApolloServer to the stopping state, where no operations will run
- Run stop hooks (eg send final usage report)

This was great... but only `apollo-server` worked this way, because only `apollo-server` has full knowledge and control over its HTTP server.

This PR adds a server draining step to the ApolloServer lifecycle and plugin interface, and provides a built-in plugin which drains a Node `http.Server` using the logic of the first three steps above. `apollo-server`'s behavior is now just to automatically install the plugin.

Specifically:

- Add a new 'phase' called `draining` that fits between `started` and `stopping`. Like `started`, operations can still execute during `draining`. Like `stopping`, any concurrent call to `stop()` will just block until the first `stop()` call finishes rather than starting a second shutdown process.
- Add a new `drainServer` plugin hook (on the object returned by `serverWillStart`). Invoke all `drainServer` hooks in parallel during the `draining` phase.
- Make calling `stop()` when `start()` has not yet completed successfully into an error. That behavior was previously undefined. Note that as of #5639, the automatic `stop()` call from signal handlers can't happen before `start()` succeeds.
- Add `ApolloServerPluginDrainHttpServer` to `apollo-server-core`. This plugin implements `drainServer` using the `Stopper` class that was previously in the `apollo-server` package. The default grace period is 10 seconds.
- Clean up integration tests to just use `stop()` with the plugin instead of separately stopping the HTTP server. Note that for Fastify specifically we also call `app.close`, although there is some weirdness here around both `app.close` and our Stopper closing the same server. A comment describes the weirdness; perhaps Fastify experts can improve this later.
- The Hapi web framework has built-in logic that is similar to our Stopper, so `apollo-server-hapi` exports `ApolloServerPluginStopHapiServer`, which should be used instead of the other plugin with Hapi.
- Remove some examples from READMEs and point to examples in the docs instead. Keeping both up to date is extra work.
- Fix some test issues (eg, have FakeTimers only mock out Date.now instead of setImmediate, drop an erroneous `const` which made an `app` not get cleaned up, etc).

Fixes #5074.
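To make the new hook concrete, here is a hedged sketch of a custom plugin using `drainServer` as described above; the `subscriptionServer` object and its `close()` method are hypothetical placeholders for whatever transport you need to shut down:

```js
// Apollo Server awaits drainServer() during the `draining` phase,
// while operations can still execute.
function drainSubscriptionServerPlugin(subscriptionServer) {
  return {
    async serverWillStart() {
      return {
        async drainServer() {
          // Close the hypothetical subscription transport before the server
          // transitions to `stopping` and refuses new operations.
          await subscriptionServer.close();
        },
      };
    },
  };
}
```

Such a plugin would go in the `plugins` array alongside `ApolloServerPluginDrainHttpServer` (or `ApolloServerPluginStopHapiServer` for Hapi).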
Hello,
After updating to 2.22 (see PR #4981) and applying the recommendation in the changelog to insert `await server.start()` between `server = new ApolloServer()` and `server.applyMiddleware`, we started observing that Apollo is now listening to termination signals and stops handling in-flight requests by throwing:

We were already handling these signals and calling the Express server's `close` method (which does not abort in-flight requests but rather stops accepting new ones and waits for the others to finish).

My impression was that when using some middleware, like Express, rather than the standalone Apollo Server, these signals should not be handled by Apollo itself? At least they were not prior to 2.22.x.
To work around this issue we explicitly set `stopOnTerminationSignals: false`, and it seems to have resolved it.

Some context: we are deploying to a K8s Deployment which does a rolling update. After the new version has started, K8s sends a termination signal to the old version. Upon receiving this signal we make the readiness probe fail to avoid new requests being routed, but keep Express up for some more time until the in-flight requests are finished (or a timeout is triggered).
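A minimal sketch of that workaround (the details here are assumptions, not from the original report: Express, a `/ready` endpoint behind the readiness probe, `apollo-server-express`, and an illustrative 15-second delay before draining):

```js
const express = require('express');
const http = require('http');
const { ApolloServer } = require('apollo-server-express');

async function startApolloServer(typeDefs, resolvers) {
  const app = express();
  const httpServer = http.createServer(app);
  let terminating = false;

  // Readiness probe: start failing as soon as a termination signal arrives,
  // so Kubernetes stops routing new requests to this pod.
  app.get('/ready', (_req, res) => res.status(terminating ? 503 : 200).end());

  const server = new ApolloServer({
    typeDefs,
    resolvers,
    stopOnTerminationSignals: false, // we sequence the shutdown ourselves
  });
  await server.start();
  server.applyMiddleware({ app });
  await new Promise((resolve) => httpServer.listen({ port: 4000 }, resolve));

  process.once('SIGTERM', async () => {
    terminating = true;
    // Give Kubernetes time to observe the failing readiness probe.
    await new Promise((resolve) => setTimeout(resolve, 15000));
    // Stop accepting new connections; in-flight requests are allowed to finish.
    await new Promise((resolve) => httpServer.close(resolve));
    await server.stop();
    process.exit(0);
  });
}
```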