pageserver: discuss & define behavior on read IO errors #10454

problame · 2025-01-20T17:52:31Z

Problem

Today, if a getpage request from compute encounters an IO error inside pageserver (e.g. an error from the pageserver's filesystem), the compute SQL query that reads this result will fail and surface a scary-looking query error to the application that is unactionable to users.
(Unverified, but that is my understanding of the compute-side code).

That is not a good user experience.

We should align in the team on the intended behavior and figure out how to get there.

Technical Stuff

IO errors get mapped to an undifferentiated PageReconstructError::Other, which is anyhow.
We get a nice stack trace, and the user gets a PagestreamBeMessage::Error.
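
To make the mapping concrete, here is a minimal sketch, not the actual pageserver code. The enum below is a simplified stand-in: the real `Other` variant wraps `anyhow::Error`, while this sketch uses a boxed error to stay free of external crates. The `Io` variant and all three function names are hypothetical, added only to show what a differentiated mapping could look like.

```rust
use std::error::Error;
use std::io;

/// Simplified, hypothetical stand-in for the pageserver's error enum.
#[derive(Debug)]
enum PageReconstructError {
    /// Catch-all variant. The real one wraps `anyhow::Error`; a boxed
    /// error is used here so the sketch needs no external crates.
    Other(Box<dyn Error + Send + Sync>),
    /// Hypothetical dedicated variant that keeps the IO error typed.
    Io(io::Error),
}

/// The behavior described above: the IO error is folded into `Other`,
/// so matching on the variant cannot tell it apart from other failures.
fn map_undifferentiated(e: io::Error) -> PageReconstructError {
    PageReconstructError::Other(Box::new(e))
}

/// Alternative: keep the IO error in its own variant.
fn map_differentiated(e: io::Error) -> PageReconstructError {
    PageReconstructError::Io(e)
}

/// What a caller can learn by matching on the variant alone.
fn is_io_error(e: &PageReconstructError) -> bool {
    matches!(e, PageReconstructError::Io(_))
}
```

With the first mapping, a caller that only matches on the variant sees `Other` and cannot decide whether to retry, abort, or report; with the second, that decision can be made at the call site.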

The page_service pagestream protocol can in theory continue processing requests.
In practice, the compute backend probably isn't going to read any further requests, but I'm not sure.

There is no checking & aborting of the process for maybe_fatal_err on the path up inside pageserver.


jcsp · 2025-01-20T19:00:24Z

> scary-looking query error to the application that is unactionable to users.

When we get an Other (I/O errors shouldn't be Other though...), it's sort of right that this is scary looking, but not that it's un-actionable. Should we just rephrase this into something like "If you encounter this error repeatedly, please contact neon support"?

> IO errors get mapped to an undifferentiated PageReconstructError::Other

When we discussed I/O errors and wrote https://docs.neon.build/storage/handling_io_and_logical_errors.html?highlight=coding%20conventions#purpose -- I think our intent was that we would abort the pageserver process in this case.

Then for the overall flow... I guess compute sees a connection drop, gets a notification from CP when another pageserver is available, reconnects and succeeds (hopefully!).

jcsp · 2025-01-28T15:12:33Z

If we abort the process rather than bubbling up the error, then compute doesn't see the scary error (it just eventually retries).
"I/O" error in the doc above means EIO -- there is a broader category of std::io::Error that includes things like code bugs/races that might show up as an ENOENT or similar. Those are out of scope.
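
The scoping above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pageserver's existing `maybe_fatal_err` code: `IoErrorPolicy`, `classify`, and `abort_if_fatal` are hypothetical names, and the `EIO` value is hard-coded to its Linux errno to avoid depending on the `libc` crate.

```rust
use std::io;

/// Linux errno value for EIO (assumption: Linux; hard-coded so the
/// sketch does not need the `libc` crate).
const EIO: i32 = 5;

/// Hypothetical policy for an `std::io::Error` on the read path.
#[derive(Debug, PartialEq, Eq)]
enum IoErrorPolicy {
    /// A real EIO from the OS: abort the process.
    AbortProcess,
    /// Anything else (ENOENT from a bug or race, synthetic errors, ...):
    /// out of scope for this policy, handled as before.
    OutOfScope,
}

/// Classify by raw OS error code rather than by `ErrorKind`.
fn classify(e: &io::Error) -> IoErrorPolicy {
    match e.raw_os_error() {
        Some(EIO) => IoErrorPolicy::AbortProcess,
        _ => IoErrorPolicy::OutOfScope,
    }
}

/// Hypothetical wrapper: abort on EIO, otherwise pass the result through
/// unchanged so the caller keeps its existing error handling.
fn abort_if_fatal<T>(res: io::Result<T>, context: &str) -> io::Result<T> {
    if let Err(e) = &res {
        if classify(e) == IoErrorPolicy::AbortProcess {
            eprintln!("fatal I/O error while {context}: {e}");
            std::process::abort();
        }
    }
    res
}
```

A read path would wrap its IO calls, for example `abort_if_fatal(file.read(&mut buf), "reading a layer file")?`, so an EIO never reaches the getpage response while an ENOENT still bubbles up as an ordinary error.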

problame added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Jan 20, 2025

jcsp assigned problame Jan 28, 2025
