pageserver: discuss & define behavior on read IO errors #10454

Open
problame opened this issue Jan 20, 2025 · 2 comments
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@problame
Contributor

Problem

Today, if a getpage request from compute encounters an IO error inside pageserver (e.g. an error from the pageserver's filesystem), the compute SQL query that reads the result fails, and the application sees a scary-looking query error that is unactionable to users.
(Unverified, but that is my understanding of the compute-side code).

That is not a good user experience.

We should align in the team on the intended behavior and figure out how to get there.

Technical Stuff

IO errors get mapped to an undifferentiated PageReconstructError::Other, which wraps an anyhow error.
We get a nice stack trace, and the user gets a PagestreamBeMessage::Error.
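
To make "undifferentiated" concrete, here is a minimal sketch of the alternative: an error enum that keeps I/O failures in their own variant so that callers can match on them. `ReconstructError`, `read_block` and `reconstruct_page` are hypothetical stand-ins, not the pageserver's actual types, and `Other` holds a `String` here only to keep the example free of external crates.

```rust
use std::fmt;
use std::io;

/// Hypothetical, simplified stand-in for an error enum such as
/// `PageReconstructError`. Not the pageserver's real definition.
#[derive(Debug)]
enum ReconstructError {
    /// I/O failures keep their own variant so callers can tell them apart.
    Io(io::Error),
    /// Catch-all for everything else (a String here, for self-containment).
    Other(String),
}

impl From<io::Error> for ReconstructError {
    fn from(e: io::Error) -> Self {
        ReconstructError::Io(e)
    }
}

impl fmt::Display for ReconstructError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReconstructError::Io(e) => write!(f, "read I/O error: {e}"),
            ReconstructError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

/// Simulated low-level read; `fail` stands in for a filesystem error.
fn read_block(fail: bool) -> io::Result<Vec<u8>> {
    if fail {
        return Err(io::Error::new(io::ErrorKind::Other, "simulated read failure"));
    }
    Ok(vec![0u8; 8192])
}

/// `?` converts the io::Error through the From impl above, so the I/O
/// failure arrives at the caller as `ReconstructError::Io`, not `Other`.
fn reconstruct_page(fail: bool) -> Result<Vec<u8>, ReconstructError> {
    let block = read_block(fail)?;
    if block.is_empty() {
        return Err(ReconstructError::Other("empty block".to_string()));
    }
    Ok(block)
}
```

With a dedicated variant, the layer that builds the response for compute could decide per variant whether to return an error, rephrase it, or treat it as fatal.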

The page_service pagestream protocol can in theory continue processing requests.
In practice, the compute backend probably isn't going to read any further requests, but I'm not sure.
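
As a rough illustration of "can in theory continue processing requests": a request loop in which a failed read produces an error response for that request only, and later requests are still served. `Response`, `get_page` and `serve` are hypothetical names for this sketch and only loosely mirror the real protocol types.

```rust
/// Hypothetical response type, loosely modelled on `PagestreamBeMessage`.
#[derive(Debug, PartialEq)]
enum Response {
    Page(Vec<u8>),
    Error(String),
}

/// Hypothetical page lookup: odd block numbers fail, to simulate a read error.
fn get_page(block_no: u32) -> Result<Vec<u8>, String> {
    if block_no % 2 == 1 {
        Err(format!("simulated read failure for block {block_no}"))
    } else {
        Ok(vec![block_no as u8; 4])
    }
}

/// Answers every request in order. A failure yields an `Error` response
/// for that request only; the loop keeps going afterwards.
fn serve(requests: &[u32]) -> Vec<Response> {
    requests
        .iter()
        .map(|&block_no| match get_page(block_no) {
            Ok(page) => Response::Page(page),
            Err(msg) => Response::Error(msg),
        })
        .collect()
}
```

Whether the compute side keeps sending requests on the same connection after receiving an error is the open question above; the sketch only covers the server side.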

Nothing on the error path up through pageserver checks for maybe_fatal_err and aborts the process.

@problame problame added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Jan 20, 2025
@jcsp
Collaborator

jcsp commented Jan 20, 2025

scary-looking query error to the application that are unactionable to users.

When we get an Other (I/O errors shouldn't be Other though...), it's sort of right that this is scary looking, but not that it's un-actionable. Should we just rephrase this into something like "If you encounter this error repeatedly, please contact neon support"?

IO errors get mapped to an undifferentiated PageReconstructError::Other

When we discussed I/O errors and wrote https://docs.neon.build/storage/handling_io_and_logical_errors.html?highlight=coding%20conventions#purpose -- I think our intent was that we would abort the pageserver process in this case.

Then for the overall flow... I guess compute sees a connection drop, gets a notification from CP when another pageserver is available, reconnects and succeeds (hopefully!).
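
A very rough sketch of that flow from the compute side, assuming the list of candidate pageservers is already known (in reality it would come from a control plane notification). `Attempt` and `get_page_with_failover` are hypothetical names, and `fetch` stands in for a real connection attempt.

```rust
/// Hypothetical outcome of one getpage attempt, as seen from compute.
#[derive(Debug, PartialEq)]
enum Attempt {
    Page(Vec<u8>),
    ConnectionDropped,
}

/// Tries each candidate pageserver in turn and moves on to the next one
/// when the connection drops. Returns the pageserver that answered along
/// with the page, or None if every candidate dropped the connection.
fn get_page_with_failover<F>(pageservers: &[&str], mut fetch: F) -> Option<(String, Vec<u8>)>
where
    F: FnMut(&str) -> Attempt,
{
    for &ps in pageservers {
        match fetch(ps) {
            Attempt::Page(page) => return Some((ps.to_string(), page)),
            Attempt::ConnectionDropped => continue,
        }
    }
    None
}
```

The point is that a dropped connection turns into a retry that the SQL client never sees, as long as some pageserver eventually answers.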

@jcsp
Collaborator

jcsp commented Jan 28, 2025

  • If we abort the process rather than bubbling up the error, then compute doesn't see the scary error (it just eventually retries).
  • "I/O" error in the doc above means EIO -- there is a broader category of std::io::Error that includes things like code bugs/races that might show up as an ENOENT or similar. Those are out of scope.
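
The two points above could be sketched as follows: classify an `std::io::Error` by its raw OS error code, abort only on EIO, and pass everything else (e.g. ENOENT) back to the caller as an ordinary error. This is an illustration under stated assumptions, not the actual `maybe_fatal_err` implementation; `IoErrorClass`, `classify` and `abort_on_eio` are hypothetical names, and the errno values are the Linux ones.

```rust
use std::io;

/// Assumption: EIO is errno 5, which holds on Linux. Real code would more
/// likely take the constant from a crate such as `libc`.
const EIO: i32 = 5;

#[derive(Debug, PartialEq)]
enum IoErrorClass {
    /// The OS reported EIO: candidate for aborting the process.
    Fatal,
    /// Anything else, e.g. ENOENT from a bug or race: a normal error.
    NonFatal,
}

/// Classifies purely on the raw OS error code; errors without one
/// (constructed in user code) are never treated as fatal.
fn classify(err: &io::Error) -> IoErrorClass {
    if err.raw_os_error() == Some(EIO) {
        IoErrorClass::Fatal
    } else {
        IoErrorClass::NonFatal
    }
}

/// Hypothetical wrapper for a read result: abort on EIO, otherwise hand
/// the result back unchanged so the caller can bubble the error up.
fn abort_on_eio<T>(res: io::Result<T>) -> io::Result<T> {
    if let Err(e) = &res {
        if classify(e) == IoErrorClass::Fatal {
            eprintln!("fatal I/O error, aborting: {e}");
            std::process::abort();
        }
    }
    res
}
```

Keeping `classify` separate from the abort makes the policy testable without killing the test process.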
