Problem
Today, if a getpage request from compute encounters an IO error inside pageserver (e.g. an error from the pageserver's filesystem), the compute SQL query that reads this result will fail and surface a scary-looking query error to the application, one that users cannot act on.
(Unverified, but that is my understanding of the compute-side code).
That is not a good user experience.
We should align in the team on the intended behavior and figure out how to get there.
Technical Stuff
IO errors get mapped to an undifferentiated PageReconstructError::Other, which is an anyhow error.
We get a nice stack trace, and the user gets a PagestreamBeMessage::Error.
The page_service pagestream protocol can in theory continue processing requests.
In practice, the compute backend probably isn't going to read any further requests, but I'm not sure.
There is no checking & aborting of the process for maybe_fatal_err on the path up inside pageserver.
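One way to picture the alternative to an undifferentiated Other is a dedicated I/O variant that callers can match on. The following is only a minimal sketch: `ReconstructError`, `read_from_disk`, `get_page`, and `client_message` are hypothetical names invented for illustration, not the pageserver's actual types or functions, and the message wording is borrowed from the suggestion in the discussion.

```rust
use std::io;

/// Hypothetical, simplified stand-in for the pageserver's error type.
#[derive(Debug)]
enum ReconstructError {
    /// I/O failures keep their own variant so callers can match on them.
    Io(io::Error),
    /// Everything else (assumption: the real code carries an anyhow error here).
    Other(String),
}

impl From<io::Error> for ReconstructError {
    fn from(e: io::Error) -> Self {
        ReconstructError::Io(e)
    }
}

/// Hypothetical disk read that can be made to fail with errno 5 (EIO on Linux).
fn read_from_disk(fail: bool) -> io::Result<Vec<u8>> {
    if fail {
        Err(io::Error::from_raw_os_error(5))
    } else {
        Ok(vec![0u8; 8192])
    }
}

/// `?` converts the io::Error through the From impl, so the I/O origin survives.
fn get_page(fail: bool) -> Result<Vec<u8>, ReconstructError> {
    let page = read_from_disk(fail)?;
    Ok(page)
}

/// Chooses the text that would be sent back to the client for each variant.
fn client_message(err: &ReconstructError) -> String {
    match err {
        ReconstructError::Io(_) => String::from(
            "storage I/O error while reading page; if you encounter this \
             error repeatedly, please contact support",
        ),
        ReconstructError::Other(msg) => format!("page reconstruction failed: {msg}"),
    }
}

fn main() {
    if let Err(e) = get_page(true) {
        println!("{}", client_message(&e));
    }
    let other = ReconstructError::Other(String::from("example"));
    println!("{}", client_message(&other));
}
```

With a distinct variant, the decision of what to tell the client (or whether to abort instead) can be made in one place rather than inferred from an opaque error chain.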
> scary-looking query error to the application that are unactionable to users.
When we get an Other (I/O errors shouldn't be Other though...), it's sort of right that this is scary-looking, but not that it's unactionable. Should we just rephrase this into something like "If you encounter this error repeatedly, please contact Neon support"?
> IO errors get mapped to an undifferentiated PageReconstructError::Other
Then for the overall flow... I guess compute sees a connection drop, gets a notification from CP when another pageserver is available, reconnects and succeeds (hopefully!).
If we abort the process rather than bubbling up the error, then compute doesn't see the scary error (it just eventually retries).
"I/O" error in the doc above means EIO -- there is a broader category of std::io::Error that includes things like code bugs/races that might show up as an ENOENT or similar. Those are out of scope.
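To make the EIO-versus-everything-else distinction concrete, here is a small sketch of how such a check could look using only the standard library. The `Action` enum and the `is_eio` and `action_for` helpers are hypothetical, the errno value 5 is hardcoded on the assumption of a Linux host, and aborting on EIO is only the option floated in this thread, not a decided behavior.

```rust
use std::io;

/// errno value of EIO on Linux (assumption: hardcoded to avoid a libc dependency).
const EIO: i32 = 5;

/// Hypothetical policy outcome for an error seen while serving a getpage request.
#[derive(Debug, PartialEq)]
enum Action {
    /// Treat as a hardware-level failure: abort so compute reconnects elsewhere.
    AbortProcess,
    /// Out of scope here (e.g. ENOENT from a bug or race): return the error as today.
    ReturnError,
}

/// True only when the error carries the raw OS code for EIO.
fn is_eio(e: &io::Error) -> bool {
    e.raw_os_error() == Some(EIO)
}

fn action_for(e: &io::Error) -> Action {
    if is_eio(e) {
        Action::AbortProcess
    } else {
        Action::ReturnError
    }
}

fn main() {
    let eio = io::Error::from_raw_os_error(EIO);
    let enoent = io::Error::from_raw_os_error(2); // ENOENT on Linux
    println!("EIO    -> {:?}", action_for(&eio));
    println!("ENOENT -> {:?}", action_for(&enoent));
}
```

Matching on the raw OS code rather than on `ErrorKind` keeps the check narrow: errors constructed in code without an OS code return `None` from `raw_os_error()` and fall through to the non-fatal branch.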