-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gimlet-seq: retry I2C errors, or die trying #1611
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N.B. that the git diff is super messy just because I moved a big chunk of code out of main
and into a function; it's mostly unchanged, but GitHub doesn't really seem to get that. All the functional changes are in commit ee898f0, which might be a bit easier to review.
A general question I have from an upstack consumer perspective:
My understanding is that #1556 has caused the visible state of the fault signal to be asserted the entire time that we are booting up. So I'm not sure how this intersects with this. In particular, given the change in that PR, we cannot actually know that we need to power cycle that Gimlet without creating some sort of startup time-out countdown timer. That is, upstack we'd have to know that this has remained asserted for some time x to believe that this is valid versus the original theory of being able to say whenever it's asserted we know it's a problem. If the logic in 1556 happens before this, how do we distinguish the two? If not, do they intersect? |
8ce2f19
to
48f3391
Compare
@rmustacc If I'm being entirely honest, that seems like more of a question for @cbiffle — I just picked up the issue as he described it in #1205. Cliff probably has a better sense of how this is expected to interact with the rest of the system. With that said, I believe the visible behavior on failures should be similar? Currently, the fault pin is pulled low during the startup process, and only set high once the sequencer task is actually serving IPC requests. If the sequencer task crashes and enters a crashloop before it starts serving IPC requests, the fault pin remains asserted. Now, if a failure occurs during startup, the sequencer task no longer crash loops, but it still won't set the fault pin high, so the fault pin still remains asserted indefinitely if the sequencer task can't come up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think moving init to ServerImpl has more justification than you suggest in your commit message -- you've basically done a "parse don't validate," since ServerImpl is written with a lot of embedded assumptions that init succeeded. So it's now harder to accidentally start a server without finishing init.
Generally good, a few questions below:
drv/gimlet-seq-server/src/main.rs
Outdated
// All these moments will be lost in time, like tears in rain... | ||
// Time to die. | ||
loop { | ||
hl::sleep_for(100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IPC server probably assumes that initialization was successful, so it's unsafe to start it here... that's too bad because it'd be really great to be able to ask it questions.
We'd have to introduce some notion of "I'm broken" into the IPC server for that to work.
Mostly just thinking out loud here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was also thinking about possibly having it start serving, and just respond to everything with an "I'm broken" message. Would be happy to change that now, if you think that's a good idea, or in a later PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this PR is a significant marginal improvement, so there's no reason to ask you to cram more stuff in before merging it.
{ | ||
// Chosen by fair dice roll, seems reasonable-ish? | ||
let mut retries_remaining = 3; | ||
loop { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might consider structuring this as a for _ in 0..retries_remaining
just to avoid accidentally creating an infinite loop through a logic error below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was what i was going to do. but, we want to return the last error when we run out of retries, which means that we need to keep it around when we fall through to the end of the loop, which felt a bit ugly; it would have to be something like this:
// Chosen arbitrarily, but don't worry! We'll never actually return this --- it's
// just because we need to initialize this to *something*.
let mut last_error = i2c::ResponseCode::BadResponse;
for retries_remaining in (0..3).rev() {
match txn() {
Ok(x) => return Ok(x),
Err(e) => {
let code = e.into();
ringbuf_entry!(Trace::I2cFault {
retries_remaining,
code,
});
last_error = code;
}
}
}
Err(last_error)
in particular, i felt like having to assign last_error
to something, even though we've never actually seen that error, felt a bit unfortunate.
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm. Fair point. Current structure's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, the more I think about it, I'm not sure whether initializing the variable to an arbitrary error code we'll never actually return, or the theoretically-infinite loop which isn't actually is more unpleasant. They're both kinda ugly...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A third option, which is also not my favorite, is to do something like this:
for retries_remaining in (0..MAX_RETRIES).rev() {
match txn() {
Ok(x) => return Ok(x),
Err(error) => {
let code = error.into();
ringbuf_entry!(Trace::I2cFault {
retries_remaining,
code,
});
if retries_remaining == 0 {
return Err(code);
}
}
}
}
unreachable!()
Now, we've gotten rid of the arbitrary error code that we'll never actually return, and the infinite loop that isn't actually infinite, but this comes at the cost of an unreachable!()
, and having to write unreachable!()
also makes me feel kind of sad.
If you have a preference for any of these bad options, I'll happily go with that one --- I don't really have strong feelings about which is the least objectionable. :)
@rmustacc - Until we're confident that potential crashloops are all gone from the sequencer task, I'm hesitant to remove the code that sets fault on startup. So we probably need to work out a "how much time does the sequencer have to start up before we consider it faulted" delay either way until we're more confident in the code. This change fixes most (or all, possibly) of the I2C-related panics, but not all the panics. |
That's fine. Just want to make sure that we can actually use this up stack and then we'll have to figure out how that fits into the ignition controller and related (or the future looking interface overhaul you have planned). I realize the upstack stuff isn't there yet so I don't ask this to block this at all, just to make sure we know how it all would fit together. |
@rmustacc - understood. The main improvement of this change is that it'll stop frantically reinitializing the FPGA and bouncing rails when I2C messes up; improved upstream diagnostics are kind of a side effect. There's more we can do to improve the diagnostics beyond this change. |
e569c99
to
982cb5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(please squash or fold your most recent commits)
(or all of them, I'm not the cops)
Currently, the sequencer task uses `unwrap()` on I2C bus operations. However, these may fail due to transient bus errors, or fail permanently for reasons wholly outside the sequencer task's control. As discussed in issue #1205, it's incorrect for the sequencer task to unwap these errors, as panicking results in the task being restarted. On fatal bus errors, this means we crash loop. Instead, this commit changes the `gimlet-seq` task to retry failed I2C reads/writes up to three times, and, if they still fail, transition to a permanent failed state, setting the FAULT net to Ignition to indicate that the Gimlet must be power-cycled. Fixes #1205
This way, we only construct a `ServerImpl` when the sequencer task is in a state to serve IPC requests. This isn't all that important, but it's nice to make it a bit harder to accidentally start a server without finishing init.
Sleeping for 100ms in a loop is not a particularly efficient way for a task to die permanently. Thanks to suggestions from @labbott and @cbiffle, I've changed this so that the sequencer task now dies by waiting for a notification with an empty notification mask, essentially waiting forever for a notification that should never happen. This is much more respectful of other tasks that might still be trying to do stuff, and would like to be scheduled to do that stuff. Also, I've cleaned up a couple other things based on some of Cliff's suggestions. Because we no longer set the state when dying, we can pass the `jefe` handle into `init` by value, which is a bit nicer.
a0cdaf6
to
b02c0d7
Compare
Currently, the sequencer task uses `unwrap()` on I2C bus operations. However, these may fail due to transient bus errors, or fail permanently for reasons wholly outside the sequencer task's control. As discussed in issue #1205, it's incorrect for the sequencer task to unwap these errors, as panicking results in the task being restarted. On fatal bus errors, this means we crash loop. Instead, this commit changes the `gimlet-seq` task to retry failed I2C reads/writes up to three times, and, if they still fail, transition to a permanent failed state, setting the FAULT net to Ignition to indicate that the Gimlet must be power-cycled. Fixes #1205
This way, we only construct a `ServerImpl` when the sequencer task is in a state to serve IPC requests. This isn't all that important, but it's nice to make it a bit harder to accidentally start a server without finishing init.
Sleeping for 100ms in a loop is not a particularly efficient way for a task to die permanently. Thanks to suggestions from @labbott and @cbiffle, I've changed this so that the sequencer task now dies by waiting for a notification with an empty notification mask, essentially waiting forever for a notification that should never happen. This is much more respectful of other tasks that might still be trying to do stuff, and would like to be scheduled to do that stuff. Also, I've cleaned up a couple other things based on some of Cliff's suggestions. Because we no longer set the state when dying, we can pass the `jefe` handle into `init` by value, which is a bit nicer.
b02c0d7
to
bd539d6
Compare
Sleeping for 100ms in a loop is not a particularly efficient way for a task to die permanently. Thanks to suggestions from @labbott and @cbiffle, I've changed this so that the sequencer task now dies by waiting for a notification with an empty notification mask, essentially waiting forever for a notification that should never happen. This is much more respectful of other tasks that might still be trying to do stuff, and would like to be scheduled to do that stuff. Also, I've cleaned up a couple other things based on some of Cliff's suggestions. Because we no longer set the state when dying, we can pass the `jefe` handle into `init` by value, which is a bit nicer.
bd539d6
to
72c99f3
Compare
Currently, the sequencer task uses `unwrap()` on I2C bus operations. However, these may fail due to transient bus errors, or fail permanently for reasons wholly outside the sequencer task's control. As discussed in issue #1205, it's incorrect for the sequencer task to unwap these errors, as panicking results in the task being restarted. On fatal bus errors, this means we crash loop. Instead, this commit changes the `gimlet-seq` task to retry failed I2C reads/writes up to three times, and, if they still fail, transition to a permanent failed state, setting the FAULT net to Ignition to indicate that the Gimlet must be power-cycled. Fixes #1205
This way, we only construct a `ServerImpl` when the sequencer task is in a state to serve IPC requests. This isn't all that important, but it's nice to make it a bit harder to accidentally start a server without finishing init.
Currently, the sequencer task uses
unwrap()
on I2C bus operations.However, these may fail due to transient bus errors, or fail permanently
for reasons wholly outside the sequencer task's control. As discussed in
issue #1205, it's incorrect for the sequencer task to unwap these
errors, as panicking results in the task being restarted. On fatal bus
errors, this means we crash loop.
Instead, this commit changes the
gimlet-seq
task to retry failed I2Creads/writes up to three times, and, if they still fail, transition to a
permanent failed state, setting the FAULT net to Ignition to indicate
that the Gimlet must be power-cycled.