Lost SubmitNotification when sending invocations while runtime is starting #1651
The same problem applies to sending responses back to the ingress. If the current leader successfully sends a response to the ingress but then loses leadership, the next leader has no way of knowing this. The underlying problem is that there is no way for a replica to learn about actions taken by the leader if these actions are not acknowledged by writing an ack to Bifrost.
To sum up the underlying problem again: Restate PPs need to send messages to the ingress (responses and `SubmitNotification`s), and these messages can be lost when leadership changes. The observable effect is that external clients won't receive a response and hang until their connection gets interrupted (e.g. via a timeout or a failure of the ingress). If the invocation wasn't made with an idempotency key set, then the user has little chance of retrieving the original response.

The current state is not acceptable, since a leadership change can happen at any point in time and can leave external clients hanging. An internal implementation detail should ideally not surface in a way that forces users to do extra work to make things work reliably with Restate when everything else is fine (e.g. the external client is happily connected to the ingress without connection loss).

Some ideas for solving/mitigating this problem are:

**Making the communication channel between ingress and PPs reliable**

In order to avoid the problem entirely, we need to make the communication between the ingress and the PPs reliable. One idea is to handle PP -> ingress communication similarly to how we handle PP -> PP communication: if a PP wants to send a message to the ingress, it appends it to the ingress outbox (which could also be the regular outbox), and a shuffle component tries to send it to the ingress. Once the shuffle component has successfully sent the message, it tells its replicas about it by writing an acknowledgement to Bifrost.

The ingress shuffle could send messages to the ingress either by writing to a dedicated ingress log (one per ingress, or multiple ingresses sharing the same log) or by sending them over the network. When using the network, the ingress needs to acknowledge the reception of the message so that the ingress shuffle can generate the acknowledgement.

The downsides are that we introduce more messages that we need to write to Bifrost, and potentially a small delay for sending messages, since we first need to write a message to the ingress outbox and then tell the ingress outbox shuffle about it before it gets sent to the ingress.

**Let the ingress close client connections if a lost message is detected**

If the ingress could detect that a message was lost for a given client connection, it could close the connection to signal the client that something went wrong. If Restate is allowed to do this, then we effectively require that users always invoke services idempotently, since Restate can close their connection even when there was no problem between the client and the ingress (one might argue that this is good practice anyway, since the connection could also be closed because of a fault between the external client and the ingress).

One way to detect that a message might be lost is to remember, for every dispatcher request, the lsn of the corresponding Bifrost entry. If the ingress wants to check a request/connection, it asks the corresponding leader whether it has already processed that lsn. If yes, it checks whether an invocation with the given invocation id is still running. If the invocation is not running, this means that either a response/submit notification was lost or it is still in flight; so after some timeout, we can somewhat safely conclude that a message was lost and close the connection. If the invocation is still running, the ingress can retrieve the result once it completes.

A way to trigger checking a connection is to observe leadership changes of the PPs: a leadership change is the necessary condition for message loss, so it can serve as the trigger for the check.
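Here is a minimal sketch of the detection logic described above. All names (`Lsn`, `LeaderStatus`, `ask_leader`, the grace period) are illustrative assumptions for this sketch, not the actual Restate types or API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type Lsn = u64;
type InvocationId = u128;

/// What the ingress remembers for every in-flight dispatcher request.
struct PendingRequest {
    invocation_id: InvocationId,
    /// Lsn of the Bifrost entry that carried the request.
    lsn: Lsn,
    /// Set once we start suspecting a lost notification.
    suspected_since: Option<Instant>,
}

/// Hypothetical answer from the leader when asked about an lsn/invocation.
enum LeaderStatus {
    NotYetApplied,     // leader hasn't processed the lsn yet
    AppliedAndRunning, // invocation is known and still running
    AppliedNotRunning, // invocation finished (or its response was lost)
}

const LOSS_TIMEOUT: Duration = Duration::from_secs(10); // assumed grace period

/// Called when a PP leadership change is observed (the necessary condition
/// for message loss) to decide which client connections to close.
fn check_pending(
    pending: &mut HashMap<u64 /* connection id */, PendingRequest>,
    ask_leader: impl Fn(Lsn, InvocationId) -> LeaderStatus,
) -> Vec<u64> {
    let mut to_close = Vec::new();
    for (conn, req) in pending.iter_mut() {
        match ask_leader(req.lsn, req.invocation_id) {
            LeaderStatus::NotYetApplied | LeaderStatus::AppliedAndRunning => {
                // Nothing lost (yet): the response/notification can still arrive.
                req.suspected_since = None;
            }
            LeaderStatus::AppliedNotRunning => {
                // Either the response is still in flight or it was lost. Only
                // after the grace period do we conclude loss and close.
                let since = req.suspected_since.get_or_insert_with(Instant::now);
                if since.elapsed() >= LOSS_TIMEOUT {
                    to_close.push(*conn);
                }
            }
        }
    }
    to_close
}

fn main() {} // compile-check stub
```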
**Enforce mandatory timeouts for external calls**

If external client connections always had an associated timeout, clients wouldn't hang. This, of course, raises the question of what this timeout should be, since it limits the maximum duration of a synchronous service invocation. One answer could be to discourage synchronous calls and instead have clients submit invocations and then "poll" for their results.

Another problem arises when a non-idempotent invocation times out: there is currently no reliable way to fetch its result. A solution could be to either require that all invocations are idempotent or to also store the results of non-idempotent invocations. To make this work, we would have to be able to create an invocation with a specified invocation id in case we already submitted the invocation on a previous attempt. Another problem with this approach is that it will increase the P99 latency significantly (by the configured timeout).
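For illustration, a sketch of the "submit, then poll" client pattern this option would push users towards. The `IngressClient` trait and its two endpoints are assumptions for the sketch, not Restate's actual client API:

```rust
use std::time::{Duration, Instant};

trait IngressClient {
    /// Submit an invocation idempotently and return its invocation id.
    fn submit(&self, idempotency_key: &str, payload: &[u8]) -> String;
    /// Ask whether a result is available yet.
    fn poll_result(&self, invocation_id: &str) -> Option<Vec<u8>>;
}

/// Instead of one long synchronous call bounded by a server-side timeout,
/// submit once and poll with short, individually bounded requests.
fn call_with_polling(
    client: &impl IngressClient,
    idempotency_key: &str,
    payload: &[u8],
    overall_deadline: Duration,
) -> Result<Vec<u8>, &'static str> {
    let invocation_id = client.submit(idempotency_key, payload);
    let start = Instant::now();
    while start.elapsed() < overall_deadline {
        if let Some(result) = client.poll_result(&invocation_id) {
            return Ok(result);
        }
        std::thread::sleep(Duration::from_millis(500));
    }
    // The invocation may still complete later; with an idempotency key the
    // caller can simply resubmit/poll again without duplicating work.
    Err("deadline exceeded, poll again later")
}

fn main() {}
```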
Personally, I am leaning towards making the communication between the ingress and the PPs reliable, since it feels odd to me that Restate can "lose" messages internally in a way that requires users to act in a specific manner (make all invocations idempotent and add retries in case of a connection termination), although this is what users should do anyway to properly handle faults that might arise between the client and the ingress.
Ahmed proposed a new variant: the ingress and the partition processors communicate via RPCs. Instead of directly writing to Bifrost, the ingress sends an RPC to the leading partition processor (routed based on the idempotency key or invocation id) and tells it to self-propose the invocation. The partition processor responds with the invocation id once the self-proposed message has been applied. In order to retrieve the result, the ingress sends an RPC to any of the replicas to ask for it. Each replica keeps the results of completed invocations around for a configurable amount of time to fulfill these requests in case of a retried ingress request. Replicas can remove these completed results based on their own clocks, as long as they don't use this information in their state machines. The benefit of this approach is that we don't introduce additional writes to Bifrost. The "downside" is that we need to keep the results of completed ingress invocations around for some time.
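A minimal sketch of the replica-side result retention this variant needs, assuming hypothetical names (`CompletedResults`, `InvocationId` as a plain integer): completed results are kept for a configurable window so a retried ingress RPC can still fetch them, and eviction uses only the replica's local clock so it never feeds back into the state machine:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type InvocationId = u128;

struct CompletedResults {
    retention: Duration,
    results: HashMap<InvocationId, (Instant, Vec<u8>)>,
}

impl CompletedResults {
    fn new(retention: Duration) -> Self {
        Self { retention, results: HashMap::new() }
    }

    /// Record a completed invocation's result when it is applied.
    fn insert(&mut self, id: InvocationId, result: Vec<u8>) {
        self.results.insert(id, (Instant::now(), result));
    }

    /// Serve a (possibly retried) ingress RPC asking for a result.
    fn get(&self, id: &InvocationId) -> Option<&Vec<u8>> {
        self.results.get(id).map(|(_, r)| r)
    }

    /// Evict by local wall-clock time only. Because the cache contents are
    /// never read by the state machine, replicas may disagree on what is
    /// cached without diverging.
    fn evict_expired(&mut self) {
        let retention = self.retention;
        self.results.retain(|_, (at, _)| at.elapsed() < retention);
    }
}

fn main() {
    let mut cache = CompletedResults::new(Duration::from_secs(60));
    cache.insert(42, b"response".to_vec());
    assert!(cache.get(&42).is_some());
    cache.evict_expired();
}
```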
This is a temporary band-aid for the problem that at start-up we might not respond to some invocations if the partition processors are slow at claiming leadership. See restatedev#1651 for more details. The band-aid introduces a oneshot which is completed by the cluster controller once it has seen a leader for every partition. The ingress waits for the oneshot to complete before starting. Note that this band-aid only works for single-process setups where the Admin role is colocated with the Worker role, which should cover the normal deployment right now. This fixes restatedev#1684.
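A minimal sketch of the gating pattern, assuming tokio; the function names and structure are illustrative, not the actual Restate code:

```rust
use tokio::sync::oneshot;

/// Cluster controller side: complete the oneshot once a leader has been
/// observed for every partition.
async fn cluster_controller(all_partitions_have_leaders: oneshot::Sender<()>) {
    // ... watch partition processors until each partition reports a leader ...
    let _ = all_partitions_have_leaders.send(());
}

/// Ingress side: don't accept client traffic until leaders exist, so early
/// invocations are not treated as resumed ones that lose their
/// SubmitNotification.
async fn ingress(ready: oneshot::Receiver<()>) {
    let _ = ready.await;
    // ... start accepting client connections ...
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    tokio::join!(cluster_controller(tx), ingress(rx));
}
```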
We decided to replace the existing ingress <-> PP communication with RPCs:
Design document: https://docs.google.com/document/d/1_LcgBTsghkpjTlyqh8DNidZyFDLc8-DNw88dF741hPU
One way to make the retry behavior event-driven rather than timeout-based is to require that responses to requests are sent via the same physical connection on which the request was received. Given this assumption, we can monitor the liveness of the connection and retry the requests if the connection breaks.
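A sketch of the consequence of that assumption, with all types illustrative: if the connection breaks, none of the requests routed over it can be answered anymore, so all of them are retried immediately instead of waiting for a timeout:

```rust
use std::collections::HashMap;

type RequestId = u64;

struct Connection {
    /// Requests awaiting a response on this physical connection,
    /// keyed by request id, holding the serialized request.
    in_flight: HashMap<RequestId, Vec<u8>>,
}

impl Connection {
    /// Called when the transport layer reports the connection as broken.
    /// Because responses may only travel over this connection, every
    /// in-flight request is now unanswerable: resend all of them.
    fn on_closed(self, resend: &mut impl FnMut(RequestId, Vec<u8>)) {
        for (id, request) in self.in_flight {
            resend(id, request);
        }
    }
}

fn main() {
    let conn = Connection {
        in_flight: HashMap::from([(1, b"invoke".to_vec())]),
    };
    let mut retried = Vec::new();
    conn.on_closed(&mut |id, req| retried.push((id, req)));
    assert_eq!(retried.len(), 1);
}
```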
Fixed via 3a293a4
…ost SubmitNotifications

With restatedev#1651 being fixed, we no longer need to wait for all partitions to have a running leader before starting the ingress in order to avoid stuck invocations. That's because the ingress will now retry until it finds an active leader for a given partition. This fixes restatedev#2206.
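A rough sketch of such a retry loop, assuming a hypothetical `find_leader` routing lookup: the ingress simply backs off and retries until some node has claimed leadership for the partition:

```rust
use std::time::Duration;

type PartitionId = u16;
type NodeId = u32;

/// Hypothetical routing lookup; `None` while no leader is known yet.
fn find_leader(_partition: PartitionId) -> Option<NodeId> {
    None // placeholder
}

fn leader_with_retries(partition: PartitionId, max_attempts: u32) -> Option<NodeId> {
    let mut backoff = Duration::from_millis(50);
    for _ in 0..max_attempts {
        if let Some(leader) = find_leader(partition) {
            return Some(leader);
        }
        std::thread::sleep(backoff);
        backoff = (backoff * 2).min(Duration::from_secs(1));
    }
    None
}

fn main() {
    assert_eq!(leader_with_retries(0, 3), None);
}
```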
We have a race condition in the runtime which manifests in the form of seemingly lost `SubmitNotification`s if a client submits invocations while the runtime is starting. The `SubmitNotification`s are needed to complete a send invocation with the correct `InvocationId`. The observable effect is that the external client hangs forever.

The reason why this occurs is the following: if the external client submits an invocation before the PP is running, it will get appended to Bifrost. Upon gaining leadership, the PP will treat the invocation as if it were resuming it from a previous leader epoch. Since we are not sending `SubmitNotification`s when resuming invocations, the external client will never get unblocked.

As a solution, I propose to also send `SubmitNotification`s when resuming invocations, because the PP currently cannot know whether such a notification was sent before it became the leader. This will also cover the situation where a user submits a new invocation while there is a leadership change, which can always happen.

The downside of this approach is that we might send quite a few `SubmitNotification`s, depending on the number of resumed invocations, when obtaining leadership. Maybe this can be mitigated by the network layer by batching these requests; see the sketch below.
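An illustrative sketch of the proposed behavior: on gaining leadership, emit a `SubmitNotification` for every resumed invocation, in batches. The types, the batch size, and the assumption that duplicated notifications are tolerated by the ingress are all mine, for illustration only:

```rust
type InvocationId = u128;

struct SubmitNotification {
    invocation_id: InvocationId,
}

fn on_leadership_acquired(
    resumed_invocations: Vec<InvocationId>,
    send_batch: &mut impl FnMut(Vec<SubmitNotification>),
) {
    // Possibly duplicated notifications are assumed fine: the new leader
    // cannot know which ones its predecessor already sent. Batching bounds
    // the per-message overhead when many invocations are resumed at once.
    for chunk in resumed_invocations.chunks(128) {
        let batch = chunk
            .iter()
            .map(|&invocation_id| SubmitNotification { invocation_id })
            .collect();
        send_batch(batch);
    }
}

fn main() {
    let mut batches = 0;
    on_leadership_acquired((0..300).collect(), &mut |b: Vec<SubmitNotification>| {
        assert!(b.len() <= 128);
        batches += 1;
    });
    assert_eq!(batches, 3);
}
```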
Tasks