Lost SubmitNotification when sending invocations while runtime is starting #1651

Closed · 6 tasks done · Tracked by #1801
tillrohrmann opened this issue Jun 20, 2024 · 7 comments

tillrohrmann (Contributor) commented Jun 20, 2024

We have a race condition in the runtime which manifests as seemingly lost SubmitNotifications when a client submits invocations while the runtime is starting. A SubmitNotification is needed to complete a send invocation with the correct InvocationId. The observable effect is that the external client hangs forever.

The reason this occurs is the following: if the external client submits an invocation before the PP is running, the invocation will get appended to Bifrost. Upon gaining leadership, the PP will treat the invocation as if it were resuming it from a previous leader epoch. Since we do not send SubmitNotifications when resuming invocations, the external client will never get unblocked.

As a solution, I propose to also send SubmitNotifications when resuming invocations, because the PP currently cannot know whether such a notification was sent before it became the leader. This will also cover the situation where a user submits a new invocation during a leadership change, which can always happen.

The downside of this approach is that we might send quite a few SubmitNotifications depending on the number of resumed invocations when obtaining leadership. Maybe this can be mitigated by the network layer by batching these requests.
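A minimal sketch of the proposed fix, using hypothetical types rather than the actual restate crates: on gaining leadership, the partition processor sends a SubmitNotification for every resumed invocation, since it cannot know whether a previous leader already sent one.

```rust
// Hypothetical types; not the actual restate APIs.
#[derive(Debug, Clone)]
struct InvocationId(u64);

#[derive(Debug)]
struct SubmitNotification {
    invocation_id: InvocationId,
}

struct PartitionProcessor {
    // Invocations found in storage when this processor becomes the leader.
    resumed_invocations: Vec<InvocationId>,
}

impl PartitionProcessor {
    fn on_leadership_gained(&self, notify_ingress: &mut dyn FnMut(SubmitNotification)) {
        for invocation_id in &self.resumed_invocations {
            // Re-send unconditionally: the new leader cannot tell whether the previous
            // leader already notified the ingress (assumption: the ingress tolerates
            // duplicate SubmitNotifications, e.g. by ignoring ones it no longer waits for).
            notify_ingress(SubmitNotification {
                invocation_id: invocation_id.clone(),
            });
        }
    }
}

fn main() {
    let pp = PartitionProcessor {
        resumed_invocations: vec![InvocationId(1), InvocationId(2)],
    };
    let mut sent = Vec::new();
    pp.on_leadership_gained(&mut |notification| sent.push(notification));
    println!("sent {} submit notifications", sent.len());
}
```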

tillrohrmann self-assigned this Jun 20, 2024
tillrohrmann (Contributor Author) commented:

The same problem applies to sending responses back to the ingress: if the current leader successfully proposes an InvokerEffectKind::End and then loses leadership, the next leader will assume that the response has already been sent to the ingress. If it has not, an external client might wait forever.

The underlying problem is that there is no way for a replica to learn about actions taken by the leader if these actions are not acknowledged by writing an ack to bifrost.

tillrohrmann (Contributor Author) commented Jun 24, 2024

To sum up the underlying problem again: Restate PPs need to send messages to the ingress (responses and SubmitNotifications) so that the ingress can respond to external clients. This communication channel is unreliable. At the moment, PPs try to send a message once and, if that fails, the message is dropped. Moreover, if a command that triggers an ingress message (Command::Invoke can trigger a SubmitNotification and Command::InvokerEffect(End) triggers an InvocationResponse) is applied without an active leader, then subsequent leaders will assume that the message has already been sent.

The observable effect is that external clients won't receive a response and hang until their connection gets interrupted (e.g. via a timeout, failure of ingress, etc.). If the invocation wasn't done with an idempotency key set, then the user has little chance of retrieving the original response.

The current state is not acceptable since a leadership change can happen at any point in time, which can leave external clients hanging. An internal implementation detail should ideally not force users to do extra work to make things work reliably with Restate when everything else is fine (e.g. the external client is happily connected to the ingress without connection loss).

Some ideas for solving/mitigating this problem are:

Making the communication channel between ingress and PPs reliable

In order to avoid the problem entirely, we need to make the communication between the ingress and the PPs reliable. One idea to achieve this is to handle PP -> ingress communication similar to how we handle PP -> PP communication: if a PP wants to send a message to the ingress, it appends the message to the ingress outbox (this could also be the regular outbox), and a shuffle component tries to deliver it to the ingress. Once the shuffle component has successfully sent the message, it tells its replicas about it by writing a Command::TruncateIngressOutbox to its log. Upon applying this command, the replicas can remove the message from the ingress outbox.

The way the ingress shuffle sends messages to the ingress could either be by writing to a dedicated ingress log (one per ingress or multiple ingresses share the same log) or by using the network for sending messages. When using the network, we need the ingress to acknowledge the reception of the message for the ingress shuffle to generate the TruncateIngressOutbox message.

The downsides are that we introduce more messages that we need to write to Bifrost, and potentially a small delay for sending messages, since we first need to write the message to the ingress outbox and then tell the ingress outbox shuffle about it before it gets sent to the ingress.
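A rough sketch of this outbox idea, with hypothetical types rather than restate's actual state machine and shuffle: the PP appends outgoing ingress messages to an ingress outbox, a shuffle task tries to deliver them, and only an acknowledged delivery leads to a Command::TruncateIngressOutbox being proposed, which every replica applies to trim its outbox.

```rust
use std::collections::VecDeque;

type SeqNumber = u64;

// Hypothetical message/command types; not the actual restate definitions.
#[derive(Debug, Clone)]
#[allow(dead_code)]
enum IngressMessage {
    SubmitNotification { invocation_id: u64 },
    InvocationResponse { invocation_id: u64, body: Vec<u8> },
}

#[derive(Debug)]
enum Command {
    // Applied by every replica; removes all outbox entries up to and including `up_to`.
    TruncateIngressOutbox { up_to: SeqNumber },
}

#[derive(Default)]
struct IngressOutbox {
    next_seq: SeqNumber,
    entries: VecDeque<(SeqNumber, IngressMessage)>,
}

impl IngressOutbox {
    // Called while applying a state-machine command that produces an ingress message.
    fn append(&mut self, msg: IngressMessage) -> SeqNumber {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.entries.push_back((seq, msg));
        seq
    }

    // Called when applying Command::TruncateIngressOutbox on any replica.
    fn apply(&mut self, cmd: &Command) {
        let Command::TruncateIngressOutbox { up_to } = cmd;
        while matches!(self.entries.front(), Some((seq, _)) if seq <= up_to) {
            self.entries.pop_front();
        }
    }
}

// The shuffle side: try to deliver the head of the outbox and only propose the
// truncation after the ingress has acknowledged reception of the message.
fn shuffle_step(
    outbox: &IngressOutbox,
    send_to_ingress: impl Fn(&IngressMessage) -> bool, // true = acknowledged by the ingress
    mut propose: impl FnMut(Command),
) {
    if let Some((seq, msg)) = outbox.entries.front() {
        if send_to_ingress(msg) {
            propose(Command::TruncateIngressOutbox { up_to: *seq });
        }
    }
}

fn main() {
    let mut outbox = IngressOutbox::default();
    outbox.append(IngressMessage::SubmitNotification { invocation_id: 42 });

    // Pretend the ingress acknowledged the message and collect the proposed command.
    let mut proposed = Vec::new();
    shuffle_step(&outbox, |_| true, |cmd| proposed.push(cmd));

    // All replicas later apply the truncation when reading it from the log.
    for cmd in &proposed {
        outbox.apply(cmd);
    }
    assert!(outbox.entries.is_empty());
}
```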

Let the ingress close client connections if a lost message is detected

If the ingress could detect that a message was lost for a respective client connection, then it could close the connection to signal the client that something went wrong. If Restate is allowed to do this, then we will effectively require that users always invoke services idempotently since it can always happen that Restate closes their connection even if there wasn't a problem between the client and the ingress (one might still argue that this is good practice anyway since the connection could also be closed because of a fault that happens between the external client and the ingress).

One way to detect that a message might have been lost is to remember, for every dispatcher request, the LSN of the corresponding Bifrost entry. If the ingress then wants to check a request/connection, it can ask the corresponding leader whether it has already processed that LSN. If yes, it can check whether the invocation with the given invocation id is still running. If the invocation is not running, then either we have lost a response/submit notification or one is still in flight; so after some timeout we can somewhat safely conclude that a message was lost and close the connection. If the invocation is still running, the ingress can retrieve the SubmitNotification by looking up the invocation id.

A way to trigger checking a connection is to observe leadership changes of the PPs: a leadership change can trigger the check since it is a necessary condition for losing a message.
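A sketch of what such a check could look like on the ingress side, with hypothetical types (the real answers would come from RPCs to the leading PP): the ingress remembers the LSN behind every pending request and, once the leader has applied that LSN and the invocation is no longer running, concludes after a grace period that the message was lost and closes the connection.

```rust
use std::time::{Duration, Instant};

type Lsn = u64;

// Hypothetical bookkeeping for one in-flight ingress request.
struct PendingRequest {
    lsn: Lsn,                           // LSN of the Bifrost entry behind the dispatcher request
    completed_seen_at: Option<Instant>, // when we first learned the invocation finished
}

enum CheckOutcome {
    KeepWaiting,
    CloseConnection, // the response/SubmitNotification was most likely lost
}

fn check_pending_request(
    req: &mut PendingRequest,
    leader_applied_lsn: Lsn,        // answer from the leading PP
    invocation_still_running: bool, // answer from the leading PP
    grace_period: Duration,
) -> CheckOutcome {
    if leader_applied_lsn < req.lsn || invocation_still_running {
        // Either the leader has not processed the request yet, or the invocation is
        // still running (its SubmitNotification can then be re-fetched by invocation id).
        return CheckOutcome::KeepWaiting;
    }
    // The invocation is finished but the ingress never saw a response/submit
    // notification. Only conclude "lost" after a grace period, because the message
    // could still be in flight.
    let first_seen = *req.completed_seen_at.get_or_insert_with(Instant::now);
    if first_seen.elapsed() >= grace_period {
        CheckOutcome::CloseConnection
    } else {
        CheckOutcome::KeepWaiting
    }
}

fn main() {
    let mut req = PendingRequest { lsn: 10, completed_seen_at: None };
    // A leadership change triggered this check; the new leader is already at LSN 12.
    match check_pending_request(&mut req, 12, false, Duration::ZERO) {
        CheckOutcome::CloseConnection => println!("close connection: message likely lost"),
        CheckOutcome::KeepWaiting => println!("keep waiting"),
    }
}
```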

Enforce mandatory timeouts for external calls

If external client connections always have an associated timeout, then clients wouldn't hang.

This, of course, raises the question of what this timeout should be, since it limits the maximum duration of a service invocation when doing a synchronous call. One answer could be to discourage people from doing synchronous calls and instead have them submit invocations and then "poll" for the results.

Another problem arises when a non-idempotent invocation times out: there is currently no reliable way to fetch its result. A solution could be to either require that invocations are always idempotent, or to also store the result for non-idempotent invocations. To make the latter work, we would have to be able to create an invocation with a specified invocation id in case we already submitted the invocation on a previous attempt.

Another problem with this approach is that it will increase the P99 latency significantly (by the configured timeout).

tillrohrmann (Contributor Author) commented Jun 24, 2024

Personally, I am leaning towards making the communication between the ingress and the PPs reliable, since it feels a bit odd to me that Restate can "lose" some messages internally, which would require users to act in a specific way (make all invocations idempotent and add retries in case of a connection termination; although this is what users should do anyway to properly handle faults that might arise between the client and the ingress).

tillrohrmann (Contributor Author) commented:

Ahmed proposed a new variant: the ingress and the partition processors communicate via RPCs. Instead of writing directly to Bifrost, the ingress will RPC the leading partition processor (routed based on the idempotency key or invocation id) and tell it to self-propose an invocation. The partition processor responds with the invocation id once the self-proposed message is applied. To retrieve the result, the ingress RPCs any of the replicas to ask for it. Each replica keeps the results of an invocation around for a while (configurable) to fulfill these requests in case of a retried ingress request. Replicas can remove these completed results based on their own local time as long as they don't use this information in their state machines.

The benefit of this approach is that we don't introduce additional writes to Bifrost. The "downside" is that we need to keep the results of completed ingress invocations around for some time.
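A sketch of the result-retention part of this variant, with hypothetical types rather than restate's actual storage layer: every replica keeps the result of a completed ingress invocation for a configurable retention period so that a retried ingress RPC can still fetch it. Eviction is based on local wall-clock time, which is safe as long as the retained results never feed back into the deterministic state machine.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct CompletedResult {
    response: Vec<u8>,
    completed_at: Instant,
}

struct CompletedResultsCache {
    retention: Duration,
    results: HashMap<u64 /* invocation id */, CompletedResult>,
}

impl CompletedResultsCache {
    fn new(retention: Duration) -> Self {
        Self { retention, results: HashMap::new() }
    }

    // Called when the replica observes the completion of an ingress invocation.
    fn store(&mut self, invocation_id: u64, response: Vec<u8>) {
        self.results
            .insert(invocation_id, CompletedResult { response, completed_at: Instant::now() });
    }

    // Serves retried ingress RPCs asking for the result of an invocation.
    fn get(&self, invocation_id: u64) -> Option<&[u8]> {
        self.results.get(&invocation_id).map(|r| r.response.as_slice())
    }

    // Local, non-deterministic eviction: it only limits how long a retried ingress
    // request can still find the result, and never influences the state machine.
    fn evict_expired(&mut self) {
        let retention = self.retention;
        self.results.retain(|_, r| r.completed_at.elapsed() < retention);
    }
}

fn main() {
    let mut cache = CompletedResultsCache::new(Duration::from_secs(60));
    cache.store(7, b"hello".to_vec());
    assert_eq!(cache.get(7), Some(&b"hello"[..]));
    cache.evict_expired();
}
```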

tillrohrmann added a commit to tillrohrmann/restate that referenced this issue Jul 3, 2024
This is a temporary band-aid to solve the problem that at start-up
we might not respond to some invocations if the partition processors
are slow at claiming leadership. See restatedev#1651
for more details.

The band-aid introduces a oneshot that is completed by the cluster controller
once it has seen a leader for every partition. The ingress waits for this
oneshot to complete before it starts. Note that this band-aid only works for
single-process setups where the Admin role is colocated with the Worker role.
This should cover the normal deployment right now.

This fixes restatedev#1684.
tillrohrmann removed their assignment Jul 29, 2024
tillrohrmann self-assigned this Aug 26, 2024
tillrohrmann (Contributor Author) commented Aug 28, 2024

We decided to replace the existing ingress <> PP communication with an RPC-based one (a sketch of the ingress-side flow follows after the design document link):

  • Change the communication between ingresses and partition processors to be RPC-based
  • Ingresses ask leading partition processors for results (potentially “long polling” for the result of a long-running invocation)
  • Ingresses resend their requests after timeout (later they can also monitor for leadership changes or message loss to resend requests)
  • Partition processors can store invocation results to solve the problem of disconnected ingresses
  • Partition processors don’t have to ensure the delivery of messages to an ingress since the ingress would retry requesting it

Design document: https://docs.google.com/document/d/1_LcgBTsghkpjTlyqh8DNidZyFDLc8-DNw88dF741hPU
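A sketch of what the ingress side could look like under this design, with a hypothetical RPC trait rather than the actual restate network API: the ingress asks the leading PP to self-propose the invocation, then (long-)polls for the result and simply re-sends its requests after a timeout, so the PP never has to guarantee delivery towards the ingress.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum PollResult {
    NotReadyYet,
    Completed(Vec<u8>),
}

// What the ingress needs from the partition processor, expressed as plain RPCs
// (hypothetical trait for illustration).
trait PartitionProcessorRpc {
    // Ask the current leader to self-propose the invocation; returns the invocation id
    // once the proposal has been applied.
    fn submit(&self, request: &[u8]) -> Result<u64, &'static str>;
    // Long-poll for the result of an invocation.
    fn poll_result(&self, invocation_id: u64, timeout: Duration) -> Result<PollResult, &'static str>;
}

fn invoke_and_wait(pp: &dyn PartitionProcessorRpc, request: &[u8]) -> Vec<u8> {
    // Retry the submission until some leader accepts it (during start-up or a
    // leadership change there may temporarily be no leader to talk to).
    let invocation_id = loop {
        match pp.submit(request) {
            Ok(id) => break id,
            Err(_) => continue, // a real implementation would back off here
        }
    };
    // Poll until the result is available; timeouts and errors simply lead to a retry.
    loop {
        match pp.poll_result(invocation_id, Duration::from_secs(30)) {
            Ok(PollResult::Completed(response)) => return response,
            Ok(PollResult::NotReadyYet) | Err(_) => continue,
        }
    }
}

// Trivial in-memory stand-in so the sketch runs end to end.
struct MockPp;
impl PartitionProcessorRpc for MockPp {
    fn submit(&self, _request: &[u8]) -> Result<u64, &'static str> {
        Ok(1)
    }
    fn poll_result(&self, _id: u64, _timeout: Duration) -> Result<PollResult, &'static str> {
        Ok(PollResult::Completed(b"done".to_vec()))
    }
}

fn main() {
    let response = invoke_and_wait(&MockPp, b"request");
    assert_eq!(response, b"done");
}
```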

tillrohrmann (Contributor Author) commented:

One way to make the retry behavior event-driven rather than timeout-based is to require that responses to requests are sent via the same physical connection on which the request was received. Given this assumption, we can monitor the liveness of the connection and retry the requests if the connection breaks. Moreover, the PartitionProcessor would need to fail all pending requests upon a leadership change. This is similar to how the RemoteSequencer operates (#2003).
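A sketch of this event-driven retry idea, with hypothetical event types rather than restate's actual networking: because responses travel over the same physical connection that carried the request, a closed connection (or, symmetrically, a leadership change that fails all pending requests on the PP side) is enough of a signal to retry, without any timer.

```rust
use std::collections::HashSet;
use std::sync::mpsc;

#[derive(Debug)]
enum Event {
    ResponseReceived { request_id: u64, body: Vec<u8> },
    // The physical connection to the leading PP broke: requests routed over it can no
    // longer be answered and must be retried on a new connection.
    ConnectionClosed,
}

fn run_ingress_side(events: mpsc::Receiver<Event>, mut retry: impl FnMut(u64)) {
    // Requests currently awaiting a response over this connection.
    let mut pending: HashSet<u64> = [1, 2].into_iter().collect();

    for event in events {
        match event {
            Event::ResponseReceived { request_id, .. } => {
                pending.remove(&request_id);
            }
            Event::ConnectionClosed => {
                // Event-driven retry: no timeout needed, the broken connection tells us
                // that the outstanding requests will never be answered here.
                for request_id in pending.iter().copied() {
                    retry(request_id);
                }
            }
        }
        if pending.is_empty() {
            break;
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(Event::ResponseReceived { request_id: 1, body: b"ok".to_vec() }).unwrap();
    tx.send(Event::ConnectionClosed).unwrap();
    drop(tx);

    run_ingress_side(rx, |request_id| println!("retrying request {request_id}"));
}
```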

tillrohrmann (Contributor Author) commented:

Fixed via 3a293a4

tillrohrmann added a commit to tillrohrmann/restate that referenced this issue Nov 4, 2024
…ost SubmitNotifications

With restatedev#1651 being fixed, we no longer need to wait for all partitions to have a
running leader before we can start the ingress in order to avoid stuck invocations. That's
because the ingress will now retry until it finds an active leader for a given partition.

This fixes restatedev#2206.