Lost SubmitNotification when sending invocations while runtime is starting #1651

Closed · 6 tasks done · Tracked by #1801
tillrohrmann opened this issue Jun 20, 2024 · 7 comments

tillrohrmann (Contributor) commented Jun 20, 2024

We have a race condition in the runtime which manifests as seemingly lost SubmitNotifications when a client submits invocations while the runtime is starting. A SubmitNotification is needed to complete a send invocation with the correct InvocationId. The observable effect is that the external client hangs forever.

The reason this occurs is the following: if the external client submits an invocation before the PP is running, the invocation will get appended to Bifrost. Upon gaining leadership, the PP will treat the invocation as if it were resuming it from a previous leader epoch. Since we do not send SubmitNotifications when resuming invocations, the external client will never get unblocked.

As a solution, I propose to also send SubmitNotifications when resuming invocations, because the PP currently cannot know whether such a notification was sent before it became the leader. This will also cover the situation where a user submits a new invocation during a leadership change, which can always happen.

The downside of this approach is that we might send quite a few SubmitNotifications depending on the number of resumed invocations when obtaining leadership. Maybe this can be mitigated by the network layer by batching these requests.
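A minimal sketch of the proposed fix, using hypothetical types rather than the actual restate crates: on gaining leadership, the partition processor sends a SubmitNotification for every resumed invocation, since it cannot know whether a previous leader already sent one.

```rust
// Hypothetical types; not the actual restate APIs.
#[derive(Debug, Clone)]
struct InvocationId(u64);

#[derive(Debug)]
struct SubmitNotification {
    invocation_id: InvocationId,
}

struct PartitionProcessor {
    // Invocations found in storage when this processor becomes the leader.
    resumed_invocations: Vec<InvocationId>,
}

impl PartitionProcessor {
    fn on_leadership_gained(&self, notify_ingress: &mut dyn FnMut(SubmitNotification)) {
        for invocation_id in &self.resumed_invocations {
            // Re-send unconditionally: the new leader cannot tell whether the previous
            // leader already notified the ingress (assumption: the ingress tolerates
            // duplicate SubmitNotifications, e.g. by ignoring ones it no longer waits for).
            notify_ingress(SubmitNotification {
                invocation_id: invocation_id.clone(),
            });
        }
    }
}

fn main() {
    let pp = PartitionProcessor {
        resumed_invocations: vec![InvocationId(1), InvocationId(2)],
    };
    let mut sent = Vec::new();
    pp.on_leadership_gained(&mut |notification| sent.push(notification));
    println!("sent {} submit notifications", sent.len());
}
```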

tillrohrmann self-assigned this Jun 20, 2024
tillrohrmann (Contributor Author) commented:

The same problem applies to sending responses back to the ingress: if the current leader successfully proposes an InvokerEffectKind::End and then loses leadership, the next leader will assume that the response has already been sent to the ingress. If it has not, an external client might wait forever.

The underlying problem is that there is no way for a replica to learn about actions taken by the leader if these actions are not acknowledged by writing an ack to bifrost.

tillrohrmann (Contributor Author) commented Jun 24, 2024

To sum up the underlying problem again: Restate PPs need to send messages to the ingress (responses and SubmitNotifications) so that the ingress can respond to external clients. This communication channel is unreliable. At the moment, PPs try to send a message once and, if that fails, the message is dropped. Moreover, if a command that triggers an ingress message (Command::Invoke can trigger a SubmitNotification and Command::InvokerEffect(End) triggers an InvocationResponse) is applied without an active leader, then subsequent leaders will assume that the message has already been sent.

The observable effect is that external clients won't receive a response and hang until their connection gets interrupted (e.g. via a timeout, failure of ingress, etc.). If the invocation wasn't done with an idempotency key set, then the user has little chance of retrieving the original response.

The current state is not acceptable since a leadership change can happen at any point in time, which can leave external clients hanging. An internal implementation detail should ideally not force users to do extra work to make things work reliably with Restate when everything else is fine (e.g. the external client is happily connected to the ingress without connection loss).

Some ideas for solving/mitigating this problem are:

Making the communication channel between ingress and PPs reliable

In order to avoid the problem entirely, we need to make the communication between the ingress and the PPs reliable. One idea to achieve this is to handle PP -> ingress communication similar to how we handle PP -> PP communication: if a PP wants to send a message to the ingress, it appends the message to the ingress outbox (this could also be the regular outbox), and a shuffle component tries to deliver it to the ingress. Once the shuffle component has successfully sent the message, it tells its replicas about it by writing a Command::TruncateIngressOutbox to its log. Upon applying this command, the replicas can remove the message from the ingress outbox.

The way the ingress shuffle sends messages to the ingress could either be by writing to a dedicated ingress log (one per ingress or multiple ingresses share the same log) or by using the network for sending messages. When using the network, we need the ingress to acknowledge the reception of the message for the ingress shuffle to generate the TruncateIngressOutbox message.

The downsides are that we introduce more messages that we need to write to Bifrost, and potentially a small delay for sending messages, since we first need to write the message to the ingress outbox and then tell the ingress outbox shuffle about it before it gets sent to the ingress.
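A rough sketch of this outbox idea, with hypothetical types rather than restate's actual state machine and shuffle: the PP appends outgoing ingress messages to an ingress outbox, a shuffle task tries to deliver them, and only an acknowledged delivery leads to a Command::TruncateIngressOutbox being proposed, which every replica applies to trim its outbox.

```rust
use std::collections::VecDeque;

type SeqNumber = u64;

// Hypothetical message/command types; not the actual restate definitions.
#[derive(Debug, Clone)]
#[allow(dead_code)]
enum IngressMessage {
    SubmitNotification { invocation_id: u64 },
    InvocationResponse { invocation_id: u64, body: Vec<u8> },
}

#[derive(Debug)]
enum Command {
    // Applied by every replica; removes all outbox entries up to and including `up_to`.
    TruncateIngressOutbox { up_to: SeqNumber },
}

#[derive(Default)]
struct IngressOutbox {
    next_seq: SeqNumber,
    entries: VecDeque<(SeqNumber, IngressMessage)>,
}

impl IngressOutbox {
    // Called while applying a state-machine command that produces an ingress message.
    fn append(&mut self, msg: IngressMessage) -> SeqNumber {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.entries.push_back((seq, msg));
        seq
    }

    // Called when applying Command::TruncateIngressOutbox on any replica.
    fn apply(&mut self, cmd: &Command) {
        let Command::TruncateIngressOutbox { up_to } = cmd;
        while matches!(self.entries.front(), Some((seq, _)) if seq <= up_to) {
            self.entries.pop_front();
        }
    }
}

// The shuffle side: try to deliver the head of the outbox and only propose the
// truncation after the ingress has acknowledged reception of the message.
fn shuffle_step(
    outbox: &IngressOutbox,
    send_to_ingress: impl Fn(&IngressMessage) -> bool, // true = acknowledged by the ingress
    mut propose: impl FnMut(Command),
) {
    if let Some((seq, msg)) = outbox.entries.front() {
        if send_to_ingress(msg) {
            propose(Command::TruncateIngressOutbox { up_to: *seq });
        }
    }
}

fn main() {
    let mut outbox = IngressOutbox::default();
    outbox.append(IngressMessage::SubmitNotification { invocation_id: 42 });

    // Pretend the ingress acknowledged the message and collect the proposed command.
    let mut proposed = Vec::new();
    shuffle_step(&outbox, |_| true, |cmd| proposed.push(cmd));

    // All replicas later apply the truncation when reading it from the log.
    for cmd in &proposed {
        outbox.apply(cmd);
    }
    assert!(outbox.entries.is_empty());
}
```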

Let the ingress close client connections if a lost message is detected

If the ingress could detect that a message was lost for a respective client connection, then it could close the connection to signal the client that something went wrong. If Restate is allowed to do this, then we will effectively require that users always invoke services idempotently since it can always happen that Restate closes their connection even if there wasn't a problem between the client and the ingress (one might still argue that this is good practice anyway since the connection could also be closed because of a fault that happens between the external client and the ingress).

One way to detect that a message might have been lost is to remember, for every dispatcher request, the LSN of the corresponding Bifrost entry. If the ingress then wants to check a request/connection, it can ask the corresponding leader whether it has already processed that LSN. If yes, it can check whether the invocation with the given invocation id is still running. If the invocation is not running, then either we have lost a response/submit notification or one is still in flight; so after some timeout we can somewhat safely conclude that a message was lost and close the connection. If the invocation is still running, the ingress can retrieve the SubmitNotification by looking up the invocation id.

A way to trigger checking a connection is to observe leadership changes of the PPs: a leadership change can trigger the check since it is a necessary condition for losing a message.
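A sketch of what such a check could look like on the ingress side, with hypothetical types (the real answers would come from RPCs to the leading PP): the ingress remembers the LSN behind every pending request and, once the leader has applied that LSN and the invocation is no longer running, concludes after a grace period that the message was lost and closes the connection.

```rust
use std::time::{Duration, Instant};

type Lsn = u64;

// Hypothetical bookkeeping for one in-flight ingress request.
struct PendingRequest {
    lsn: Lsn,                           // LSN of the Bifrost entry behind the dispatcher request
    completed_seen_at: Option<Instant>, // when we first learned the invocation finished
}

enum CheckOutcome {
    KeepWaiting,
    CloseConnection, // the response/SubmitNotification was most likely lost
}

fn check_pending_request(
    req: &mut PendingRequest,
    leader_applied_lsn: Lsn,        // answer from the leading PP
    invocation_still_running: bool, // answer from the leading PP
    grace_period: Duration,
) -> CheckOutcome {
    if leader_applied_lsn < req.lsn || invocation_still_running {
        // Either the leader has not processed the request yet, or the invocation is
        // still running (its SubmitNotification can then be re-fetched by invocation id).
        return CheckOutcome::KeepWaiting;
    }
    // The invocation is finished but the ingress never saw a response/submit
    // notification. Only conclude "lost" after a grace period, because the message
    // could still be in flight.
    let first_seen = *req.completed_seen_at.get_or_insert_with(Instant::now);
    if first_seen.elapsed() >= grace_period {
        CheckOutcome::CloseConnection
    } else {
        CheckOutcome::KeepWaiting
    }
}

fn main() {
    let mut req = PendingRequest { lsn: 10, completed_seen_at: None };
    // A leadership change triggered this check; the new leader is already at LSN 12.
    match check_pending_request(&mut req, 12, false, Duration::ZERO) {
        CheckOutcome::CloseConnection => println!("close connection: message likely lost"),
        CheckOutcome::KeepWaiting => println!("keep waiting"),
    }
}
```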

Enforce mandatory timeouts for external calls

If external client connections always have an associated timeout, then clients wouldn't hang.

This, of course, raises the question of what this timeout should be, since it limits the maximum duration of a service invocation when doing a synchronous call. One answer could be to discourage people from doing synchronous calls and instead have them submit invocations and then "poll" for the results.

Another problem arises when a non-idempotent invocation times out: there is currently no reliable way to fetch its result. A solution could be to either require that invocations are always idempotent, or to also store the result for non-idempotent invocations. To make the latter work, we would have to be able to create an invocation with a specified invocation id in case we already submitted the invocation on a previous attempt.

Another problem with this approach is that it will increase the P99 latency significantly (by the configured timeout).

tillrohrmann (Contributor Author) commented Jun 24, 2024

Personally, I am leaning towards making the communication between the ingress and the PPs reliable, since it feels a bit odd to me that Restate can "lose" some messages internally, which would require users to act in a specific way (make all invocations idempotent and add retries in case of a connection termination; although this is what users should do anyway to properly handle faults that might arise between the client and the ingress).

tillrohrmann (Contributor Author) commented:

Ahmed proposed a new variant: the ingress and the partition processors communicate via RPCs. Instead of writing directly to Bifrost, the ingress will RPC the leading partition processor (routed based on the idempotency key or invocation id) and tell it to self-propose an invocation. The partition processor responds with the invocation id once the self-proposed message is applied. To retrieve the result, the ingress RPCs any of the replicas to ask for it. Each replica keeps the results of an invocation around for a while (configurable) to fulfill these requests in case of a retried ingress request. Replicas can remove these completed results based on their own local time as long as they don't use this information in their state machines.

The benefit of this approach is that we don't introduce additional writes to Bifrost. The "downside" is that we need to keep the results of completed ingress invocations around for some time.
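A sketch of the result-retention part of this variant, with hypothetical types rather than restate's actual storage layer: every replica keeps the result of a completed ingress invocation for a configurable retention period so that a retried ingress RPC can still fetch it. Eviction is based on local wall-clock time, which is safe as long as the retained results never feed back into the deterministic state machine.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct CompletedResult {
    response: Vec<u8>,
    completed_at: Instant,
}

struct CompletedResultsCache {
    retention: Duration,
    results: HashMap<u64 /* invocation id */, CompletedResult>,
}

impl CompletedResultsCache {
    fn new(retention: Duration) -> Self {
        Self { retention, results: HashMap::new() }
    }

    // Called when the replica observes the completion of an ingress invocation.
    fn store(&mut self, invocation_id: u64, response: Vec<u8>) {
        self.results
            .insert(invocation_id, CompletedResult { response, completed_at: Instant::now() });
    }

    // Serves retried ingress RPCs asking for the result of an invocation.
    fn get(&self, invocation_id: u64) -> Option<&[u8]> {
        self.results.get(&invocation_id).map(|r| r.response.as_slice())
    }

    // Local, non-deterministic eviction: it only limits how long a retried ingress
    // request can still find the result, and never influences the state machine.
    fn evict_expired(&mut self) {
        let retention = self.retention;
        self.results.retain(|_, r| r.completed_at.elapsed() < retention);
    }
}

fn main() {
    let mut cache = CompletedResultsCache::new(Duration::from_secs(60));
    cache.store(7, b"hello".to_vec());
    assert_eq!(cache.get(7), Some(&b"hello"[..]));
    cache.evict_expired();
}
```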

tillrohrmann added a commit to tillrohrmann/restate that referenced this issue Jul 3, 2024
This is a temporary band-aid to solve the problem that at start-up
we might not respond to some invocations if the partition processors
are slow at claiming leadership. See restatedev#1651
for more details.

The band-aid introduces a oneshot that is completed by the cluster controller
once it has seen a leader for every partition. The ingress waits for this
oneshot to complete before it starts. Note that this band-aid only works for
single-process setups where the Admin role is colocated with the Worker role.
This should cover the normal deployment right now.

This fixes restatedev#1684.
tillrohrmann removed their assignment Jul 29, 2024
tillrohrmann self-assigned this Aug 26, 2024
tillrohrmann (Contributor Author) commented Aug 28, 2024

We decided to replace the existing ingress <> PP communication with an RPC-based one (a sketch of the ingress-side flow follows after the design document link):

  • Change the communication between ingresses and partition processors to be RPC-based
  • Ingresses ask leading partition processors for results (potentially “long polling” for the result of a long-running invocation)
  • Ingresses resend their requests after timeout (later they can also monitor for leadership changes or message loss to resend requests)
  • Partition processors can store invocation results to solve the problem of disconnected ingresses
  • Partition processors don’t have to ensure the delivery of messages to an ingress since the ingress would retry requesting it

Design document: https://docs.google.com/document/d/1_LcgBTsghkpjTlyqh8DNidZyFDLc8-DNw88dF741hPU
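A sketch of what the ingress side could look like under this design, with a hypothetical RPC trait rather than the actual restate network API: the ingress asks the leading PP to self-propose the invocation, then (long-)polls for the result and simply re-sends its requests after a timeout, so the PP never has to guarantee delivery towards the ingress.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum PollResult {
    NotReadyYet,
    Completed(Vec<u8>),
}

// What the ingress needs from the partition processor, expressed as plain RPCs
// (hypothetical trait for illustration).
trait PartitionProcessorRpc {
    // Ask the current leader to self-propose the invocation; returns the invocation id
    // once the proposal has been applied.
    fn submit(&self, request: &[u8]) -> Result<u64, &'static str>;
    // Long-poll for the result of an invocation.
    fn poll_result(&self, invocation_id: u64, timeout: Duration) -> Result<PollResult, &'static str>;
}

fn invoke_and_wait(pp: &dyn PartitionProcessorRpc, request: &[u8]) -> Vec<u8> {
    // Retry the submission until some leader accepts it (during start-up or a
    // leadership change there may temporarily be no leader to talk to).
    let invocation_id = loop {
        match pp.submit(request) {
            Ok(id) => break id,
            Err(_) => continue, // a real implementation would back off here
        }
    };
    // Poll until the result is available; timeouts and errors simply lead to a retry.
    loop {
        match pp.poll_result(invocation_id, Duration::from_secs(30)) {
            Ok(PollResult::Completed(response)) => return response,
            Ok(PollResult::NotReadyYet) | Err(_) => continue,
        }
    }
}

// Trivial in-memory stand-in so the sketch runs end to end.
struct MockPp;
impl PartitionProcessorRpc for MockPp {
    fn submit(&self, _request: &[u8]) -> Result<u64, &'static str> {
        Ok(1)
    }
    fn poll_result(&self, _id: u64, _timeout: Duration) -> Result<PollResult, &'static str> {
        Ok(PollResult::Completed(b"done".to_vec()))
    }
}

fn main() {
    let response = invoke_and_wait(&MockPp, b"request");
    assert_eq!(response, b"done");
}
```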

tillrohrmann (Contributor Author) commented:

One way to make the retry behavior event-driven rather than timeout-based is to require that responses to requests are sent via the same physical connection on which the request was received. Given this assumption, we can monitor the liveness of the connection and retry the requests if the connection breaks. Moreover, the PartitionProcessor would need to fail all pending requests upon a leadership change. This is similar to how the RemoteSequencer operates (#2003).
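A sketch of this event-driven retry idea, with hypothetical event types rather than restate's actual networking: because responses travel over the same physical connection that carried the request, a closed connection (or, symmetrically, a leadership change that fails all pending requests on the PP side) is enough of a signal to retry, without any timer.

```rust
use std::collections::HashSet;
use std::sync::mpsc;

#[derive(Debug)]
enum Event {
    ResponseReceived { request_id: u64, body: Vec<u8> },
    // The physical connection to the leading PP broke: requests routed over it can no
    // longer be answered and must be retried on a new connection.
    ConnectionClosed,
}

fn run_ingress_side(events: mpsc::Receiver<Event>, mut retry: impl FnMut(u64)) {
    // Requests currently awaiting a response over this connection.
    let mut pending: HashSet<u64> = [1, 2].into_iter().collect();

    for event in events {
        match event {
            Event::ResponseReceived { request_id, .. } => {
                pending.remove(&request_id);
            }
            Event::ConnectionClosed => {
                // Event-driven retry: no timeout needed, the broken connection tells us
                // that the outstanding requests will never be answered here.
                for request_id in pending.iter().copied() {
                    retry(request_id);
                }
            }
        }
        if pending.is_empty() {
            break;
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(Event::ResponseReceived { request_id: 1, body: b"ok".to_vec() }).unwrap();
    tx.send(Event::ConnectionClosed).unwrap();
    drop(tx);

    run_ingress_side(rx, |request_id| println!("retrying request {request_id}"));
}
```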

tillrohrmann (Contributor Author) commented:

Fixed via 3a293a4

tillrohrmann added a commit to tillrohrmann/restate that referenced this issue Nov 4, 2024
…ost SubmitNotifications

With restatedev#1651 being fixed, we no longer need to wait for all partitions to have a
running leader before we can start the ingress in order to avoid stuck invocations. That's
because the ingress will now retry until it finds an active leader for a given partition.

This fixes restatedev#2206.