Add guaranteed eventual consistency #3561

Closed
4 tasks done
Dakkaron opened this issue Jul 9, 2023 · 4 comments
Labels
area: federation support federation via activitypub enhancement New feature or request

Comments

@Dakkaron

Dakkaron commented Jul 9, 2023

Requirements

  • Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a feature request? Do not put multiple feature requests in one issue.
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Is your proposal related to a problem?

To avoid desyncs between instances as user numbers keep climbing, federation should guarantee eventual consistency.

Describe the solution you'd like.

  • Outgoing messages should be stored in the database so that they don't get lost when the server is terminated before sending all messages
  • Retries should never be abandoned
  • To reduce the performance impact of unreachable instances, retries shouldn't be handled on a per-message basis, but instead on a per-remote-instance basis

An option to handle that would be to permanently store each outgoing message in a table, with an incrementing ID.

For each remote instance, a record is kept which message was the last successfully sent one.

For each instance that still has unreceived messages, periodically check whether the instance is reachable. If it is reachable, send messages starting from the oldest.
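The approach described above could be sketched roughly as follows. This is a minimal in-memory illustration, not Lemmy code: `Outbox`, `enqueue`, `pending_for`, and `deliver_pending` are hypothetical names, a real version would keep both structures in database tables, and the `send` callable stands in for whatever actually delivers a message.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Outbox:
    """Hypothetical in-memory stand-in for the outgoing-message table."""
    messages: List[Tuple[int, str]] = field(default_factory=list)  # (id, payload)
    last_sent: Dict[str, int] = field(default_factory=dict)  # instance -> last sent id
    next_id: int = 1

    def enqueue(self, payload: str) -> int:
        """Store an outgoing message under an incrementing ID."""
        msg_id = self.next_id
        self.next_id += 1
        self.messages.append((msg_id, payload))
        return msg_id

    def pending_for(self, instance: str) -> List[Tuple[int, str]]:
        """Messages the instance has not received yet, oldest first."""
        cursor = self.last_sent.get(instance, 0)
        return [m for m in self.messages if m[0] > cursor]

    def deliver_pending(self, instance: str, send: Callable[[str, str], bool]) -> int:
        """Send pending messages in order and stop at the first failure.

        Returns the number of messages delivered in this pass.
        """
        delivered = 0
        for msg_id, payload in self.pending_for(instance):
            if not send(instance, payload):
                break  # instance unreachable: keep the cursor and retry later
            self.last_sent[instance] = msg_id
            delivered += 1
        return delivered
```

Because the cursor only advances after a successful send, a failed or interrupted pass resumes from the oldest unsent message the next time the instance is reachable.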

Describe alternatives you've considered.

I don't know the code well enough to give a serious recommendation/analysis. The one above was also more of an informed guess than anything else.

Additional context

No response

@Dakkaron Dakkaron added the enhancement New feature or request label Jul 9, 2023
@WayneSheppard

I'm thinking something similar. Consistency is very important. What if an important pinned post doesn't get federated? Or your response to my post? What happens if a moderator action to remove an illegal post doesn't get federated?

What happens if Lemmy grows to 1% of the size of Reddit, with some communities reaching 500,000 subscribers? There could be thousands of federated actions each second from one community. The solution needs to be robust enough to handle this.

Outgoing messages should be stored in the database - Let's call this FederatedActionsQueue.

For each remote instance, a record is kept which message was the last successfully sent one. - Let's call this the FederatedServersQueue.

Every federation cycle, the instance runs through FederatedServersQueue and distributes actions to every subscribed server. The cycle length might start at one second, but could scale back during times of congestion.

If a server fails to respond, there is only one worker waiting for the timeout instead of thousands. We could mark that server as unresponsive in FederatedServersQueue and do an exponential backoff, until it starts responding again.
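A per-server exponential backoff along these lines might look like the sketch below. Everything here is an assumption for illustration: `ServerState`, `backoff_delay`, `record_result`, and `is_due` are hypothetical names, and the one-second base and one-hour cap are arbitrary values, not anything Lemmy defines.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical per-server row; field names are illustrative only."""
    fail_count: int = 0
    next_attempt_at: float = 0.0  # seconds, on the same clock as `now`


def backoff_delay(fail_count: int, base: float = 1.0, cap: float = 3600.0) -> float:
    """Delay before the next attempt: base * 2^(fail_count - 1), capped at `cap`."""
    if fail_count <= 0:
        return 0.0
    exponent = min(fail_count - 1, 32)  # bound the exponent to keep the math cheap
    return min(cap, base * 2 ** exponent)


def record_result(state: ServerState, ok: bool, now: float) -> None:
    """Update one server's state after a delivery attempt."""
    if ok:
        state.fail_count = 0
        state.next_attempt_at = now
    else:
        state.fail_count += 1
        state.next_attempt_at = now + backoff_delay(state.fail_count)


def is_due(state: ServerState, now: float) -> bool:
    """True if the federation cycle should try this server now."""
    return now >= state.next_attempt_at
```

Each federation cycle would then skip any server for which `is_due` is false, so an unresponsive server costs one check per cycle rather than one timeout per queued action.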

On the FederatedActionsQueue table: this is necessary with the current data model. However, if each action were stored in the main tables with a timestamp, we might be able to use the timestamp instead of the incrementing ID. I haven't thought enough about this.

@Dakkaron
Author

Dakkaron commented Jul 9, 2023

Yeah, timestamps could work if the resolution is high enough that there are never two events with the exact same timestamp.
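To illustrate the concern with tied timestamps, here is a small sketch. The event tuples and both function names are made up for the example. It also shows one possible workaround, which is an assumption on my part and not something proposed in this thread: using the ID as a tie-breaker alongside the timestamp.

```python
from typing import List, Tuple

Event = Tuple[int, int, str]  # (timestamp, id, payload)


def after_timestamp(events: List[Event], last_ts: int) -> List[Event]:
    """Cursor on the timestamp alone: events strictly newer than last_ts."""
    return [e for e in events if e[0] > last_ts]


def after_timestamp_and_id(events: List[Event], last_ts: int, last_id: int) -> List[Event]:
    """Cursor on (timestamp, id): the ID breaks ties between equal timestamps."""
    return [e for e in events if (e[0], e[1]) > (last_ts, last_id)]


# Two events share timestamp 10; suppose only "a" has been sent so far.
events = [(10, 1, "a"), (10, 2, "b"), (11, 3, "c")]
skipped = after_timestamp(events, 10)          # "b" is never selected
kept = after_timestamp_and_id(events, 10, 1)   # "b" and "c" are both selected
```

With the timestamp-only cursor, "b" is silently dropped because it is not strictly newer than the last sent timestamp, which is exactly the case where two events share a timestamp.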

Retrying per server and not per event should scale much better.

@lionirdeadman lionirdeadman added the area: federation support federation via activitypub label Jul 15, 2023
@WayneSheppard

It appears there is a pull request working on the issues brought up here.

#3605

@phiresky
Collaborator

phiresky commented Aug 22, 2023

Yes, #3605 will (if bugless) make federation guaranteed reliable, up until the point where an instance stays down for longer than the period after which the scheduled cleanup task clears activities (currently 3 months). If the instance comes back up at some point, federation activity is fully replayed from the point where it went down.

@Nutomic Nutomic closed this as completed Sep 28, 2023

5 participants