This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Large rooms are stuck in partial-state because of response body size limit on /state federation API #15127

Open
reivilibre opened this issue Feb 21, 2023 · 3 comments
Labels
A-Federated-Join joins over federation generally suck A-Federation O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Critical Blocks development, potential data loss, more than 25% of users possibly affected, no workarounds. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@reivilibre
Contributor

reivilibre commented Feb 21, 2023

Whilst resyncing a room from libera.chat to Matrix.org (Libera is resident, Morg is partial-state), the room state is too big (the response exceeds 500 MiB) and it's hitting the response body size limit:

federation_inbound4.log:2023-02-21 16:02:34,329 - synapse.http.matrixfederationclient - 240 - WARNING - _process_incoming_pdus_in_room_inner-41-$9smLiG7QFK54RlCkbKxtv3fP51S9Fpof7gx34bhvVdI---$inr3_5AH_4CTwJyBza-ZrWP4Qo3i_QvMEilMW_E9pBU---$_M55QrVpzz9THIgUgngswj99-Z6JPa5XGBQ-xr7q1h4--- - {GET-O-22420} [libera.chat] JSON response exceeded max size 524288000 - GET matrix://libera.chat/_matrix/federation/v1/state/%21fuzEIyMtiFYSFEjRCI%3Alibera.chat?event_id=%24_M55QrVpzz9THIgUgngswj99-Z6JPa5XGBQ-xr7q1h4
    # As with /send_join, /state responses can be huge.
    MAX_RESPONSE_SIZE = 500 * 1024 * 1024
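
For reference, 500 * 1024 * 1024 is the 524288000 shown in the log line. The following is a minimal sketch of how a cap like this can be enforced while a body is read in chunks. It is not Synapse's actual reader: `read_capped` and `BodyTooLargeError` are hypothetical names used only for illustration.

```python
import io

# Same value as the constant above: 524288000 bytes (500 MiB).
MAX_RESPONSE_SIZE = 500 * 1024 * 1024


class BodyTooLargeError(Exception):
    """Hypothetical error raised when a body exceeds the cap."""


def read_capped(stream, max_size, chunk_size=64 * 1024):
    """Read `stream` to the end, failing once more than `max_size` bytes arrive."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_size:
            # Give up early rather than buffering an unbounded response.
            raise BodyTooLargeError(f"response exceeded max size {max_size}")
        chunks.append(chunk)
    return b"".join(chunks)
```

With a cap of this kind, a response that is even slightly over the limit fails in full, which is why retrying the same request does not help.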

This room is now stuck and I don't think anything, short of upping this limit and waiting until we manage to hit that, will help. We need a sustainable solution.

Synapse version 1.78.0rc1 (+ morg hotfixes)

@reivilibre reivilibre added A-Federation A-Federated-Join joins over federation generally suck S-Critical Blocks development, potential data loss, more than 25% of users possibly affected, no workarounds. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Feb 21, 2023
@reivilibre
Contributor Author

I've upped the limit to 600 MiB on matrix.org but this is not sustainable. In Libera's logs, I notice that there are other servers requesting state and being fed > 500 MiB responses, so those people will be out of luck for now and Libera will be under repeated load until we fix the problem.

@ara4n
Member

ara4n commented Feb 22, 2023

huge thanks for digging into this. two thoughts:

  • can we paginate /state (chuck a next_batch on the response or similar?)
  • or can we decide not to serve up parted members in the state block (and rely on those events being pulled in by other servers incrementally, only if they go backpaginating through the DAG)?
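
Both ideas can be sketched roughly as follows. This is an illustration under assumptions, not an existing API: the `next_batch` field on /state is only the suggestion above, `make_fake_server` stands in for the remote homeserver, and treating "parted" as `membership == "leave"` is an assumption.

```python
from typing import Callable, Iterable, Iterator, Optional


def make_fake_server(events: list, page_size: int) -> Callable[[Optional[str]], dict]:
    """Stand-in for a remote /state handler that pages its output (hypothetical)."""

    def fetch_page(next_batch: Optional[str]) -> dict:
        start = int(next_batch) if next_batch is not None else 0
        end = start + page_size
        page = {"pdus": events[start:end]}
        if end < len(events):
            # Hypothetical token telling the client where to resume.
            page["next_batch"] = str(end)
        return page

    return fetch_page


def iter_state(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Follow next_batch tokens until the server stops returning one."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page["pdus"]
        token = page.get("next_batch")
        if token is None:
            return


def drop_parted_members(events: Iterable[dict]) -> Iterator[dict]:
    """Skip m.room.member events whose membership is "leave" (assumed meaning of "parted")."""
    for event in events:
        is_member = event.get("type") == "m.room.member"
        if is_member and event.get("content", {}).get("membership") == "leave":
            continue
        yield event
```

Either approach bounds the size of a single response; neither on its own settles the order in which the receiver has to process the events.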

@reivilibre
Contributor Author

> can we paginate /state (chuck a next_batch on the response or similar?)

I would like to do this. In fact, I'd like to rework the endpoint so the result can be entirely streamed (rather than buffering up 500 MiB into memory!). I know that's kind of unusual for Matrix endpoints, but it doesn't seem like a bad habit to get into. The only issue is that, in either case, the result needs to be ordered properly so Synapse can process the events as they come in, rather than buffering them all up (which is the reason for the 500 MiB limit in the first place).
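
One way to picture the ordering requirement, as a sketch only: assume "ordered properly" means that every event arrives after the events listed in its `auth_events`. That interpretation is an assumption, and `in_dependency_order` and `process_stream` are hypothetical helpers, not Synapse functions. Note that the sending side here still holds all events while sorting; only the receiver avoids buffering.

```python
from graphlib import TopologicalSorter
from typing import Iterable, Iterator


def in_dependency_order(events: Iterable[dict]) -> Iterator[dict]:
    """Sender side: yield events so that auth events precede their dependants."""
    by_id = {e["event_id"]: e for e in events}
    # Map each event to its predecessors, ignoring references outside this set.
    graph = {
        event_id: [a for a in event["auth_events"] if a in by_id]
        for event_id, event in by_id.items()
    }
    for event_id in TopologicalSorter(graph).static_order():
        yield by_id[event_id]


def process_stream(stream: Iterable[dict]) -> list:
    """Receiver side: handle events one at a time, keeping only the IDs seen so far.

    Assumes every referenced auth event is part of the same stream.
    """
    seen = set()
    processed = []
    for event in stream:
        missing = [a for a in event["auth_events"] if a not in seen]
        if missing:
            raise ValueError(f"{event['event_id']} arrived before {missing}")
        seen.add(event["event_id"])
        processed.append(event["event_id"])
    return processed
```

The receiver keeps a set of event IDs rather than the full event bodies, which is the memory saving that in-order streaming would be aiming for.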

Either that, or I suppose we could spill the events to disk and sort them out of memory, but that seems like most of the work for a far less optimal result.
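
For comparison, the spill-to-disk alternative could look roughly like the external merge sort below. It is a sketch under assumptions: `external_sort` is a hypothetical helper, and sorting by the `depth` field is only a placeholder for whatever ordering the receiver would actually need.

```python
import heapq
import json
import tempfile
from typing import Iterable, Iterator


def _write_run(batch: list, key: str):
    """Sort one in-memory batch and write it to a temporary file, one JSON object per line."""
    run = tempfile.TemporaryFile(mode="w+")
    for event in sorted(batch, key=lambda e: e[key]):
        run.write(json.dumps(event) + "\n")
    run.seek(0)
    return run


def external_sort(events: Iterable[dict], key: str = "depth", run_size: int = 1000) -> Iterator[dict]:
    """Yield events sorted by `key` while holding at most `run_size` of them in memory."""
    runs = []
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= run_size:
            runs.append(_write_run(batch, key))
            batch = []
    if batch:
        runs.append(_write_run(batch, key))
    try:
        # Each run is already sorted, so a k-way merge gives the global order.
        streams = [(json.loads(line) for line in run) for run in runs]
        yield from heapq.merge(*streams, key=lambda e: e[key])
    finally:
        for run in runs:
            run.close()
```

This keeps memory bounded, but every event is written to and read back from disk once, on top of the work of choosing a correct sort key.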

I'm intending to have a look at how the in-order streaming could be done, but I'm not expecting it to be trivial :/.
