
429 too many requests or unknown state #8295

Closed

mvgorcum opened this issue Sep 11, 2020 · 8 comments

@mvgorcum
Contributor

Since upgrading to the RC we have noticed that federation is sometimes very slow, and we are seeing a number of responses with 429 status codes.

@clokep
Member

clokep commented Sep 11, 2020

It was reported in #synapse-dev:matrix.org that this was seen on both v1.20.0rc2 and rc3.

Some logs:

2020-09-11 14:44:26,856 - synapse.http.matrixfederationclient - 505 - INFO - federation_transaction_transmission_loop-3480- {PUT-O-7868} [asra.gr] Got response headers: 429 Unknown Status
2020-09-11 14:55:42,606 - synapse.handlers.federation - 521 - WARNING - PUT-3223-$a86WkgB1U_aKftDwZBRgbsCTpFbWMHou1of_pgGekbI - [!TdAwENXmXuMrCrFEFX:maunium.net $a86WkgB1U_aKftDwZBRgbsCTpFbWMHou1of_pgGekbI]: Failed to get prev_events: Failed to send request: HttpResponseException: 429: Unknown Status

@Cadair
Contributor

Cadair commented Sep 11, 2020

Here are some logs from me trying to send to @tulir:

synapse.federation.transport.server - 172 - INFO - PUT-50227 - Marking origin 'maunium.net' as up                                                                                     
synapse.federation.transport.server - 406 - INFO - PUT-50227 - Received txn 1599827380107 from maunium.net. (PDUs: 3, EDUs: 2)                                                        
synapse.crypto.keyring - 307 - INFO - PUT-50227-$BhgsCOeI93UHjE0AofU2uI2-xKm1UnqEnsmlbQNHknc - Waiting for existing lookups for ['maunium.net'] to complete [loop 1]                  
synapse.handlers.federation - 270 - INFO - PUT-50227-$HccmlHfM-RLYKv1YF1hIO5YObIQt6u9JWHr3nkSd7Cg - [!TdAwENXmXuMrCrFEFX:maunium.net $HccmlHfM-RLYKv1YF1hIO5YObIQt6u9JWHr3nkSd7Cg] Acquiring room lock to fetch 3 missing prev_events: ['$qMpligsJMKXt8r4og7RJnOyRmUkCDcbImOxckAWWb5s', '$O0lvFkxdtRtosHN6WV3puNlcnLPT7lWvQ0NeRBy2B2E', '$2EasUj82R5-3BmPO62bkKMnXnDtFEIy68NcewszAaN8']                                                              
synapse.http.matrixfederationclient - 505 - INFO - GET-50214 - {GET-O-40343} [maunium.net] Got response headers: 429 Too Many Requests                                                
synapse.http.matrixfederationclient - 546 - INFO - GET-50214 - {GET-O-40343} [maunium.net] Request failed: GET matrix://maunium.net/_matrix/federation/v1/groups/%2Bsponsors%3Amaunium.net/profile?requester_user_id=%40cadair%3Acadair.com: HttpResponseException('429: Too Many Requests')
synapse.http.matrixfederationclient - 505 - INFO - GET-50214 - {GET-O-40343} [maunium.net] Got response headers: 429 Too Many Requests                                                
synapse.http.matrixfederationclient - 546 - INFO - GET-50214 - {GET-O-40343} [maunium.net] Request failed: GET matrix://maunium.net/_matrix/federation/v1/groups/%2Bsponsors%3Amaunium.net/profile?requester_user_id=%40cadair%3Acadair.com: HttpResponseException('429: Too Many Requests')
synapse.http.server - 76 - INFO - GET-50214 - <XForwardedForRequest at 0x7fed7b3305d0 method='GET' uri='/_matrix/client/r0/groups/%2Bsponsors%3Amaunium.net/profile' clientproto='HTTP/1.0' site=8008> SynapseError: 502 - Failed to contact group server
synapse.access.http.8008 - 311 - INFO - GET-50214 - 10.21.5.180 - 8008 - {@cadair:cadair.com} Processed request: 4.616sec/-0.000sec (0.002sec, 0.000sec) (0.000sec/0.000sec/0) 64B 502 "GET /_matrix/client/r0/groups/%2Bsponsors%3Amaunium.net/profile HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element-Nightly/2020090901 Chrome/83.0.4103.122 Electron/9.1.2 Safari/537.36" [0 dbevts]                   
synapse.http.matrixfederationclient - 505 - INFO - federation_transaction_transmission_loop-58128 - {PUT-O-39886} [maunium.net] Got response headers: 429 Too Many Requests           
synapse.http.matrixfederationclient - 546 - INFO - federation_transaction_transmission_loop-58128 - {PUT-O-39886} [maunium.net] Request failed: PUT matrix://maunium.net/_matrix/federation/v1/send/1599832034876: HttpResponseException('429: Too Many Requests')
synapse.http.server - 76 - INFO - GET-50286 - <XForwardedForRequest at 0x7fedb7ef9390 method='GET' uri='/_matrix/federation/v1/event_auth/%21TdAwENXmXuMrCrFEFX%3amaunium.net/%24YarUycGczWPvjrJxHmFA6kMx_N-64FawzkOoiQ605Hw' clientproto='HTTP/1.0' site=8048> SynapseError: 404 - Could not find event $YarUycGczWPvjrJxHmFA6kMx_N-64FawzkOoiQ605Hw

@mvgorcum
Contributor Author

I did notice a small number of Got response headers: 429 Unknown Status entries from last Wednesday, when I was running rc1, but those came from only 2 servers (harha.net and sibnsk.net).

After upgrading to rc3 I see loads of Got response headers: 429 Unknown Status and Got response headers: 429 Too Many Requests entries from many more servers.

@erikjohnston
Member

Out of interest, are people still seeing this? I think I saw someone mention that the problem had gone away for them.

@tulir
Member

tulir commented Sep 17, 2020

It did seem to go away briefly, but now my logs have tons of 429s again. In the past ~14 hours I have >500 429s from asra.gr, colab.de, riot.firechicken.net and utwente.io, plus between 10 and 100 from a bunch of other servers.

@erikjohnston
Member

Cool, can you send me your full logs for today then please?

@erikjohnston
Member

So looking at the logs it appears that the following happens:

  1. Receive a transaction from a server that requires fetching prev events.
  2. Acquire the room lock to fetch those prev events.
  3. Hit remote host to get them.
  4. Release room lock, process events and return.

However, if the remote host is responding slowly, step 3 can take a long time to return, causing the room lock to be held for an extended period. At some point the server sending the transaction times out the request and resends it. Since the retry has the same transaction ID it gets queued up in the transaction linearizer (not a ResponseCache). This has a couple of consequences: a) while queued, the federation ratelimiter still considers it an active request, and b) we don't remove requests from the linearizer if the remote has gone away.

The net result here is that when fetching the prev events from the remote host takes too long we end up stacking up transaction requests from the remote, which fills up the federation ratelimiter (it only allows 3 active requests at a time), thus causing the server to reject requests with 429.
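To make that interaction concrete, here is a toy asyncio model of the failure mode (this is not Synapse's actual Linearizer or ratelimiter code; the names, the 3-request limit and the sleep are stand-ins): the limiter counts a request as active from the moment it arrives, even while it is parked behind the per-(origin, transaction ID) lock, so duplicate retries of one slow transaction use up all the slots and an unrelated request gets a 429.

```python
import asyncio
from collections import defaultdict

MAX_ACTIVE = 3                            # stand-in for the 3-request federation ratelimiter
active = 0                                # requests currently counted by the limiter
txn_locks = defaultdict(asyncio.Lock)     # one lock per (origin, txn_id), like the linearizer

async def handle_transaction(origin: str, txn_id: str) -> str:
    global active
    if active >= MAX_ACTIVE:
        return "429 Too Many Requests"    # limiter full: reject immediately
    active += 1
    try:
        # Retries of the same transaction queue here while the first copy is
        # still fetching prev_events under the room lock; they hold a limiter
        # slot the whole time they are queued.
        async with txn_locks[(origin, txn_id)]:
            await asyncio.sleep(1)        # stand-in for the slow prev_events fetch
        return "200 OK"
    finally:
        active -= 1

async def main():
    # The origin times out and resends the same transaction twice, then an
    # unrelated request arrives: the three copies fill the limiter and the
    # fourth request is bounced with a 429.
    print(await asyncio.gather(
        handle_transaction("maunium.net", "1599832034876"),
        handle_transaction("maunium.net", "1599832034876"),
        handle_transaction("maunium.net", "1599832034876"),
        handle_transaction("other.example", "1"),
    ))  # ['200 OK', '200 OK', '200 OK', '429 Too Many Requests']

asyncio.run(main())
```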

I'm not at all sure why this is an issue in the v1.20.0 RCs; I don't think anything has changed in this area. It's possible there is a performance regression somewhere that is causing this failure mode to be hit more often.

Things that we can do to "fix" this:

  1. Linearize transactions based on origin rather than origin and transaction ID.
  2. Use a ResponseCache to deduplicate based on origin and transaction ID (see the sketch after this list).
  3. Remove requests from the linearizer and cache if the request has been cancelled.
  4. Do not include requests waiting in queues in the federation rate limiter. (This probably means moving the federation rate limiter down several layers for the transaction requests)
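For example, option 2 might look roughly like the following (a hypothetical sketch, not Synapse's actual ResponseCache API; _process_transaction and the key shape are placeholders): retries of the same (origin, transaction ID) attach to the result of the request that is already in flight instead of queueing behind a lock, so they neither redo the work nor pile up.

```python
import asyncio

# Hypothetical in-memory dedup keyed on (origin, txn_id); _process_transaction
# stands in for the real PDU/EDU handling.
_in_flight: dict[tuple[str, str], asyncio.Task] = {}

async def _process_transaction(origin: str, txn_id: str) -> str:
    await asyncio.sleep(1)                # stand-in for the actual processing
    return "200 OK"

async def handle_transaction_deduped(origin: str, txn_id: str) -> str:
    key = (origin, txn_id)
    if key not in _in_flight:
        task = asyncio.create_task(_process_transaction(origin, txn_id))
        task.add_done_callback(lambda _: _in_flight.pop(key, None))
        _in_flight[key] = task
    # Retries share the in-flight work; shield() stops a cancelled retry
    # (e.g. the remote timing out again) from cancelling the original task,
    # which also touches on option 3.
    return await asyncio.shield(_in_flight[key])
```

Combined with option 4 (only counting a request against the ratelimiter once it is actually being processed), retries of a slow transaction would no longer starve unrelated requests.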

@clokep
Member

clokep commented Sep 18, 2020

I believe this can be closed. @erikjohnston please re-open if this is incorrect!

@clokep clokep closed this as completed Sep 18, 2020