occasional QUIC connection failures #7526
TLDR: I recently discovered a bug in quic-go that causes a QUIC connection to stall if a certain packet sent shortly after completion of the QUIC handshake is declared lost. Analysis of logs from one of our DHT nodes shows that this affects around 0.37% of connections.
Connection IDs in QUIC
QUIC uses connection IDs to match packets with connections. Each endpoint chooses the connection IDs that its peer puts on the packets it sends to that endpoint. This means that a client will use connection ID X when sending a packet to the server, while the server will use connection ID Y to send a packet back to the client. Each connection can be associated with multiple connection IDs. Using a fresh connection ID is needed to guarantee unlinkability during connection migration. Endpoints are also free to occasionally switch to a new connection ID when not migrating.
Endpoints announce new connection IDs to their peer (using NEW_CONNECTION_ID frames). The peer can then switch to using a new connection ID, and retire the old connection ID (using a RETIRE_CONNECTION_ID frame).
Analysis of the Bug
The bug occurs when an endpoint receives a duplicate NEW_CONNECTION_ID frame announcing the connection ID it is already using (this can happen if the packet carrying the frame is spuriously declared lost by the peer and the frame is retransmitted). Due to an off-by-one bug, the endpoint will then send a RETIRE_CONNECTION_ID frame for that connection ID (which makes no sense at all), but still continue using this connection ID. When the peer receives the RETIRE_CONNECTION_ID frame, it will drop the association between that connection ID and the connection after 5 seconds (this delay is intended to allow reordered packets to arrive). Once the connection ID is dropped, the connection is basically dead: it times out on one side, and results in a stateless reset on the other side.
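To make the failure mode concrete, here is a minimal sketch of how an off-by-one comparison on sequence numbers could produce this behavior. This is not the quic-go source: the types `newConnectionIDFrame` and `connIDManager` and the methods `handleBuggy` and `handleFixed` are hypothetical names made up for illustration, and the exact comparison in quic-go may look different.

```go
package main

import "fmt"

// newConnectionIDFrame is a simplified, hypothetical stand-in for a
// NEW_CONNECTION_ID frame. Only the sequence number matters here.
type newConnectionIDFrame struct {
	SequenceNumber uint64
}

// connIDManager (hypothetical) tracks which connection ID is used on
// outgoing packets and which sequence numbers were queued for retirement.
type connIDManager struct {
	activeSeq uint64   // sequence number of the connection ID in use
	retired   []uint64 // RETIRE_CONNECTION_ID frames queued, by sequence number
}

// handleBuggy treats every frame that is not newer than the active
// connection ID as stale, including a duplicate of the active one.
func (m *connIDManager) handleBuggy(f newConnectionIDFrame) {
	if f.SequenceNumber <= m.activeSeq { // off by one: should be <
		m.retired = append(m.retired, f.SequenceNumber)
	}
}

// handleFixed only retires connection IDs that are strictly older than
// the active one. A duplicate of the active connection ID is ignored.
func (m *connIDManager) handleFixed(f newConnectionIDFrame) {
	if f.SequenceNumber < m.activeSeq {
		m.retired = append(m.retired, f.SequenceNumber)
	}
}

func main() {
	// The endpoint already uses connection ID 1 and then receives a
	// retransmitted NEW_CONNECTION_ID frame for that same ID.
	dup := newConnectionIDFrame{SequenceNumber: 1}

	buggy := &connIDManager{activeSeq: 1}
	buggy.handleBuggy(dup)

	fixed := &connIDManager{activeSeq: 1}
	fixed.handleFixed(dup)

	fmt.Println("buggy retires:", buggy.retired)
	fmt.Println("fixed retires:", fixed.retired)
}
```

In the buggy variant the endpoint queues a retirement for the connection ID it keeps sending on, which is exactly the inconsistent state described above.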
Measurement of the Impact
I analyzed roughly 1.55 million QUIC connections handled by our DHT booster nodes. 5678 of them (0.37%) experienced the bug described above. All but 34 of those connections either timed out or were closed with a stateless reset (these 34 connections were closed regularly within the 5-second window and therefore didn't suffer any consequences).
Proposed Solution
There are two fixes that need to be made in quic-go:
At this point, I propose to do the following:
Add a flag to control the following:
With this flag switched on, nodes that install the update will not issue any new connection IDs to their peers, thereby preventing the bug from occurring in nodes that haven't updated yet.
Once a large enough fraction of the network has upgraded, we can toggle the flag, and go back to using multiple connection IDs per connection.
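As a rough sketch of what such a flag could look like: the `config` type, the `DisableNewConnectionIDs` field, and the `connIDsToIssue` helper below are hypothetical names, not an existing quic-go API. The sketch assumes that the number of connection IDs an endpoint may issue is bounded by the peer's `active_connection_id_limit` transport parameter, and that the connection ID used during the handshake counts against that limit.

```go
package main

import "fmt"

// config is a hypothetical configuration struct. The field name is an
// assumption made for this sketch.
type config struct {
	// DisableNewConnectionIDs stops the endpoint from announcing any
	// connection IDs beyond the one used during the handshake.
	DisableNewConnectionIDs bool
}

// connIDsToIssue returns how many additional connection IDs to announce
// via NEW_CONNECTION_ID frames after the handshake. peerLimit models the
// peer's active_connection_id_limit; the handshake connection ID already
// occupies one slot.
func connIDsToIssue(cfg config, peerLimit uint64) uint64 {
	if cfg.DisableNewConnectionIDs || peerLimit <= 1 {
		return 0
	}
	return peerLimit - 1
}

func main() {
	// Flag on: no NEW_CONNECTION_ID frames are sent, so a peer that has
	// not been updated never sees a duplicate.
	fmt.Println(connIDsToIssue(config{DisableNewConnectionIDs: true}, 4))
	// Flag off: back to using multiple connection IDs per connection.
	fmt.Println(connIDsToIssue(config{DisableNewConnectionIDs: false}, 4))
}
```

With the flag on, a peer that still has the bug never receives a NEW_CONNECTION_ID frame from an updated node, so it cannot receive a duplicate one either.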