Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC3401: Native Group VoIP Signalling #3401

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
05fd5af
MSC3401: Native Group VoIP Signalling
ara4n Sep 19, 2021
7f5ee49
comments & cosmetics
ara4n Sep 20, 2021
083fd9a
grammar
ara4n Sep 20, 2021
5ee96fb
incorporate review
ara4n Sep 22, 2021
b90b85e
more feedback
ara4n Sep 23, 2021
ed37a0d
add `purpose` from #3077
ara4n Sep 23, 2021
33a64f2
Update proposals/3401-group-voip.md
ara4n Sep 23, 2021
7fd1ba6
converge better with #3077 and WebRTC norms
ara4n Sep 25, 2021
669d471
tracks have to be identified by stream + track tuple
ara4n Sep 25, 2021
48526ad
spell out that you should ignore `m.call.member` state events from pa…
ara4n Oct 12, 2021
dfd4ffe
Add basic call sequence diagram
robertlong Mar 9, 2022
3c306cc
Remove SFU datachannel ping/pong timeout section
robertlong Mar 9, 2022
4d43aae
Update m.call.member and call setup sections
robertlong Mar 10, 2022
856ddc7
spell out the unstable prefix
ara4n May 28, 2022
d109b54
add tracks back into m.call.member for SFUs to use
ara4n May 30, 2022
07f9547
add session IDs & labels
ara4n Jun 3, 2022
7a06ed7
Let call member events expire (#3831)
robintown Jun 16, 2022
32f566a
Rip out SFU bits out of MSC3401 (#3897)
SimonBrandner Oct 21, 2022
3fde32b
Move expiration timestamps to be per-device (#3941)
robintown Nov 30, 2022
05b5db2
Specify who calls who (#3942)
robintown Nov 30, 2022
43dc42f
Clarify `expires_ts`
SimonBrandner Dec 3, 2022
5635cee
Add `seq`
SimonBrandner Dec 3, 2022
b8ebe27
Use heading for Legend
SimonBrandner Dec 3, 2022
6b98d66
Fix-up some formatting
SimonBrandner Dec 3, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Update m.call.member and call setup sections
  • Loading branch information
robertlong committed Mar 10, 2022
commit 4d43aae3b105b0bf5193c4670dcba91c093e5458
93 changes: 29 additions & 64 deletions proposals/3401-group-voip.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,11 @@ The fields within the item in the `m.calls` contents are:

* `m.call_id` - the ID of the conference the user is claiming to participate in. If this doesn't match an unterminated `m.call` event, it should be ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably ought to be m.conf_id to differentiate it from IDs of 1:1 calls and match the conf_id field in m.call.* to-device events?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: currently the call_id and conf_id are not identical. This seems to be confusing if we're talking about the SFU calls (not sure how it's handled in a full-mesh).

When working on an SFU recently, I realized that conf_id was the ID of a conference (or a call if you will) which was quite logical and expected. However, what I did not expect is that in addition to the conf_id, each To-Device message has a call_id which does not match the conf_id and which seems to be uniquely generated by each participant.

The thing is: call_id field does not make any sense for the SFU at the moment (see the SFU MSC), since the SFU does not know what the call_id is (it looks like a randomly generated string that is different for each participant who tries to join a conference), but at the same time, the SFU is essentially obligated to store the call_id because the To-Device messages from the SFU to the participants are expected to have the call_id that matches the call_id value sent from participants to the SFU when they contact the SFU (I tried settings the call_id to match conf_id when sending a message from the SFU to the client, but the client discarded the message if the call_id did not match the call_id that the client sent to the SFU). So essentially, there is a conf_id the semantics of which is defined (it's the unique ID of a conference/call) and the call_id (which does not have any meaning for the SFU).

* `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
* `m.sources` - Optionally, the user can list the various media streams (and tracks within the streams) they are able to send. This is important if connecting to an SFU, as it lets the SFU know what simulcast tracks the sender can send. In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here. If the conference has no SFU, this list defines the devices which other devices should connect to full-mesh in order to participate.
* `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
* `device_id` - The device id to use for to-device messages when establishing a call
* `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have trouble understanding what this is about. Is it perhaps about protecting against for example a user opening their same element web session in multiple tabs? In any case, it might be good to spell out more explicitly what exactly what this is supposed to guard against.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mostly to deal with users hitting the refresh button I believe, so we can ignore anything from previous instances of the app and terminate calls with the old instance when we see a new one.

* `feeds` - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
* `purpose` - Either `m.usermedia` or `m.screenshare` otherwise the feed should be ignored.

For instance:

Expand All @@ -123,88 +127,49 @@ For instance:
"@sfu-lon:matrix.org",
"@sfu-nyc:matrix.org",
],
"m.sources": [
"m.devices": [
{
"id": "qegwy64121wqw", // WebRTC MediaStream id
"purpose": "m.usermedia",
"name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
"device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
"audio": [
"device_id": "ASDUHDGFYUW", // Used to target to-device messages
"session_id": "GHKJFKLJLJ", // Used to resolve duplicate calls from a device
"feeds": [
{
"id": "zbhsbdhwe", // WebRTC MediaStreamTrack id
"settings": { // WebRTC MediaTrackSettings object
"channelCount": 2,
"sampleRate": 48000,
"m.maxbr": 32000, // Matrix-specific extension to advertise the max bitrate of this track
}
"purpose": "m.usermedia"
// TODO: Add tracks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What info do we need in here to describe tracks? Maybe just an array containing the track kind? That way we know whether or not to add am audio/video transceiver and properly negotiate the RTCPeerConnection when the user joins without an audio/video track?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was envisaging stealing the mediastream description straight of WebRTC. So the m.call would describe the constraints needed to participate in the call (as per #3401 (comment)), and the m.call.member's device's feeds would describe the reality of what it's broadcasting. And then toggles for mute state etc end up being #3291 style m.call.sdp_stream_metadata_changed events (or possibly updates to the m.call.member state event, at the risk of introducing unnecessary state churn).

// TODO: Available bitrates etc. should be listed here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should we describe the available audio/video streams that clients can request via the SFU datachannel?

},
],
"video": [
{
"id": "zbhsbdhzs",
"settings": {
"width": 1280,
"height": 720,
"facingMode": "user",
"frameRate": 30.0,
"m.maxbr": 512000,
}
},
{
"id": "zbhsbdhzx",
"settings": {
"width": 320,
"height": 240,
"facingMode": "user",
"frameRate": 15.0,
"m.maxbr": 64000,
}
},
],
"mosaic": {}, // for composited video streams?
},
{
"id": "suigv372y8378",
"name": "Screenshare", // optional
"purpose": "m.screenshare",
"device_id": "ASDUHDGFYUW",
"video": [
{
"id": "xbhsbdhzs",
"settings": {
"width": 3072,
"height": 1920,
"cursor": "moving",
"displaySurface": "monitor",
"frameRate": 30.0,
"m.maxbr": 768000,
}
},
"purpose": "m.screenshare"
// TODO: Add tracks
// TODO: Available bitrates etc. should be listed here
}
]
},
}
]
}
]
}
}
```

This builds on MSC #3077, which describes streams in `m.call.*` events via a `sdp_stream_metadata` field, but providing the full set of information needed for all devices in the room to know what streams are available in the group call without having to independently discover them from the SFU.
This builds on MSC #3077, which describes streams in `m.call.*` events via a `sdp_stream_metadata` field, but providing the full set of information needed for all devices in the room to know what feeds are available in the group call without having to independently discover them from the SFU.

It's acceptable to advertise rigid formats here rather than dynamically negotiating resolution, bitrate etc, as in a group call we should just pick plausible desirable formats rather than try to please everyone.
** TODO: Add tracks field **
** TODO: Add bitrate/format fields **

If a device loses connectivity, it is not particularly problematic that the membership data will be stale: all that will happen is that calls to the disconnected device will fail due to media or data-channel keepalive timeouts, and then subsequent attempts to call that device will fail. Therefore (unlike the earlier demos) we don't need to spot timeouts by constantly re-posting the state event.
Clients should do their best to ensure that calls in `m.call.member` state are removed when the member leaves the call. However, there will be cases where the device loses network connectivity, power, the application is forced closed, or it crashes. If the `m.call.member` state has stale device data the call setup will fail. Clients should re-attempt invites up to 3 times before giving up on calling a member.

### Call setup

Call setup then uses the normal `m.call.*` events, except they are sent over to-device messages to the relevant devices (encrypted via Olm). This means:
Copy link
Contributor

@bwindels bwindels Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the idea here is that clients should identify the sender (user_id and device_id) of the to_device through the sender_key of the olm-encrypted message to known which peer the message is from in a full-mesh group call? Perhaps it could be valuable to spell that out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the spec to include this now, we send the sender's device_id along in the m.call.invite event. But I think this is a better method? I haven't seen sender_key yet, could you point me to the docs on that? Also we're not using olm-encryption just yet so it might still be better to include the device_id field in the content? Not sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can read about sender_key here and here. The idea is that you query the keys for the given user (if not fetched already) and verify that the keys match the sender_key of the olm message you received. The advantage of doing it this way is that the user id and device id can almost not be spoofed, assuming you have marked the other device/user as trusted, either manually or through cross-signing.

Perhaps the current impl doesn't encrypt with olm yet, but does it make sense to spec that? Is there a good reason to offer a non-encrypted version of the signalling?


* When initiating a 1:1 call, the `m.call.invite` is sent to `*` devices of the intended target user.
* Once the user answers the call from the device, the sender should rescind the other pending to-device messages, ensuring that other devices don't get spammed about long-obsolete 1:1 calls. XXX: We will need a way to rescind pending to-device msgs.
* Subsequent candidates and other events are sent only to the device who answered.
* XXX: do we still need MSC2746's `party_id` and `m.call.select_answer`?
* We will need to include the `m.call_id` and room_id so that peers can map the call to the right room.
* However, especially for 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency. Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
* When initiating a 1:1 call, the `m.call.invite` is sent to the devices listed in `m.call.member` event's `m.devices` array using the `device_id` field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are using m.ring as the intent we don't have the recipient's device id. So how do we ring them? Do we send the m.call.invite as a regular message event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was assuming that m.ring will ring every device in the room

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. So send to-device messages to each of the users and use * for to target all the user's devices. I'll add a clarification for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the ringing happen by the mere fact that there is a non-terminated m.call state event with that intent? The to_device messages only start flowing once there are at least two m.call.members in the call, e.g. you've picked up already with a specific device.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A "1:1 call" does not seem to be well defined within the context of this MSC. I assume what is meant here is a m.ring call (that could have 2 or more participants). Can we please use less ambiguous terminology?

* `m.call.*` events sent via to-device messages should also include the following properties in their content:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we seem to have completely missed seq, needed to make up for todevice events having no intrinsic ordering otherwise: matrix-org/matrix-js-sdk@7f21f56

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a dumb question here: does it mean that the order of the To-Device messages is not guaranteed? I'm asking since I've processed the To-Device messages on the SFU under the assumption that they come at the same order in which they were sent by the client. If the order is not guaranteed, it may cause some interesting (undesired) effects, e.g. if the "Invite -> Hangup" sequence comes as "Hangup -> Invite" on a server, the end effect will not be what the user expects. Or Invite -> SelectAnswer as SelectAnswer -> Invite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not guaranteed and we should rely on the seq field

Copy link

@daniel-abramov daniel-abramov Dec 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not guaranteed and we should rely on the seq field

Oh, that's interesting. Should we document how to deal with it and what's the semantics? - I've noticed that a new commit has been pushed recently to add the seq to the request that says that it starts with 0 and gets incremented with each message. But where the counter is stored? I can imagine that it's a per-device counter (meaning that 2 different devices may generate the same seq)? And what happens when the overflow of the value occurs?

It also has some practical implications: how are we (as receivers) expected to handle it in a proper way? - I.e. imagine that we receive a "New ICE candidates message" on the SFU with a seq=15 while the previous message from the sender had seq=5. We probably don't want to handle the message with seq=15 right now if we have not yet received the previous 10 messages (since the current message with seq=15 may not even be useful by the moment we process the previous 10, or maybe it's related to the invite that has been sent in seq=10). This means that in order to handle a message with seq=15, we would need to buffer a couple more messages (messages that went before seq=15) before we take a decision on whether to handle it.

However, this poses certain questions, namely if we're communicating with 1000 devices (SFU use case), this means we would need to store the lastStoredSeq for 1000 users, the problem is that we don't really know when to release the counter for a particular user from memory (i.e. when to remove it from the map since we don't know in advance the pool of devices that would communicate with us and we may theoretically receive a message from any of them) which means that the memory usage would grow indefinitely and once the SFU is restarted, we'll have the counters lost.

Another issue is that the sender can attack the receiver by sending a message with seq=1 followed by a message with seq=99999999999 and then another with seq=99999999998 and another with seq=99999999997 and, knowing that the other side buffers them since it can't process them, it will send them until the other side gets killed due to the OOM.

* `conf_id` - The group call id listed in `m.call`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use a name other than conf_id? I think this is left over from the m.conf event which was renamed to m.call. So maybe group_call_id is better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, much prefer group_call_id

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if call_id wouldn't be enough since we might want to transition to this MSC for all calls and thus call_id seems sufficient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the to_device events sent between members of the group calls also have a call_id property, inherited from pre-3401 calls. We should probably address the existence of the other property and either plan a phase out or just use something less ambiguous like group_call_id.

It would also be good to use the same property name in both the m.call.member state event (now m.call_id) and to_device events (now conf_id).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo on top of the raised mentioned improvements, this proposal and eventual resulting spec should stick to the wording "call" even in text to clarify that "conference" is not another concept at work here.

* `dest_session_id` - The recipient's session id. Incoming messages with a `dest_session_id` that doesn't match your current session id should be discarded.
* In addition to the fields above `m.call.invite` events sent via to-device messages should include the following properties :
* `device_id` - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the `m.call.member` event has not been received yet.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EC also sends device_id, sender_session_id and dest_session_id for every toDevice event: https://github.com/matrix-org/matrix-js-sdk/blob/353d6bab47ab928aab089e897f5475942fcfa0ac/src/webrtc/call.ts#L2008-L2010

* `sender_session_id` - Like the `device_id` the `sender_session_id` is used by the opponent member to filter out messages unrelated to the sender's session even if the `m.call.member` event has not been received yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need a better way of explaining both the device_id and sender_session_id here.

* For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency. Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
* When initiating a group call, we need to decide which devices to actually talk to.
* If the client has no SFU configured, we try to use the `m.foci` in the `m.call` event.
* If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.
Expand Down