This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Clarify (room_id, event_id) global uniqueness #13701

Merged on Sep 2, 2022 (5 commits)
Changes from 4 commits
1 change: 1 addition & 0 deletions changelog.d/13701.doc
@@ -0,0 +1 @@
Clarify `(room_id, event_id)` global uniqueness and how we should scope our database schemas.
24 changes: 24 additions & 0 deletions docs/development/database_schema.md
@@ -191,3 +191,27 @@ There are three separate aspects to this:
flavour will be accepted by SQLite 3.22, but will give a column whose
default value is the **string** `"FALSE"` - which, when cast back to a boolean
in Python, evaluates to `True`.


## `event_id` uniqueness

In room versions `1` and `2` it's possible to end up with two events with the
same `event_id` (in the same or different rooms). After room version `3`, that
can only happen with a hash collision, which we basically hope will never
happen.

There are several places in Synapse and even Matrix APIs like [`GET
/_matrix/federation/v1/event/{eventId}`](https://spec.matrix.org/v1.1/server-server-api/#get_matrixfederationv1eventeventid)
where we assume that event IDs are globally unique.
Contributor

this seems rather sad considering they are very much not and it's not very hard to cause a conflict :/

Contributor Author

@MadLittleMods, Sep 2, 2022

We can hope for MSC2848 🙌


But hash collisions are still possible, and by treating event IDs as room
Contributor

I actually think hash collisions from sheer probability are not much of a justifiable problem (maybe SHA256 will get defeated one day I suppose...?)
SHA256 is 256-bit, so only once you have 2^128 events would you have 0.5 probability of having a collision. That's way more events than I think anyone will ever store.

I expect the main problem is probably intentional collisions (esp in v1 rooms), where namespacing events by room means that we don't let a bad actor interfere with any rooms they're not in.
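
As a rough check of the 2^128 figure above, the standard birthday-bound approximation can be evaluated directly. This is a minimal sketch and not part of the PR: the function names are hypothetical, and it assumes event IDs behave like uniformly random 256-bit values. Under that approximation, 2^128 events give a collision probability of about 0.39, and 0.5 is reached at roughly 1.18 × 2^128 events, so the order of magnitude quoted in the comment holds.

```python
import math


def collision_probability(n_events: float, hash_bits: int = 256) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(bits + 1)).

    Assumes hash outputs are uniformly random (an idealisation of SHA-256).
    """
    exponent = -(n_events ** 2) / (2.0 ** (hash_bits + 1))
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


def events_for_probability(p: float, hash_bits: int = 256) -> float:
    """Invert the approximation: number of events needed to reach probability p."""
    return math.sqrt(-2.0 * math.log1p(-p) * 2.0 ** hash_bits)


if __name__ == "__main__":
    print(collision_probability(2 ** 128))          # chance at 2^128 events
    print(collision_probability(10 ** 12))          # chance at a trillion events
    print(events_for_probability(0.5) / 2 ** 128)   # multiple of 2^128 needed for p = 0.5
```

The sketch only covers accidental collisions. It says nothing about deliberate collisions in room versions 1 and 2, which is the concern raised in the paragraph above.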

Contributor Author

> maybe SHA256 will get defeated one day I suppose...?

🤷 Probably, maybe

The rest of my reply is just linking stuff from my own curiosity:

SHA-1 attack, https://github.blog/2017-03-20-sha-1-collision-detection-on-github-com/

Other reading:

> I expect the main problem is probably intentional collisions (esp in v1 rooms), where namespacing events by room means that we don't let a bad actor interfere with any rooms they're not in.

https://spec.matrix.org/v1.1/rooms/v3/#event-ids

Member

I'm a bit confused about where we've ended up on this thread: are hash collisions (feasibly) possible or not?

Room v1 and v2 have bigger problems than event-id clashes between rooms. The solution to that is to stop using v1 and v2 rooms, not to arrange the entire database schema and matrix API around a half-assed fix to it.

Contributor Author

> are hash collisions (feasibly) possible or not?

Probably not feasible but I wouldn't rule it out one day.

> [...] not to arrange the entire database schema and matrix API around a half-assed fix to it.

I'm confused by this. Do we prefer (room_id, event_id) or not?

Member

> Do we prefer (room_id, event_id) or not?

That's the entire discussion here, and I don't think we have a clear conclusion. Personally, I don't really see the point in including room_id in the constraint, but mostly I'd rather we have a discussion on it than just merge a PR which takes one particular view, and justifies it using questionable arguments.

  • this PR said: "we should prefer (room_id, event_id) because of hash collisions"
  • @reivilibre's review appeared to say "actually, hash collisions are infeasible"
  • And yet the PR is still merged, saying that the reason to prefer (room_id, event_id) is to avoid hash collisions.

Contributor Author

@MadLittleMods, Sep 14, 2022

It was discussed in the backend chapter sync as well which also brought up #13771.


>   • @reivilibre's review appeared to say "actually, hash collisions are infeasible"
>   • And yet the PR is still merged, saying that the reason to prefer (room_id, event_id) is to avoid hash collisions.

This PR captures the tribal knowledge you mentioned in:

> Hash collisions are possible, and by treating event IDs as room scoped, we could reduce the possibility of a hash collision.
>
> -- @richvdh, #13589 (comment)

@reivilibre's number investigation is good enough to rule out a client and server running into a collision by sheer chance. I'm less convinced there won't be a way to exploit things in the future (targeted attack), but we can update this part of the doc to not call it out as much.

Member

@richvdh, Sep 18, 2022

well, we probably need to discuss this further when I'm back from leave. #12892 moves in exactly the opposite direction to that suggested here.

scoped, we can reduce the possibility of a hash collision. When scoping
`event_id` in the database schema, it should be also accompanied by `room_id`
(`PRIMARY KEY (room_id, event_id)`) and lookups should be done through the pair
`(room_id, event_id)`.
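
A minimal sketch of what room-scoped storage and lookup could look like, using Python's built-in `sqlite3` module. This is illustrative only: the table name `example_event_data`, its `body` column, the helper functions, and the room and event IDs are all hypothetical and are not taken from Synapse's actual schema.

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical room-scoped table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE example_event_data (
            room_id TEXT NOT NULL,
            event_id TEXT NOT NULL,
            body TEXT NOT NULL,
            PRIMARY KEY (room_id, event_id)
        )
        """
    )
    return conn


def store_event(conn: sqlite3.Connection, room_id: str, event_id: str, body: str) -> None:
    """Insert a row; uniqueness is enforced on the (room_id, event_id) pair."""
    conn.execute(
        "INSERT INTO example_event_data (room_id, event_id, body) VALUES (?, ?, ?)",
        (room_id, event_id, body),
    )


def get_event_body(conn: sqlite3.Connection, room_id: str, event_id: str) -> Optional[str]:
    """Look up by the pair, never by event_id alone."""
    row = conn.execute(
        "SELECT body FROM example_event_data WHERE room_id = ? AND event_id = ?",
        (room_id, event_id),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
# The same event_id in two different rooms does not conflict.
store_event(conn, "!a:example.org", "$same_id", "from room a")
store_event(conn, "!b:example.org", "$same_id", "from room b")
```

With this layout, a clashing `event_id` in one room cannot overwrite or shadow a row belonging to another room. A second insert of the same pair raises `sqlite3.IntegrityError`.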

There has been a lot of debate on this in places like
https://github.com/matrix-org/matrix-spec-proposals/issues/2779 and
[MSC2848](https://github.com/matrix-org/matrix-spec-proposals/pull/2848) which
has no resolution yet (as of 2022-09-01).