
Storing timelines in the state store #138

Closed
poljar opened this issue Feb 10, 2021 · 12 comments

Comments

@poljar
Contributor

poljar commented Feb 10, 2021

The state store used to be able to store a limited amount of events from the timeline. While the feature seemed to be a bit flaky and, due to the snapshot-based state storage, quite limited, it was generally useful.

We should reintroduce this feature in the new state store.

Since the new state store keeps data out of memory, we can store the whole timeline in it without a bad conscience. The state store trait should be expanded to allow this.

For some basic functionality, we'll need:

  1. The ability to store events.
  2. The ability to get an event by its event ID.
  3. The ability to get the previous/next events, given an event ID.

How gaps in the timeline should be handled needs to be figured out.
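
Gaps aside, here is a rough sketch of what such an expanded trait could look like for the three points above; everything in it (trait name, method names, signatures, events as plain strings) is hypothetical and not the SDK's actual API:

// Hypothetical sketch only; the real trait lives in matrix-sdk-base and the
// event and error types are simplified to strings and boxed errors here.
use std::error::Error;

type StoreResult<T> = Result<T, Box<dyn Error>>;

trait TimelineStore {
    /// 1. Store a batch of timeline events for a room.
    fn save_timeline_events(&self, room_id: &str, events: Vec<String>) -> StoreResult<()>;

    /// 2. Fetch a single event by its event ID.
    fn get_event(&self, room_id: &str, event_id: &str) -> StoreResult<Option<String>>;

    /// 3. Fetch up to `limit` events before or after the given event ID.
    fn get_neighboring_events(
        &self,
        room_id: &str,
        event_id: &str,
        backwards: bool,
        limit: usize,
    ) -> StoreResult<Vec<String>>;
}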

Each sync response will contain a slice of a timeline. The timeline might be a continuation of the previous sync response, in which case the limited parameter will be set to false. If the timeline doesn't contain all events that happened between two syncs, the limited parameter will be set to true and a gap exists in the timeline. More info can be found over here.

We'll likely need to remember which event belongs to which slice and figure out how to merge slices when we request historic events using the /rooms/$room_id/messages endpoint.

Our room_messages() and similar methods that fetch events from the server should use events from the store if they are available and fill in from the server if not.
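
As a sketch of that store-first behaviour (the TimelineCache trait, the function and all types below are made up for illustration, with events reduced to plain strings):

// Hypothetical store-first flow; all names and types are made up for the sketch.
trait TimelineCache {
    /// Events already persisted before `from_event`, newest first.
    fn cached_events_before(&self, room_id: &str, from_event: &str, limit: usize) -> Vec<String>;
    /// Persist a freshly fetched slice so future calls can serve it locally.
    fn save_slice(&self, room_id: &str, events: &[String]);
}

fn room_messages_cached(
    store: &dyn TimelineCache,
    fetch_from_server: impl Fn(&str, &str, usize) -> Vec<String>,
    room_id: &str,
    from_event: &str,
    limit: usize,
) -> Vec<String> {
    // 1. Serve as much as possible from the local store.
    let mut events = store.cached_events_before(room_id, from_event, limit);

    // 2. Fill the remainder from the server (e.g. /rooms/$room_id/messages,
    //    using whatever token maps to that point) and persist the newly
    //    fetched slice so the stored timeline keeps growing.
    if events.len() < limit {
        let from = events.last().map(String::as_str).unwrap_or(from_event);
        let fetched = fetch_from_server(room_id, from, limit - events.len());
        store.save_slice(room_id, &fetched);
        events.extend(fetched);
    }

    events
}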

@jsparber
Contributor

How should we store the timeline? We will need to preserve the ordering we receive from the server, so we can't use a Sled::Tree (at least not directly).

@poljar
Contributor Author

poljar commented Feb 11, 2021

An additional mapping/tree from an index (u128) to the event ID would work, no? Perhaps @DevinR528 has opinions about this. Though that's an implementation detail of the store, e.g. a SQL based store would just use an ORDER BY in its query.

@DevinR528
Contributor

So my thinking was to store the timeline slices in order (the events within a slice are already ordered, depending on Direction::Forward/Backwards), so if we know the prev_batch token of a slice of events we know where it goes. This has some difficulties right off the bat because of sled:

  • In general, we start in the "middle" of a timeline, but sled can't handle negative numbers (in terms of sorting, keys are compared by their binary representation, so negative numbers sort like big numbers...). To fix this I wondered about using u128::MAX / 2 for the initial sync of a room (we would need to ensure that the stored timeline for this room is empty somehow).
    • I wonder about SyncMap<RoomId, Tree>?
  • All events from /sync or from /messages get keys like prev_batch + index for /sync and end / start + index for /messages.
    • We always use the token of the oldest event, so for dir=b it's the end and for dir=f it would be start.
  • We also keep a Tree of index -> token (prev_batch, start, end); this sorts the chunks and allows us to iterate over all events.
    • There would also be prev_batch + index/count -> EventId
    • And EventId -> Event
pub struct SledStore {
    // ...

    /// Order the chunks by their `prev_batch`.
    ///
    /// We now have slices of the timeline ordered by their position in the timeline. Since the
    /// events come in a predictable order, once we find the prev_batch ID's location we know
    /// where the events go.
    id_prevbatch: Tree,
    /// Since the `prev_batch` token can get long and we want an orderable key, we map it to a
    /// `u64`. This is also known as the prev_batch ID.
    prevbatch_id: Tree,
    /// This `Tree` consists of keys of `prev_batch ID + count` and values of `EventId` bytes.
    ///
    /// `count` is the index of the MessageEvent in the original chunk or sync response.
    prevbatchid_eventid: Tree,
    /// The reverse mapping: `EventId -> prev_batch ID`.
    eventid_prevbatchid: Tree,
    /// This `Tree` is `RoomId + EventId -> Event bytes`
    timeline: Tree,
}

This would make it so we can fill in /messages responses by their from/to tokens. Answering the next/previous question for an EventId can also be done from the prevbatchid_eventid and eventid_prevbatchid Trees.
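
To illustrate the ordering concern from the bullets above: sled compares keys byte-wise, so a numeric position only sorts correctly when it's encoded big-endian, and starting at the midpoint of the range leaves room to grow in both directions. A tiny sketch, not actual store code:

// Sketch of the "start in the middle" key trick; not actual store code.
// sled orders keys lexicographically by their bytes, so a big-endian
// encoding makes numeric positions sort in numeric order.
fn position_key(room_id: &str, position: u128) -> Vec<u8> {
    let mut key = room_id.as_bytes().to_vec();
    key.extend_from_slice(&position.to_be_bytes());
    key
}

fn main() {
    // The first slice of a room starts at the midpoint so that older events,
    // fetched later via /messages, can still get smaller keys.
    let origin = u128::MAX / 2;

    let older = position_key("!room:example.org", origin - 1);
    let newer = position_key("!room:example.org", origin + 1);

    // Byte-wise comparison now matches timeline order.
    assert!(older < newer);
}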

@DevinR528
Contributor

We would still have to guard against duplicate events and overlapping or underlapping chunks, which I'm not entirely sure how to handle.

@poljar
Contributor Author

poljar commented Feb 12, 2021

> So my thinking was to store the timeline slices in order (the events within a slice are already ordered, depending on Direction::Forward/Backwards), so if we know the prev_batch token of a slice of events we know where it goes. This has some difficulties right off the bat because of sled:
>
> In general, we start in the "middle" of a timeline, but sled can't handle negative numbers (in terms of sorting, keys are compared by their binary representation, so negative numbers sort like big numbers...). To fix this I wondered about using u128::MAX / 2 for the initial sync of a room (we would need to ensure that the stored timeline for this room is empty somehow).

A u128 was suggested for that reason, I think nheko does the int::MAX / 2 trick as well.

> I wonder about SyncMap<RoomId, Tree>?
>
> All events from /sync or from /messages get keys like prev_batch + index for /sync and end / start + index for /messages.
>
> We always use the token of the oldest event, so for dir=b it's the end and for dir=f it would be start.
>
> We also keep a Tree of index -> token (prev_batch, start, end); this sorts the chunks and allows us to iterate over all events.
>
> There would also be prev_batch + index/count -> EventId
> And EventId -> Event

This is looking a bit too much into what sled should be doing; let's first agree on a more abstract level what we need. Implementation details for specific store implementations can come later. Or, in other words, let's add a bit of structure to those strings and numbers.

Looking at the sync spec, when we sync we get a slice of the timeline that exists between two tokens.

struct TimelineSlice {
    start: String,
    end: String,
    events: Vec<EventId>,
}

The same thing applies when we call the /messages API endpoint. Now when we call /sync or /messages again, we'll get another TimelineSlice where one of the tokens will match either the start or the end of our first one, so we can merge the two slices and get a new continuous one. Of course this requires a non-gappy sync; if there's a gap we'll have two slices, which we might later merge using the /messages endpoint.

Storing the slice like this isn't ideal, since the events vector would grow indefinitely as we merge slices. It also makes it necessary for us to be able to reach into the slice to return a sub-slice, for example if the caller requests 10 events but our slice contains 100. Using the event_id as a start/end token makes sense in this case. So, to come back to what should be stored:

struct SliceId(u128);
struct EventIndex(u128);

struct TimelineSlice {
    slice_id: SliceId,
    start: String,
    end: String,
}

struct EventOwnerMap {
    slice_map: BTreeMap<SliceId, BTreeMap<EventIndex, EventId>>,
    event_map: BTreeMap<EventId, SliceId>,
}

struct Timeline {
    slices: BTreeMap<SliceId, TimelineSlice>,
    // This should probably be the prev_batch token so we
    // don't need two maps and in the general case, where
    // we walk backwards in the timeline, we won't need to
    // fetch the neighboring slice.
    token_slice_map: BTreeMap<String, SliceId>,
    events: EventOwnerMap,
}

So now we can get the slices in order, and the events belonging to a slice in order. When we put a new slice from a sync into the timeline, we get the last slice, check if the tokens match, and if they do, populate the existing slice with the new events and update the end token.

When we, on the other hand, receive events from the /messages endpoint, we'll want to access the slice using a token and merge the slices that way.
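
A minimal sketch of that sync-path merge, using stripped-down versions of the structs above (plain strings instead of event IDs, no EventOwnerMap bookkeeping):

// Simplified sketch of the "extend the last slice if the tokens match" step;
// event IDs are plain strings and the EventOwnerMap bookkeeping is omitted.
use std::collections::BTreeMap;

struct TimelineSlice {
    start: String,
    end: String,
    events: Vec<String>,
}

#[derive(Default)]
struct Timeline {
    slices: BTreeMap<u128, TimelineSlice>,
}

impl Timeline {
    /// Merge a freshly synced slice: if its start token matches the end token
    /// of the newest stored slice, extend that slice and move its end token
    /// forward; otherwise (gappy sync) store it as a separate slice.
    fn add_slice(&mut self, new: TimelineSlice) {
        if let Some(last) = self.slices.values_mut().next_back() {
            if last.end == new.start {
                last.events.extend(new.events);
                last.end = new.end;
                return;
            }
        }

        // Start new rooms in the middle of the index range, as discussed above.
        let next_id = self.slices.keys().next_back().map_or(u128::MAX / 2, |id| id + 1);
        self.slices.insert(next_id, new);
    }
}

The /messages case would work the same way, except the slice to extend is looked up via token_slice_map instead of being the newest one.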

Another thing we might need/want to store is the token -> slice -> event index position. E.g. if someone stores a prev_batch token and we merged slices, they'll want to fetch events from the middle of the slice and we'll have no idea where that token points to, since we only store the start/end tokens.

Hopefully this makes some sense, and also somewhat points to how sled should store things. Of course, if this has some flaw I'm missing, please let me know.

@DevinR528
Contributor

DevinR528 commented Feb 12, 2021

These would be structs that are the Timeline equivalent of StateChanges, right? So they would not stick around for longer than a call to /sync or /messages? Or do they need to be recreated by the StateStore on startup (or at some point when they are needed)?

> For example if the caller requests 10 events but our slice contains 100. Using the event_id as a start/end token makes sense in this case.

struct EventOwnerMap {
    slice_map: BTreeMap<SliceId, BTreeMap<EventIndex, EventId>>,
    event_map: BTreeMap<EventId, SliceId>,
    // Now we can go from an EventId to a sub-slice of its parent SliceId, I don't think we could before?
    event_index: BTreeMap<EventId, EventIndex>,
}

Just to make sure, the event_map would be like the following, right?

BTreeMap {
    "$eventid1:foo" -> SliceId(1),
    "$eventid12:foo" -> SliceId(1),
    "$eventid123:foo" -> SliceId(2),
    "$eventid1234:foo" -> SliceId(2),
}

EventIds would point to the chunk (SliceId) they came from?

I think I'm following along pretty well. We'll see soon enough 😄!

@poljar
Contributor Author

poljar commented Mar 1, 2021

> These would be structs that are the Timeline equivalent of StateChanges, right? So they would not stick around for longer than a call to /sync or /messages? Or do they need to be recreated by the StateStore on startup (or at some point when they are needed)?

I think that depends on the store; the memory store will probably keep those as-is. I'm not completely sure yet whether the store should create those and the equivalent of StateChanges would just push the timeline as we get it from the sync response, or whether the store should receive those. I'm leaning towards the lib creating those as an equivalent to StateChanges, since that would make it easier to write additional stores.

> For example if the caller requests 10 events but our slice contains 100. Using the event_id as a start/end token makes sense in this case.
>
> struct EventOwnerMap {
>     slice_map: BTreeMap<SliceId, BTreeMap<EventIndex, EventId>>,
>     event_map: BTreeMap<EventId, SliceId>,
>     // Now we can go from an EventId to a sub-slice of its parent SliceId, I don't think we could before?
>     event_index: BTreeMap<EventId, EventIndex>,
> }
>
> Just to make sure, the event_map would be like the following, right?
>
> BTreeMap {
>     "$eventid1:foo" -> SliceId(1),
>     "$eventid12:foo" -> SliceId(1),
>     "$eventid123:foo" -> SliceId(2),
>     "$eventid1234:foo" -> SliceId(2),
> }
>
> EventIds would point to the chunk (SliceId) they came from?

Yeah, that sounds sensible. It might need to store the position inside the slice as well, since you otherwise don't know where to continue returning events from.

Sorry I forgot to respond to this.
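
To show why the position matters, here is a small sketch of a lookup that uses the three maps to resume from the middle of a slice; the types are simplified (plain strings for event IDs) and none of this is actual SDK code:

// Simplified lookup sketch; event IDs are plain strings, nothing here is SDK code.
use std::collections::BTreeMap;

type SliceId = u128;
type EventIndex = u128;

#[derive(Default)]
struct EventOwnerMap {
    slice_map: BTreeMap<SliceId, BTreeMap<EventIndex, String>>,
    event_map: BTreeMap<String, SliceId>,
    event_index: BTreeMap<String, EventIndex>,
}

impl EventOwnerMap {
    /// Return up to `limit` event IDs that come after `event_id` within its slice.
    /// Continuing into the neighboring slice would use the Timeline's slice map.
    fn events_after(&self, event_id: &str, limit: usize) -> Vec<String> {
        let Some(slice_id) = self.event_map.get(event_id) else { return Vec::new() };
        let Some(index) = self.event_index.get(event_id) else { return Vec::new() };
        let Some(slice) = self.slice_map.get(slice_id) else { return Vec::new() };

        // Knowing the event's index lets us resume from the middle of the slice.
        slice
            .range((index + 1)..)
            .take(limit)
            .map(|(_, id)| id.clone())
            .collect()
    }
}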

@jsparber
Contributor

jsparber commented Apr 7, 2021

As a side note: there is also https://matrix.org/docs/spec/client_server/r0.6.1#get-matrix-client-r0-rooms-roomid-context-eventid which is similar to /messages but instead of a start token it takes an event and returns the surrounding events and the start/end tokens needed for /messages. This api shouldn't really be needed in our case but it may be useful for some edge cases.
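
For reference, the response of that endpoint roughly has the following shape (the field names follow the spec; the event payloads are left as opaque JSON strings in this sketch):

// Rough shape of the /context/{eventId} response described in the spec;
// the event payloads are left as raw JSON strings for this sketch.
struct EventContextResponse {
    /// Token that can be used to paginate backwards with /messages (dir=b).
    start: String,
    /// Token that can be used to paginate forwards with /messages (dir=f).
    end: String,
    /// Events that happened just before the requested event.
    events_before: Vec<String>,
    /// The requested event itself.
    event: String,
    /// Events that happened just after the requested event.
    events_after: Vec<String>,
    /// State of the room at the last returned event.
    state: Vec<String>,
}

Since the start/end tokens are exactly what /messages needs, this could be used to anchor a new slice around an arbitrary event.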

@poljar
Contributor Author

poljar commented Apr 7, 2021

That can be handled in sled with the get_lt and get_gt methods.
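
A tiny demonstration of that, assuming the big-endian position keys sketched earlier and the sled crate as a dependency:

// Tiny demo of sled's get_lt/get_gt on big-endian position keys; assumes the
// `sled` crate and uses a throwaway temporary database.
fn main() -> sled::Result<()> {
    let db = sled::Config::new().temporary(true).open()?;
    let tree = db.open_tree("timeline")?;

    // Big-endian positions so that byte order matches numeric order.
    tree.insert(10u64.to_be_bytes(), "$older:example.org")?;
    tree.insert(20u64.to_be_bytes(), "$middle:example.org")?;
    tree.insert(30u64.to_be_bytes(), "$newer:example.org")?;

    // The entries just before and just after position 20.
    let previous = tree.get_lt(20u64.to_be_bytes())?;
    let next = tree.get_gt(20u64.to_be_bytes())?;

    assert_eq!(previous.map(|(_, v)| v), Some(sled::IVec::from("$older:example.org")));
    assert_eq!(next.map(|(_, v)| v), Some(sled::IVec::from("$newer:example.org")));

    Ok(())
}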

@jsparber
Contributor

jsparber commented Apr 7, 2021

> This api shouldn't really be needed in our case but it may be useful for some edge cases.

I missed a "not", sorry.

@poljar
Contributor Author

poljar commented Apr 7, 2021

It will certainly be needed: if clients want a search experience similar to what Element offers, E2EE search will use this.

@poljar
Contributor Author

poljar commented Apr 25, 2022

Closed by #486.

@poljar poljar closed this as completed Apr 25, 2022