
Fix 500 error on /messages when we accumulate more than 5 backward extremities #11027

1 change: 1 addition & 0 deletions changelog.d/11027.bugfix
@@ -0,0 +1 @@
+ Fix 500 error on `/messages` when the server accumulates more than 5 backwards extremities at a given depth for a room.
28 changes: 15 additions & 13 deletions synapse/handlers/federation.py
@@ -157,7 +157,7 @@ async def _maybe_backfill_inner(
)

if not oldest_events_with_depth and not insertion_events_to_be_backfilled:
logger.debug("Not backfilling as no extremeties found.")
logger.info("Not backfilling as no extremeties found.")
return False

# We only want to paginate if we can actually see the events we'll get,
@@ -240,28 +240,30 @@ async def _maybe_backfill_inner(
)
return False

- logger.debug(
-     "room_id: %s, backfill: current_depth: %s, max_depth: %s, extrems: %s",
-     room_id,
-     current_depth,
-     max_depth,
-     sorted_extremeties_tuple,
- )

# We ignore extremities that have a greater depth than our current depth
# as:
# 1. we don't really care about getting events that have happened
- #    before our current position; and
+ #    after our current position; and
Comment from @MadLittleMods (Contributor, Author), Oct 8, 2021:
I've read this over many times. I am pretty sure this is a typo or the logic below is wrong: `t for t in sorted_extremeties_tuple if int(t[1]) <= current_depth`

Let's just assume the code is correct and is the source of truth. The code also matches the general summary above: "We ignore extremities that have a greater depth than our current depth".

Depth is incremented for new messages.

Say our `current_depth` is 1 and the extremities are at depth 10: those extremities occurred after our current place, and the logic below will ignore any extremities at a depth greater than 1.

Therefore, "we don't really care about getting events that have happened after our current position"

# 2. we have likely previously tried and failed to backfill from that
# extremity, so to avoid getting "stuck" requesting the same
# backfill repeatedly we drop those extremities.
filtered_sorted_extremeties_tuple = [
t for t in sorted_extremeties_tuple if int(t[1]) <= current_depth
]

+ logger.debug(
+     "room_id: %s, backfill: current_depth: %s, limit: %s, max_depth: %s, extrems: %s filtered_sorted_extremeties_tuple: %s",
+     room_id,
+     current_depth,
+     limit,
+     max_depth,
+     sorted_extremeties_tuple,
+     filtered_sorted_extremeties_tuple,
Comment from @MadLittleMods (Contributor, Author):
Moved this below because it's useful to see what `filtered_sorted_extremeties_tuple` is.

+ )

# However, we need to check that the filtered extremities are non-empty.
# If they are empty then either we can a) bail or b) still attempt to
- # backill. We opt to try backfilling anyway just in case we do get
+ # backfill. We opt to try backfilling anyway just in case we do get
# relevant events.
if filtered_sorted_extremeties_tuple:
sorted_extremeties_tuple = filtered_sorted_extremeties_tuple
@@ -318,7 +320,7 @@ async def try_backfill(domains: List[str]) -> bool:
for dom in domains:
try:
await self._federation_event_handler.backfill(
-     dom, room_id, limit=100, extremities=extremities
+     dom, room_id, limit=100, extremities=extremities.keys()
)
# If this succeeded then we probably already have the
# appropriate stuff.
@@ -391,7 +393,7 @@ async def try_backfill(domains: List[str]) -> bool:
for key, state_dict in states.items()
}

- for e_id, _ in sorted_extremeties_tuple:
+ for e_id in event_ids:
Comment from @MadLittleMods (Contributor, Author), Oct 8, 2021:
The big bug fix is here! I am iterating over `event_ids`, which matches what we use to create `states` and then look up in the loop.

Previously, if we had, say, 8 backward extremities (`sorted_extremeties_tuple`), we would slice off the first 5 into `extremities` so that we don't try to backfill too many in the request to the remote federated server:

# We don't want to specify too many extremities as it causes the backfill
# request URI to be too long.
extremities = dict(sorted_extremeties_tuple[:5])

Then we fetch the state for those 5 extremities and create the `states` dictionary.

But then later, we looped over `sorted_extremeties_tuple`, which is the full list of 8 backward extremities, and we hadn't grabbed the state for those extra 3 extremities. So we get a `KeyError` trying to look up events we didn't expect.
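
A minimal sketch of the failure mode, with invented event IDs (here the keys of `extremities` stand in for `event_ids`):

```python
# Hypothetical (event_id, depth) tuples: 8 backward extremities.
sorted_extremeties_tuple = [(f"$event_{i}", i) for i in range(1, 9)]

# Slice off the first 5 so the backfill request URI stays short.
extremities = dict(sorted_extremeties_tuple[:5])

# State is only fetched for those 5 extremities.
states = {e_id: {"state": "..."} for e_id in extremities}

# Buggy loop: iterates over all 8 extremities.
try:
    for e_id, _ in sorted_extremeties_tuple:
        _ = states[e_id]
except KeyError as err:
    print(f"KeyError: {err}")  # KeyError: '$event_6' -- the 500 error

# Fixed loop: iterate over the same event IDs used to build `states`.
for e_id in extremities:
    _ = states[e_id]  # always present
```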

likely_extremeties_domains = get_domains_from_state(states[e_id])
Comment from @MadLittleMods (Contributor, Author), Oct 8, 2021:
A general question on the MSC2716 front: I ran into this error because each historical batch creates a backward extremity that will never be able to resolve, because it points to a fake `prev_event`.

Is it okay to have this build-up?

My thinking is that this would be problematic: we will have thousands of batches per room when backfilling a big room for Gitter, and so will have to deal with thousands of backward extremities here.

Also, because we use `get_oldest_event_ids_with_depth_in_room`, the oldest backward extremities will always be historical events. Assuming we're paginating from the bottom (where `current_depth` > historical extremity depth), we will always pick the 5 unresolvable events in the room to backfill from. This would stop us from backfilling any other events in the room.

I feel like I may be missing something though.
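
A rough sketch of that concern (depths and data are made up; `extremities_by_depth` stands in for what `get_oldest_event_ids_with_depth_in_room` would return):

```python
# Hypothetical backward extremities, oldest (lowest depth) first.
# MSC2716 historical batches leave unresolvable extremities at low depths.
extremities_by_depth = [
    ("$historical_1", 1),
    ("$historical_2", 2),
    ("$historical_3", 3),
    ("$historical_4", 4),
    ("$historical_5", 5),
    ("$real_gap", 50),  # a genuine gap we actually want to backfill
]

current_depth = 100  # paginating from the bottom of the room

# Same depth filter and 5-item slice as above: the five unresolvable
# historical extremities always win, so $real_gap is never attempted.
eligible = [t for t in extremities_by_depth if int(t[1]) <= current_depth]
chosen = dict(eligible[:5])
print(list(chosen))  # only $historical_1..5; backfill gets stuck on them
```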

Reply from a Member:
Oh bleurgh, that's true. Ideally we'd know that those are fake prev events and not store them? Or something like that?

Reply from @MadLittleMods (Contributor, Author):
Tracking this separately via #11091


success = await try_backfill(
1 change: 1 addition & 0 deletions synapse/storage/databases/main/user_directory.py
@@ -383,6 +383,7 @@ async def should_include_local_user_in_dir(self, user: str) -> bool:
"""Certain classes of local user are omitted from the user directory.
Is this user one of them?
"""

# App service users aren't usually contactable, so exclude them.
if self.get_if_app_services_interested_in_user(user):
# TODO we might want to make this configurable for each app service