This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Only try to backfill event if we haven't tried before recently (exponential backoff) #13635

Merged
Changes from all commits: 78 commits
e0d7fab
Keep track when we tried to backfill an event
MadLittleMods Aug 23, 2022
b8d55d3
Record in some fail spots
MadLittleMods Aug 25, 2022
f63d823
Merge branch 'develop' into madlittlemods/keep-track-when-we-tried-to…
MadLittleMods Aug 25, 2022
bec26e2
Record and clear attempts
MadLittleMods Aug 25, 2022
fee37c3
Add changelog
MadLittleMods Aug 25, 2022
d1290be
Remove from when spam checker fails
MadLittleMods Aug 25, 2022
f9119d0
Custom upsert to increment
MadLittleMods Aug 25, 2022
f5c6fe7
Fix lints
MadLittleMods Aug 25, 2022
16ebec0
Remove extra whitespace
MadLittleMods Aug 25, 2022
ce07aa1
Move to correct folder
MadLittleMods Aug 25, 2022
5811ba1
Set the version back
MadLittleMods Aug 25, 2022
cf2b093
Fix `TypeError: not all arguments converted during string formatting`
MadLittleMods Aug 25, 2022
cbb4194
Add test to make sure failed backfill attempts are recorded
MadLittleMods Aug 26, 2022
621c5d3
Clean up test
MadLittleMods Aug 26, 2022
75c07bb
Fix comments
MadLittleMods Aug 26, 2022
783cce5
Add test for clearing backfill attempts
MadLittleMods Aug 26, 2022
54ef84b
Maybe a better comment
MadLittleMods Aug 26, 2022
7bf3e7f
WIP: Just working on the query
MadLittleMods Aug 26, 2022
37ff009
Move comment to where it matters
MadLittleMods Aug 26, 2022
a58d191
Silly graph pt 1
MadLittleMods Aug 26, 2022
f127ad1
Silly graph pt 2
MadLittleMods Aug 26, 2022
18abbf4
Tests running (not working)
MadLittleMods Aug 27, 2022
23310f5
Passing test
MadLittleMods Aug 27, 2022
64e01d8
Add test for A and B
MadLittleMods Aug 27, 2022
47bac25
Add tests for backfill attempts
MadLittleMods Aug 27, 2022
2ebed9d
Remove `GROUP BY backward_extrem.event_id` (seems unnecessary)
MadLittleMods Aug 27, 2022
60b3b92
Clarify why that much time
MadLittleMods Aug 27, 2022
e9f603d
Label ? slot
MadLittleMods Aug 27, 2022
a8f1464
Better explanation
MadLittleMods Aug 27, 2022
bbd1c94
Add changelog
MadLittleMods Aug 27, 2022
dd1db25
Fix lints
MadLittleMods Aug 27, 2022
c583eef
Update docstring
MadLittleMods Aug 27, 2022
ea4a3ad
Apply same changes to `get_insertion_event_backward_extremities_in_room`
MadLittleMods Aug 27, 2022
f495752
Use power and capitalize AS
MadLittleMods Aug 27, 2022
f2061b9
Use SQLite compatible power of 2 (left shift)
MadLittleMods Aug 31, 2022
e4192d7
Update table name with "failed" and include room_id in the primary key
MadLittleMods Aug 31, 2022
7a44932
Rename to record_event_failed_backfill_attempt
MadLittleMods Aug 31, 2022
86d98ca
Merge branch 'develop' into madlittlemods/keep-track-when-we-tried-to…
MadLittleMods Aug 31, 2022
29f584e
Merge branch 'madlittlemods/keep-track-when-we-tried-to-backfill-an-e…
MadLittleMods Aug 31, 2022
506a8dd
Changes after merging madlittlemods/keep-track-when-we-tried-to-backf…
MadLittleMods Aug 31, 2022
361ce5c
Use compatible least/min on each db platform
MadLittleMods Aug 31, 2022
b09d8a2
Fix SQLite no such column error when comparing table to null
MadLittleMods Aug 31, 2022
965d142
Add comment about how these are sorted by depth now
MadLittleMods Aug 31, 2022
267777f
Apply same least compatiblity to insertion event extremities
MadLittleMods Aug 31, 2022
d0cd42a
Fix lints
MadLittleMods Aug 31, 2022
3d9f625
Try fix ambiguous column (remove unused table)
MadLittleMods Sep 1, 2022
33a3c64
Fix ambiguous column
MadLittleMods Sep 1, 2022
6736d10
Add tests for get_insertion_event_backward_extremities_in_room
MadLittleMods Sep 1, 2022
6eba1d4
Fix up test descriptions
MadLittleMods Sep 1, 2022
1464101
Add _unsafe_to_upsert_tables check
MadLittleMods Sep 1, 2022
71c7738
Add foreign key references
MadLittleMods Sep 1, 2022
df8c76d
Merge branch 'develop' into madlittlemods/keep-track-when-we-tried-to…
MadLittleMods Sep 1, 2022
d45b078
Remove reference to event that won't be in the events table
MadLittleMods Sep 1, 2022
c939422
Merge branch 'madlittlemods/keep-track-when-we-tried-to-backfill-an-e…
MadLittleMods Sep 1, 2022
599e212
Fix approximate typo
MadLittleMods Sep 1, 2022
bc8046b
Clarify what depth sort
MadLittleMods Sep 1, 2022
ea08006
Fix typos
MadLittleMods Sep 1, 2022
9a85bb4
Normal is not obvious
MadLittleMods Sep 1, 2022
7204cce
Fix left-shift math
MadLittleMods Sep 1, 2022
8f214b1
Fix foreign key constraint
MadLittleMods Sep 2, 2022
33ad64e
Merge branch 'develop' into madlittlemods/keep-track-when-we-tried-to…
MadLittleMods Sep 9, 2022
63bec99
Remove emulated upsert code (all of our dbs now support it)
MadLittleMods Sep 9, 2022
31d7502
Table rename `event_failed_pull_attempts`
MadLittleMods Sep 9, 2022
0b5f1db
Add `last_cause` column
MadLittleMods Sep 9, 2022
4ce7709
Merge branch 'develop' into madlittlemods/keep-track-when-we-tried-to…
MadLittleMods Sep 12, 2022
d3a1f84
Merge branch 'develop' into madlittlemods/keep-track-when-we-tried-to…
MadLittleMods Sep 13, 2022
1347686
Update schema version summary
MadLittleMods Sep 13, 2022
57182dc
Remove backfilled check since we plan to go general anyway
MadLittleMods Sep 14, 2022
e58bc65
Merge branch 'develop' into madlittlemods/keep-track-when-we-tried-to…
MadLittleMods Sep 14, 2022
70019d2
Move change to latest delta 73
MadLittleMods Sep 14, 2022
46a1a20
Merge branch 'madlittlemods/keep-track-when-we-tried-to-backfill-an-e…
MadLittleMods Sep 14, 2022
91c5be0
Merge branch 'develop' into madlittlemods/13622-do-not-retry-backfill…
MadLittleMods Sep 14, 2022
7ea40b1
Updates after schema changes in the other PR
MadLittleMods Sep 14, 2022
40ec8d8
Remove debug logging
MadLittleMods Sep 14, 2022
47aa375
Merge branch 'develop' into madlittlemods/13622-do-not-retry-backfill…
MadLittleMods Sep 22, 2022
1208540
Remove orthogonal `current_depth` changes
MadLittleMods Sep 22, 2022
a121bc3
Fix lints
MadLittleMods Sep 22, 2022
491aac6
Add context for why we have the is_state condition
MadLittleMods Sep 22, 2022
1 change: 1 addition & 0 deletions changelog.d/13635.feature
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Exponentially backoff from backfilling the same event over and over.
4 changes: 1 addition & 3 deletions synapse/handlers/federation.py
@@ -226,9 +226,7 @@ async def _maybe_backfill_inner(
"""
backwards_extremities = [
_BackfillPoint(event_id, depth, _BackfillPointType.BACKWARDS_EXTREMITY)
for event_id, depth in await self.store.get_oldest_event_ids_with_depth_in_room(
room_id
)
for event_id, depth in await self.store.get_backfill_points_in_room(room_id)
]

insertion_events_to_be_backfilled: List[_BackfillPoint] = []
188 changes: 147 additions & 41 deletions synapse/storage/databases/main/event_federation.py
@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import itertools
import logging
from queue import Empty, PriorityQueue
@@ -43,7 +44,7 @@
)
from synapse.storage.databases.main.events_worker import EventsWorkerStore
from synapse.storage.databases.main.signatures import SignatureWorkerStore
from synapse.storage.engines import PostgresEngine
from synapse.storage.engines import PostgresEngine, Sqlite3Engine
from synapse.types import JsonDict
from synapse.util import json_encoder
from synapse.util.caches.descriptors import cached
@@ -72,6 +73,13 @@

logger = logging.getLogger(__name__)

BACKFILL_EVENT_BACKOFF_UPPER_BOUND_SECONDS: int = int(
datetime.timedelta(days=7).total_seconds()
)
BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_SECONDS: int = int(
datetime.timedelta(hours=1).total_seconds()
)
Comment on lines +76 to +81 (erikjohnston marked this conversation as resolved.)
@MadLittleMods (Contributor, Author), Aug 27, 2022:

Any opinions on these values?

I chose 7 days because trying again next week seemed reasonable if someone's server was offline and then they got it working again.

This means it takes 8 attempts to get to the upper bound (2^7 = 128, 2^8 = 256)

2hr, 4hr, 8hr, 16hr, 32hr, 64hr, 128hr, (capped at 168hr from now on)


Although for really dead-end extremities, bumping the cap up to 30 days doesn't feel awful either.

We could even declare an extremity completely dead at a certain point and never retry it. We can iterate on this in a future PR though.

Reply (Contributor):

as a data point, for backoff from homeservers, we have

# a cap on the backoff. (Essentially none)
MAX_RETRY_INTERVAL = 2**62

'essentially no limit'.

I think I'm happy to leave it at 7 for now. I note that homeserver backoff timers are reset if the homeserver contacts us first, and I would be uncomfortable doing the same for backfill (unsure how much a room would break, but still).
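As a worked check of the schedule discussed above, the doubling-with-cap behaviour can be sketched in Python. The constant names here are illustrative; the 1-hour step and 7-day cap mirror `BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_SECONDS` and `BACKFILL_EVENT_BACKOFF_UPPER_BOUND_SECONDS` from the diff below:

```python
import datetime

STEP_SECONDS = int(datetime.timedelta(hours=1).total_seconds())        # 3600
UPPER_BOUND_SECONDS = int(datetime.timedelta(days=7).total_seconds())  # 604800

def backoff_seconds(num_attempts: int) -> int:
    # Doubles with each failed attempt (2h, 4h, 8h, ...) and caps at 7 days.
    return min((1 << num_attempts) * STEP_SECONDS, UPPER_BOUND_SECONDS)

schedule_hours = [backoff_seconds(n) // 3600 for n in range(1, 9)]
print(schedule_hours)  # [2, 4, 8, 16, 32, 64, 128, 168]
```

The cap is first hit on the 8th attempt (2^8 = 256 hours would exceed the 168-hour bound), matching the comment above.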



# All the info we need while iterating the DAG while backfilling
@attr.s(frozen=True, slots=True, auto_attribs=True)
@@ -715,96 +723,189 @@ def _get_auth_chain_difference_txn(

@trace
@tag_args
async def get_oldest_event_ids_with_depth_in_room(
self, room_id: str
async def get_backfill_points_in_room(
self,
room_id: str,
) -> List[Tuple[str, int]]:
"""Gets the oldest events(backwards extremities) in the room along with the
aproximate depth.

We use this function so that we can compare and see if someones current
depth at their current scrollback is within pagination range of the
event extremeties. If the current depth is close to the depth of given
oldest event, we can trigger a backfill.
"""
Gets the oldest events(backwards extremities) in the room along with the
approximate depth. Sorted by depth, highest to lowest (descending).

Args:
room_id: Room where we want to find the oldest events

Returns:
List of (event_id, depth) tuples
List of (event_id, depth) tuples. Sorted by depth, highest to lowest
(descending)
"""

def get_oldest_event_ids_with_depth_in_room_txn(
def get_backfill_points_in_room_txn(
txn: LoggingTransaction, room_id: str
) -> List[Tuple[str, int]]:
# Assemble a dictionary with event_id -> depth for the oldest events
# Assemble a tuple lookup of event_id -> depth for the oldest events
# we know of in the room. Backwards extremeties are the oldest
# events we know of in the room but we only know of them because
# some other event referenced them by prev_event and aren't peristed
# in our database yet (meaning we don't know their depth
# specifically). So we need to look for the aproximate depth from
# some other event referenced them by prev_event and aren't
# persisted in our database yet (meaning we don't know their depth
# specifically). So we need to look for the approximate depth from
# the events connected to the current backwards extremeties.
sql = """
SELECT b.event_id, MAX(e.depth) FROM events as e
SELECT backward_extrem.event_id, event.depth FROM events AS event
/**
* Get the edge connections from the event_edges table
* so we can see whether this event's prev_events points
* to a backward extremity in the next join.
*/
INNER JOIN event_edges as g
ON g.event_id = e.event_id
INNER JOIN event_edges AS edge
ON edge.event_id = event.event_id
/**
* We find the "oldest" events in the room by looking for
* events connected to backwards extremeties (oldest events
* in the room that we know of so far).
*/
INNER JOIN event_backward_extremities as b
ON g.prev_event_id = b.event_id
WHERE b.room_id = ? AND g.is_state is ?
GROUP BY b.event_id
INNER JOIN event_backward_extremities AS backward_extrem
ON edge.prev_event_id = backward_extrem.event_id
/**
* We use this info to make sure we don't retry to use a backfill point
* if we've already attempted to backfill from it recently.
*/
LEFT JOIN event_failed_pull_attempts AS failed_backfill_attempt_info
ON
failed_backfill_attempt_info.room_id = backward_extrem.room_id
AND failed_backfill_attempt_info.event_id = backward_extrem.event_id
WHERE
backward_extrem.room_id = ?
/* We only care about non-state edges because we used to use
* `event_edges` for two different sorts of "edges" (the current
* event DAG, but also a link to the previous state, for state
* events). These legacy state event edges can be distinguished by
* `is_state` and are removed from the codebase and schema but
* because the schema change is in a background update, it's not
* necessarily safe to assume that it will have been completed.
*/
AND edge.is_state is ? /* False */
/**
* Exponential back-off (up to the upper bound) so we don't retry the
* same backfill point over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
*
* We use `1 << n` as a power of 2 equivalent for compatibility
* with older SQLites. The left shift equivalent only works with
* powers of 2 because left shift is a binary operation (base-2).
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`.
*/
AND (
failed_backfill_attempt_info.event_id IS NULL
OR ? /* current_time */ >= failed_backfill_attempt_info.last_attempt_ts + /*least*/%s((1 << failed_backfill_attempt_info.num_attempts) * ? /* step */, ? /* upper bound */)
Comment on lines +787 to +798
Contributor:

Two things here.

1. If num_attempts is 10 then (1 << 10) * 3600000 overflows a 32-bit signed int and Postgres errors. See #13929 ("CS-API /messages requests fail on develop when using postgres due to NumericValueOutOfRange if some events have repeatedly failed to backfill") for the investigation.
2. (Minor) The name "step" for the constant 3600000 (1 hour in ms) suggests that the first backoff period should be 1 hour, but as noted above the first backoff period is 2 hours.

I'm attempting to PR a fix for (1).

Reply from @MadLittleMods (Contributor, Author):

PR to fix this: #13936
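The overflow in point (1) is straightforward to reproduce with plain integer arithmetic. A Python sketch (the 3600000 ms step matches the constant discussed above):

```python
INT32_MAX = 2**31 - 1  # upper bound of a signed 32-bit integer (Postgres int4)
STEP_MS = 3_600_000    # the 1-hour backoff step, in milliseconds

# At num_attempts = 10 the intermediate product exceeds the int4 range,
# which is what raised NumericValueOutOfRange in Postgres.
product = (1 << 10) * STEP_MS
print(product, product > INT32_MAX)  # 3686400000 True
```

At num_attempts = 9 the product (1,843,200,000) still fits, so the error only surfaces once an event has failed to pull at least 10 times.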

)
/**
* Sort from highest to the lowest depth. Then tie-break on
* alphabetical order of the event_ids so we get a consistent
* ordering which is nice when asserting things in tests.
*/
ORDER BY event.depth DESC, backward_extrem.event_id DESC
"""

txn.execute(sql, (room_id, False))
if isinstance(self.database_engine, PostgresEngine):
least_function = "least"
elif isinstance(self.database_engine, Sqlite3Engine):
least_function = "min"
else:
raise RuntimeError("Unknown database engine")

txn.execute(
sql % (least_function,),
(
room_id,
False,
self._clock.time_msec(),
1000 * BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_SECONDS,
1000 * BACKFILL_EVENT_BACKOFF_UPPER_BOUND_SECONDS,
),
)

return cast(List[Tuple[str, int]], txn.fetchall())

return await self.db_pool.runInteraction(
"get_oldest_event_ids_with_depth_in_room",
get_oldest_event_ids_with_depth_in_room_txn,
"get_backfill_points_in_room",
get_backfill_points_in_room_txn,
room_id,
)
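To sanity-check the SQLite branch of the portability shim above: SQLite's scalar `min()` together with its left-shift operator evaluates the same capped-backoff expression that Postgres computes with `least()`. A minimal sketch using Python's sqlite3, with the same 1-hour step and 7-day cap as the diff:

```python
import sqlite3

STEP_MS = 3_600_000            # 1 hour in ms (the backoff step)
CAP_MS = 7 * 24 * 3_600_000    # 7 days in ms (the upper bound)

conn = sqlite3.connect(":memory:")

def capped_backoff_ms(num_attempts: int) -> int:
    # Same expression as in the query, with SQLite's min() standing in
    # for Postgres's least(). SQLite integers are 64-bit, so no overflow.
    row = conn.execute(
        "SELECT min((1 << ?) * ?, ?)", (num_attempts, STEP_MS, CAP_MS)
    ).fetchone()
    return row[0]

print(capped_backoff_ms(1))   # 7200000   (2 hours)
print(capped_backoff_ms(10))  # 604800000 (capped at 7 days)
```

Note that SQLite's two-argument `min()` is a scalar function, distinct from the aggregate `min()`, which is why it can substitute for `least()` here.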

@trace
async def get_insertion_event_backward_extremities_in_room(
self, room_id: str
self,
room_id: str,
) -> List[Tuple[str, int]]:
"""Get the insertion events we know about that we haven't backfilled yet.

We use this function so that we can compare and see if someones current
depth at their current scrollback is within pagination range of the
insertion event. If the current depth is close to the depth of given
insertion event, we can trigger a backfill.
"""
Get the insertion events we know about that we haven't backfilled yet
along with the approximate depth. Sorted by depth, highest to lowest
(descending).

Args:
room_id: Room where we want to find the oldest events

Returns:
List of (event_id, depth) tuples
List of (event_id, depth) tuples. Sorted by depth, highest to lowest
(descending)
"""

def get_insertion_event_backward_extremities_in_room_txn(
txn: LoggingTransaction, room_id: str
) -> List[Tuple[str, int]]:
sql = """
SELECT b.event_id, MAX(e.depth) FROM insertion_events as i
SELECT
insertion_event_extremity.event_id, event.depth
/* We only want insertion events that are also marked as backwards extremities */
INNER JOIN insertion_event_extremities as b USING (event_id)
FROM insertion_event_extremities AS insertion_event_extremity
/* Get the depth of the insertion event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE b.room_id = ?
GROUP BY b.event_id
INNER JOIN events AS event USING (event_id)
/**
* We use this info to make sure we don't retry to use a backfill point
* if we've already attempted to backfill from it recently.
*/
LEFT JOIN event_failed_pull_attempts AS failed_backfill_attempt_info
ON
failed_backfill_attempt_info.room_id = insertion_event_extremity.room_id
AND failed_backfill_attempt_info.event_id = insertion_event_extremity.event_id
WHERE
insertion_event_extremity.room_id = ?
/**
* Exponential back-off (up to the upper bound) so we don't retry the
* same backfill point over and over. ex. 2hr, 4hr, 8hr, 16hr, etc
*
* We use `1 << n` as a power of 2 equivalent for compatibility
* with older SQLites. The left shift equivalent only works with
* powers of 2 because left shift is a binary operation (base-2).
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`.
*/
AND (
failed_backfill_attempt_info.event_id IS NULL
OR ? /* current_time */ >= failed_backfill_attempt_info.last_attempt_ts + /*least*/%s((1 << failed_backfill_attempt_info.num_attempts) * ? /* step */, ? /* upper bound */)
)
/**
* Sort from highest to the lowest depth. Then tie-break on
* alphabetical order of the event_ids so we get a consistent
* ordering which is nice when asserting things in tests.
*/
ORDER BY event.depth DESC, insertion_event_extremity.event_id DESC
"""

txn.execute(sql, (room_id,))
if isinstance(self.database_engine, PostgresEngine):
least_function = "least"
elif isinstance(self.database_engine, Sqlite3Engine):
least_function = "min"
else:
raise RuntimeError("Unknown database engine")

txn.execute(
sql % (least_function,),
(
room_id,
self._clock.time_msec(),
1000 * BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_SECONDS,
1000 * BACKFILL_EVENT_BACKOFF_UPPER_BOUND_SECONDS,
),
)
return cast(List[Tuple[str, int]], txn.fetchall())

return await self.db_pool.runInteraction(
@@ -1539,7 +1640,12 @@ async def get_next_staged_event_id_for_room(
self,
room_id: str,
) -> Optional[Tuple[str, str]]:
"""Get the next event ID in the staging area for the given room."""
"""
Get the next event ID in the staging area for the given room.

Returns:
Tuple of the `origin` and `event_id`
"""

def _get_next_staged_event_id_for_room_txn(
txn: LoggingTransaction,