
Split activity table into sent and received parts (fixes #3103) #3583

Merged · 8 commits into main · Jul 14, 2023

Conversation

@Nutomic (Member) commented Jul 11, 2023

Received activities are only stored to avoid processing the same incoming activity multiple times; for that purpose there is no need to keep the activity data itself. So we can split the table into sent_activity and received_activity parts, where only the sent_activity table needs to store activity data. This should reduce storage use significantly.

Also reduces activity storage duration to three months; we can reduce this further if necessary.

Additionally, the id columns of the activity tables are removed, because they are completely unused and risk overflowing (fixes #3560).
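As a rough sketch of what the split might look like in diesel's schema (a hedged illustration, not the schema actually merged; column names and types are inferred from the discussion below, e.g. bigserial ids, json instead of jsonb, and no data column on received_activity):

// Hypothetical sketch of the split schema; the real column set may differ.
diesel::table! {
  sent_activity (id) {
    id -> Int8,        // bigserial, safe from 32-bit overflow
    ap_id -> Text,
    data -> Json,      // json rather than jsonb, for smaller size
    sensitive -> Bool,
    published -> Timestamp,
  }
}

diesel::table! {
  received_activity (id) {
    id -> Int8,
    ap_id -> Text,     // unique; used only for deduplication
    published -> Timestamp,
  }
}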

@Nutomic Nutomic requested a review from dessalines as a code owner July 11, 2023 10:06
pub async fn create(pool: &DbPool, ap_id_: DbUrl) -> Result<Self, Error> {
  use crate::schema::received_activity::dsl::{ap_id, received_activity};
  let conn = &mut get_conn(pool).await?;
  // TODO: use exists() first to avoid spamming log with conflicts
@phiresky (Collaborator) commented Jul 11, 2023

You can use Insert::on_conflict_do_nothing(). exists() wouldn't be great, because it would introduce a race condition between the check and the insert. You'll have to make the result optional, because it will return None if the row already exists.

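-- excerpt from the migration under review: the tail of the statement that
-- copies remote rows into received_activity, followed by dropping the old table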
from activity
where local = false;

drop table activity;
@phiresky (Collaborator) commented Jul 11, 2023

Since the table has probably grown very large on many instances, this is a pretty expensive migration: the table might have 100+ million rows for people who haven't manually cleaned it. So this would cause some significant downtime (hard to estimate).

It might make sense to add a filter to the copies, such as where local = x AND id > (select max(id) from activity) - 100000, so only the most recent 100k activities are copied.

The reason my suggestion uses id and not published is that there is no index on published. On a related note, there probably should be an index on published; it's also used by the scheduled deletion query.

@phiresky (Collaborator) commented

I was actually using the id column on sent activities in my (unfinished) experiment with a persistent and robust outgoing activity queue, so it would be helpful to keep it for that purpose. You can easily replace it with a bigserial to solve the wrapping problem, though.

Since the published column is always initialized with now(), it would be a kind of replacement, but system time isn't really guaranteed to be monotonic (e.g. summer time due to NaiveDateTime use, NTP synchronization), and a sequential id makes sharding of fetches from the database easy.

@dessalines (Member) left a comment:

If this is only because of disk space concerns, then this is probably an overly complicated fix. Pictures are the disk-space killer, not the activity table.

This could be simplified by deleting every activity older than three months, or even one month, in the scheduled jobs, as these mostly function as logging tables anyway.
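For reference, such a scheduled cleanup could look roughly like this with diesel's IntervalDsl (a hedged sketch: the schema path and use of a synchronous PgConnection are assumptions, and the actual Lemmy job may differ):

use diesel::{
  dsl::{now, IntervalDsl},
  ExpressionMethods, PgConnection, QueryDsl, RunQueryDsl,
};

/// Delete rows older than three months from the activity table
/// (sketch; assumes `crate::schema::activity` with a `published` column).
fn clear_old_activities(conn: &mut PgConnection) -> Result<usize, diesel::result::Error> {
  use crate::schema::activity::dsl::{activity, published};
  diesel::delete(activity.filter(published.lt(now - 3.months()))).execute(conn)
}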

@Nutomic (Member, Author) commented Jul 13, 2023


> If this is only because of disk space concerns, then this is probably an overly complicated fix. Pictures are the disk-space killer, not the activity table.
>
> This could be simplified by deleting every activity older than three months, or even one month, in the scheduled jobs, as these mostly function as logging tables anyway.

The activity table is absolutely a disk space concern: on lemmy.ml it currently uses 24 GB, while the next biggest table (comment_like) has only 470 MB. And the vast majority of stored activities are remote, meaning we don't have any reason to store their content. On lemmy.ml over the last month there are 20 million remote and 4.5 million local activities stored, so the table split alone will cut activity storage to roughly a fifth of its current size.

@Nutomic (Member, Author) commented Jul 13, 2023


Updated:

  • added bigserial id columns
  • use json instead of jsonb for smaller size
  • limit the migration to 100k rows for sent and 1m for received (not sure if those numbers are good)
  • use exists() and a transaction to check that a sent activity wasn't inserted before; on_conflict_do_nothing() doesn't work because it doesn't actually return an Option, so there is no way to tell if the value was inserted or if it conflicted with an existing row
  • move insert_received_activity() from the receive to the verify handlers, so that an activity which fails verification doesn't get verified multiple times

@phiresky (Collaborator) commented

Awesome!

> use exists() and a transaction to check that a sent activity wasn't inserted before; on_conflict_do_nothing()

I want to mention that wrapping it in a transaction does nothing here: PostgreSQL has no way of locking a non-existent row. So the exists() call does not actually prevent another task from inserting a row between the exists() and the INSERT, regardless of whether there is a transaction.

I suggested on_conflict_do_nothing() because I thought the goal was to not log this as an error. But if you want it to actually log the error, then why add the check? The insert will already reliably throw the same error. Maybe rather use .map_err()?

> on_conflict_do_nothing() doesn't work

In pure SQL, if you do INSERT … ON CONFLICT DO NOTHING RETURNING id, it returns a row if the insert succeeded and no row if it already existed. That should translate to .on_conflict_do_nothing().returning(received_activity::id).get_result(conn).optional(), though I haven't tried it.

@Nutomic (Member, Author) commented Jul 13, 2023


> I suggested on_conflict_do_nothing() because I thought the goal was to not log this as an error. But if you want it to actually log the error, then why add the check? The insert will already reliably throw the same error. Maybe rather use .map_err()?

The error needs to be raised in Rust so that it can stop processing of the duplicate activity. However it should not go to the Postgres logs, which are currently full of these lines:

postgres_1  | 2023-07-13 10:17:23.039 UTC [32101] STATEMENT:  INSERT INTO "activity" ("data", "local", "updated", "ap_id", "sensitive") VALUES ($1, $2, DEFAULT, $3, $4) RETURNING "activity"."id", "activity"."data", "activity"."local", "activity"."published", "activity"."updated", "activity"."ap_id", "activity"."sensitive"
postgres_1  | 2023-07-13 10:17:30.145 UTC [32108] ERROR:  duplicate key value violates unique constraint "idx_activity_ap_id"
postgres_1  | 2023-07-13 10:17:30.145 UTC [32108] DETAIL:  Key (ap_id)=(https://lemmy.ml/activities/dislike/ff0f47ab-896b-4d52-bf49-60ad290bf5cb) already exists.
postgres_1  | 2023-07-13 10:17:30.145 UTC [32108] STATEMENT:  INSERT INTO "activity" ("data", "local", "updated", "ap_id", "sensitive") VALUES ($1, $2, DEFAULT, $3, $4) RETURNING "activity"."id", "activity"."data", "activity"."local", "activity"."published", "activity"."updated", "activity"."ap_id", "activity"."sensitive"
postgres_1  | 2023-07-13 10:17:37.193 UTC [32069] ERROR:  duplicate key value violates unique constraint "idx_activity_ap_id"
postgres_1  | 2023-07-13 10:17:37.193 UTC [32069] DETAIL:  Key (ap_id)=(https://lemmy.ml/activities/dislike/a0077be3-5490-4764-b135-f95af9e57d9c) already exists.

Will try your suggestion.

@Nutomic (Member, Author) commented Jul 13, 2023


The returning() line is also used when the insert is successful, not only in case of a conflict. So it doesn't tell us whether the activity is a duplicate.

@phiresky (Collaborator) commented Jul 13, 2023


The returning line is only used if the insert was successful. If it wasn't, it won't return anything. Check this example:

[screenshot of an example query, omitted]

You can maybe add it before the on_conflict_do_nothing() to make it less confusing.

@Nutomic (Member, Author) commented Jul 13, 2023


Looks like returning works differently between Postgres and diesel then. At least I can't find any way to get an Option as the result.

@phiresky (Collaborator) commented Jul 13, 2023


Did you try with .optional() after .get_result(), after importing https://docs.diesel.rs/master/diesel/result/trait.OptionalExtension.html? It's an extension trait for some reason.
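Put together, the approach phiresky describes could look roughly like this (a hedged sketch assuming diesel_async, a bigserial id column, and a crate::schema::received_activity module; the code actually merged may differ):

use diesel::{insert_into, ExpressionMethods, OptionalExtension};
use diesel_async::{AsyncPgConnection, RunQueryDsl};

/// Record an incoming activity id. Returns Ok(true) if it was newly
/// inserted and Ok(false) if a row with this ap_id already existed,
/// i.e. the activity is a duplicate and should not be processed again.
async fn record_received_activity(
  conn: &mut AsyncPgConnection,
  ap_id_: &str,
) -> Result<bool, diesel::result::Error> {
  use crate::schema::received_activity::dsl::{ap_id, id, received_activity};
  let inserted: Option<i64> = insert_into(received_activity)
    .values(ap_id.eq(ap_id_))
    .on_conflict_do_nothing()
    .returning(id)
    .get_result(conn)
    .await
    .optional()?; // None means the insert hit the unique constraint
  Ok(inserted.is_some())
}

The caller can then turn Ok(false) into a Rust-level duplicate error without Postgres logging a unique-constraint violation.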

@Nutomic (Member, Author) commented Jul 14, 2023


Alright, I got it working now and also added test cases. Ready to merge.

@dessalines (Member) left a comment:


Damn, didn't realize it was that much space.

Migrations tested out fine tho.

@dessalines dessalines merged commit e9e7654 into main Jul 14, 2023
Nutomic added a commit to cetra3/lemmy that referenced this pull request Jul 19, 2023:

… (LemmyNet#3583)

* Split activity table into sent and received parts (fixes LemmyNet#3103)
* address review
* move insert_received_activity() methods to verify handlers
* remove unnecessary conflict line
* clippy
* use on conflict, add tests
Nutomic added a commit that referenced this pull request Jul 21, 2023, with the same message.
Successfully merging this pull request may close these issues:

[Bug]: activity table will overflow 32bit signed int at some point (#3560)