
Split activity table into sent and received parts (fixes #3103) #3583

Merged · 8 commits into main · Jul 14, 2023

Conversation

@Nutomic (Member) commented Jul 11, 2023

Received activities are only stored to avoid processing the same incoming activity multiple times; for that purpose there is no need to keep the activity data itself. So we can split the table into sent_activity and received_activity parts, where only the sent_activity table needs to store activity data. This should reduce storage use significantly.

Also reduces activity storage duration to three months; we can reduce this further if necessary.

Additionally, the id columns of the activity tables are removed, because they are completely unused and risk overflowing (fixes #3560).
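As a rough sketch of what the split might look like in diesel's schema (a hedged illustration, not the schema actually merged; column names and types are inferred from the discussion below, e.g. bigserial ids, json instead of jsonb, and no data column on received_activity):

// Hypothetical sketch of the split schema; the real column set may differ.
diesel::table! {
  sent_activity (id) {
    id -> Int8,        // bigserial, safe from 32-bit overflow
    ap_id -> Text,
    data -> Json,      // json rather than jsonb, for smaller size
    sensitive -> Bool,
    published -> Timestamp,
  }
}

diesel::table! {
  received_activity (id) {
    id -> Int8,
    ap_id -> Text,     // unique; used only for deduplication
    published -> Timestamp,
  }
}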

@Nutomic Nutomic requested a review from dessalines as a code owner July 11, 2023 10:06
pub async fn create(pool: &DbPool, ap_id_: DbUrl) -> Result<Self, Error> {
  use crate::schema::received_activity::dsl::{ap_id, received_activity};
  let conn = &mut get_conn(pool).await?;
  // TODO: use exists() first to avoid spamming log with conflicts
@phiresky (Collaborator) commented Jul 11, 2023

You can use Insert::on_conflict_do_nothing(). exists() wouldn't be great, because it would introduce a race condition between the check and the insert. You'll have to make the result optional, because it will return None if the row already exists.

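-- excerpt from the migration under review: the tail of the statement that
-- copies remote rows into received_activity, followed by dropping the old table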
from activity
where local = false;

drop table activity;
@phiresky (Collaborator) commented Jul 11, 2023

Since the table has probably grown very large on many instances, this is a pretty expensive migration: the table might have 100+ million rows for people who haven't manually cleaned it. So this would cause some significant downtime (hard to estimate).

It might make sense to add a filter to the copies, such as where local = x AND id > (select max(id) from activity) - 100000, so only the most recent 100k activities are copied.

The reason my suggestion uses id and not published is that there is no index on published. On a related note, there probably should be an index on published; it's also used by the scheduled deletion query.

@phiresky (Collaborator) commented

I was actually using the id column on sent activities in my (unfinished) experiment with a persistent and robust outgoing activity queue, so it would be helpful to keep it for that purpose. You can easily replace it with a bigserial to solve the wrapping problem, though.

Since the published column is always initialized with now(), it would be a kind of replacement, but system time isn't really guaranteed to be monotonic (e.g. summer time due to NaiveDateTime use, NTP synchronization), and a sequential id makes sharding of fetches from the database easy.

@dessalines (Member) left a comment:

If this is only because of disk space concerns, then this is probably an overly complicated fix. Pictures are the disk-space killer, not the activity table.

This could be simplified by deleting every activity older than three months, or even one month, in the scheduled jobs, as these mostly function as logging tables anyway.
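For reference, such a scheduled cleanup could look roughly like this with diesel's IntervalDsl (a hedged sketch: the schema path and use of a synchronous PgConnection are assumptions, and the actual Lemmy job may differ):

use diesel::{
  dsl::{now, IntervalDsl},
  ExpressionMethods, PgConnection, QueryDsl, RunQueryDsl,
};

/// Delete rows older than three months from the activity table
/// (sketch; assumes `crate::schema::activity` with a `published` column).
fn clear_old_activities(conn: &mut PgConnection) -> Result<usize, diesel::result::Error> {
  use crate::schema::activity::dsl::{activity, published};
  diesel::delete(activity.filter(published.lt(now - 3.months()))).execute(conn)
}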

@Nutomic (Member, Author) commented Jul 13, 2023


> If this is only because of disk space concerns, then this is probably an overly complicated fix. Pictures are the disk-space killer, not the activity table.
>
> This could be simplified by deleting every activity older than three months, or even one month, in the scheduled jobs, as these mostly function as logging tables anyway.

The activity table is absolutely a disk space concern: on lemmy.ml it currently uses 24 GB, while the next biggest table (comment_like) has only 470 MB. And the vast majority of stored activities are remote, meaning we don't have any reason to store their content. On lemmy.ml over the last month there are 20 million remote and 4.5 million local activities stored, so the table split alone will cut activity storage to roughly a fifth of its current size.

@Nutomic (Member, Author) commented Jul 13, 2023


Updated:

  • added bigserial id columns
  • use json instead of jsonb for smaller size
  • limit the migration to 100k rows for sent and 1m for received (not sure if those numbers are good)
  • use exists() and a transaction to check that a sent activity wasn't inserted before; on_conflict_do_nothing() doesn't work because it doesn't actually return an Option, so there is no way to tell if the value was inserted or if it conflicted with an existing row
  • move insert_received_activity() from the receive to the verify handlers, so that an activity which fails verification doesn't get verified multiple times

@phiresky (Collaborator) commented

Awesome!

> use exists() and a transaction to check that a sent activity wasn't inserted before; on_conflict_do_nothing()

I want to mention that wrapping it in a transaction does nothing here: PostgreSQL has no way of locking a non-existent row. So the exists() call does not actually prevent another task from inserting a row between the exists() and the INSERT, regardless of whether there is a transaction.

I suggested on_conflict_do_nothing() because I thought the goal was to not log this as an error. But if you want it to actually log the error, then why add the check? The insert will already reliably throw the same error. Maybe rather use .map_err()?

> on_conflict_do_nothing() doesn't work

In pure SQL, if you do INSERT … ON CONFLICT DO NOTHING RETURNING id, it returns a row if the insert succeeded and no row if it already existed. That should translate to .on_conflict_do_nothing().returning(received_activity::id).get_result(conn).optional(), though I haven't tried it.

@Nutomic (Member, Author) commented Jul 13, 2023


> I suggested on_conflict_do_nothing() because I thought the goal was to not log this as an error. But if you want it to actually log the error, then why add the check? The insert will already reliably throw the same error. Maybe rather use .map_err()?

The error needs to be raised in Rust so that it can stop processing of the duplicate activity. However it should not go to the Postgres logs, which are currently full of these lines:

postgres_1  | 2023-07-13 10:17:23.039 UTC [32101] STATEMENT:  INSERT INTO "activity" ("data", "local", "updated", "ap_id", "sensitive") VALUES ($1, $2, DEFAULT, $3, $4) RETURNING "activity"."id", "activity"."data", "activity"."local", "activity"."published", "activity"."updated", "activity"."ap_id", "activity"."sensitive"
postgres_1  | 2023-07-13 10:17:30.145 UTC [32108] ERROR:  duplicate key value violates unique constraint "idx_activity_ap_id"
postgres_1  | 2023-07-13 10:17:30.145 UTC [32108] DETAIL:  Key (ap_id)=(https://lemmy.ml/activities/dislike/ff0f47ab-896b-4d52-bf49-60ad290bf5cb) already exists.
postgres_1  | 2023-07-13 10:17:30.145 UTC [32108] STATEMENT:  INSERT INTO "activity" ("data", "local", "updated", "ap_id", "sensitive") VALUES ($1, $2, DEFAULT, $3, $4) RETURNING "activity"."id", "activity"."data", "activity"."local", "activity"."published", "activity"."updated", "activity"."ap_id", "activity"."sensitive"
postgres_1  | 2023-07-13 10:17:37.193 UTC [32069] ERROR:  duplicate key value violates unique constraint "idx_activity_ap_id"
postgres_1  | 2023-07-13 10:17:37.193 UTC [32069] DETAIL:  Key (ap_id)=(https://lemmy.ml/activities/dislike/a0077be3-5490-4764-b135-f95af9e57d9c) already exists.

Will try your suggestion.

@Nutomic (Member, Author) commented Jul 13, 2023


The returning() line is also used when the insert is successful, not only in case of a conflict. So it doesn't tell us whether the activity is a duplicate.

@phiresky (Collaborator) commented Jul 13, 2023


The returning line is only used if the insert was successful. If it wasn't, it won't return anything. Check this example:

[screenshot of an example query, omitted]

You can maybe add it before the on_conflict_do_nothing() to make it less confusing.

@Nutomic (Member, Author) commented Jul 13, 2023


Looks like returning works differently between Postgres and diesel then. At least I can't find any way to get an Option as the result.

@phiresky (Collaborator) commented Jul 13, 2023


Did you try with .optional() after .get_result(), after importing https://docs.diesel.rs/master/diesel/result/trait.OptionalExtension.html? It's an extension trait for some reason.
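Put together, the approach phiresky describes could look roughly like this (a hedged sketch assuming diesel_async, a bigserial id column, and a crate::schema::received_activity module; the code actually merged may differ):

use diesel::{insert_into, ExpressionMethods, OptionalExtension};
use diesel_async::{AsyncPgConnection, RunQueryDsl};

/// Record an incoming activity id. Returns Ok(true) if it was newly
/// inserted and Ok(false) if a row with this ap_id already existed,
/// i.e. the activity is a duplicate and should not be processed again.
async fn record_received_activity(
  conn: &mut AsyncPgConnection,
  ap_id_: &str,
) -> Result<bool, diesel::result::Error> {
  use crate::schema::received_activity::dsl::{ap_id, id, received_activity};
  let inserted: Option<i64> = insert_into(received_activity)
    .values(ap_id.eq(ap_id_))
    .on_conflict_do_nothing()
    .returning(id)
    .get_result(conn)
    .await
    .optional()?; // None means the insert hit the unique constraint
  Ok(inserted.is_some())
}

The caller can then turn Ok(false) into a Rust-level duplicate error without Postgres logging a unique-constraint violation.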

@Nutomic (Member, Author) commented Jul 14, 2023


Alright, I got it working now and also added test cases. Ready to merge.

@dessalines (Member) left a comment:


Damn, didn't realize it was that much space.

Migrations tested out fine tho.

@dessalines dessalines merged commit e9e7654 into main Jul 14, 2023
Nutomic added a commit to cetra3/lemmy that referenced this pull request Jul 19, 2023:

… (LemmyNet#3583)

* Split activity table into sent and received parts (fixes LemmyNet#3103)
* address review
* move insert_received_activity() methods to verify handlers
* remove unnecessary conflict line
* clippy
* use on conflict, add tests
Nutomic added a commit that referenced this pull request Jul 21, 2023, with the same message.
Successfully merging this pull request may close these issues:

[Bug]: activity table will overflow 32bit signed int at some point (#3560)