-
-
Notifications
You must be signed in to change notification settings - Fork 890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split activity table into sent and received parts (fixes #3103) #3583
Conversation
The received activities are only stored in order to avoid processing the same incoming activity multiple times. For this purpose it is completely unnecessary to store the data. So we can split the table into sent_activity and received_activity parts, where only sent_activity table needs to store activity data. This should reduce storage use significantly. Also reduces activity storage duration to three months, we can reduce this further if necessary. Additionally the id columns of activity tables are removed because they are completely unused and risk overflowing (fixes #3560).
pub async fn create(pool: &DbPool, ap_id_: DbUrl) -> Result<Self, Error> { | ||
use crate::schema::received_activity::dsl::{ap_id, received_activity}; | ||
let conn = &mut get_conn(pool).await?; | ||
// TODO: use exists() first to avoid spamming log with conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use Insert.on_conflict_do_nothing(). exists() wouldn't be great because it would have a race condition between the check and the insert. you'll have to make the result optional because it will return None if it already exists
from activity | ||
where local = false; | ||
|
||
drop table activity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the table has probably grown very large for many instances, this is a pretty expensive migration - the table might have 100 million + rows for people that haven't manually cleaned it. So this would cause some significant downtime (hard to estimate).
It might make sense to add a filter to the copies as where local = x AND id > (select max(id) from activity) - 100000
so only the most recent 100k activities are copied.
The reason my suggestion uses ID and not published is because there is no index on published . on a related note, there probably should be an index on published. It's also used for the deletion scheduled query.
I was actually using the id column on sent activities on my (unfinished) experiment for a persistent and robust outgoing activity queue. So it would be helpful to keep for that purpose. You can easily replace it with a Since the published() column is always initialized with now() that would be a kind of replacement - but system time isn't really guaranteed to be monotonic (e.g. summer time due to naivedatetime use, NTP synchronization) and the sequential ID makes sharding of fetches from the database easy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is only because of disk space concerns, then this is probably an overly-complicated fix. Pictures are the disk-space killer, not the activity table.
This could be simplified by doing one of the fixes here: deleting every activity older than 3 months, or even 1 month, in scheduled jobs, as they mostly function as logging tables anyway.
The activity table is absolutely a disk space concern, on lemmy.ml it currently uses 24 gb while the next biggest table (comment_like) has only 470 mb. And the vast majority of stored activities are remote, meaning we dont have any reason to store the content. On lemmy.ml for the last month, there are 20 million remote and 4.5 million local activities stored. So the table split alone will reduce lemmy db usage by about 1/5. |
Updated:
|
Awesome!
I want to mention that wrapping it in a transaction does nothing here - Postgresql has no way of locking a non-existent row. So the exists() call does not actually prevent another task from inserting a row between the exists() and INSERT, regardless of there being a TX or not. I suggested on_conflict_do_nothing because I thought the goal was to not log this as an error, but if you want it to actually log the error then why add the check, the insert will already reliably throw the same error? Maybe rather use .map_err()?
In pure SQL If you do ON CONFLICT DO NOTHING RETURNING ID it returns a row if insert succeed and no row if it already existed. That should translate to |
The error needs to be inserted in Rust so that it can stop processing this duplicate activity. However it should not go to the postgres logs, which are currently full of these lines:
Will try your suggestion. |
The |
Looks like |
You tried with |
Alright I got it working now and also added test cases. Ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn, didn't realize it was that much space.
Migrations tested out fine tho.
… (LemmyNet#3583) * Split activity table into sent and received parts (fixes LemmyNet#3103) The received activities are only stored in order to avoid processing the same incoming activity multiple times. For this purpose it is completely unnecessary to store the data. So we can split the table into sent_activity and received_activity parts, where only sent_activity table needs to store activity data. This should reduce storage use significantly. Also reduces activity storage duration to three months, we can reduce this further if necessary. Additionally the id columns of activity tables are removed because they are completely unused and risk overflowing (fixes LemmyNet#3560). * address review * move insert_received_activity() methods to verify handlers * remove unnecessary conflict line * clippy * use on conflict, add tests
* Split activity table into sent and received parts (fixes #3103) The received activities are only stored in order to avoid processing the same incoming activity multiple times. For this purpose it is completely unnecessary to store the data. So we can split the table into sent_activity and received_activity parts, where only sent_activity table needs to store activity data. This should reduce storage use significantly. Also reduces activity storage duration to three months, we can reduce this further if necessary. Additionally the id columns of activity tables are removed because they are completely unused and risk overflowing (fixes #3560). * address review * move insert_received_activity() methods to verify handlers * remove unnecessary conflict line * clippy * use on conflict, add tests
The received activities are only stored in order to avoid processing the same incoming activity multiple times. For this purpose it is completely unnecessary to store the data. So we can split the table into sent_activity and received_activity parts, where only sent_activity table needs to store activity data. This should reduce storage use significantly.
Also reduces activity storage duration to three months, we can reduce this further if necessary.
Additionally the id columns of activity tables are removed because they are completely unused and risk overflowing (fixes #3560).