Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Consistent High postgres CPU usage from federation worker calling community_follower #3958

Closed
4 tasks done
dessalines opened this issue Sep 10, 2023 · 10 comments
Closed
4 tasks done
Labels
area: database area: federation support federation via activitypub bug Something isn't working

Comments

@dessalines
Copy link
Member

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

#3605 Introduced a consistent, high-CPU postgres usage bug coming from the the federation worker queue.

The query is: SELECT DISTINCT "community"."id", coalesce("person"."shared_inbox_url", "person"."inbox_url") FROM (("community_follower" INNER JOIN "community" ON ("community_follower"."community_id" = "community"."id")) INNER JOIN "person" ON ("community_follower"."person_id" = "person"."id")) WHERE (((("person"."instance_id" = $1) AND "community"."local") AND NOT ("person"."local")) AND ("community_follower"."published" > $2))

The source function used by the new federation worker is this one: https://github.com/LemmyNet/lemmy/blob/main/crates/db_views_actor/src/community_follower_view.rs#L18

Its called about 12x every second.

cc @phiresky

Steps to Reproduce

See above

Technical Details

See above

Version

main

Lemmy Instance URL

No response

@dessalines dessalines added bug Something isn't working area: federation support federation via activitypub area: database labels Sep 10, 2023
@dessalines dessalines changed the title [Bug]: High postgres CPU usage from federation worker calling community_follower [Bug]: Consistent High postgres CPU usage from federation worker calling community_follower Sep 10, 2023
@phiresky
Copy link
Collaborator

phiresky commented Sep 10, 2023 via email

@phiresky
Copy link
Collaborator

phiresky commented Sep 10, 2023 via email

@dessalines
Copy link
Member Author

Its with debug, I'm testing increasing that now.

I increased it to 10 seconds, and there's still a ton of duped queries:

Screenshot_2023-09-10-11-02-58-748_com termux

They do complete fast, but they spike the CPU quite a bit.

@phiresky
Copy link
Collaborator

It's executed separately for every federated instance that's why there's many

@phiresky
Copy link
Collaborator

if they are indexed correctly they should take <1ms .. maybe an index is missing?

@dessalines
Copy link
Member Author

I am running it with lemmy.ml's prod DB, but here are the indexes:

Screenshot_2023-09-10-11-08-46-576_com termux

Could it be an issue with using NaiveDateTime ? I thought we should only be using DateTime<UTC>, but that is converting it to a naivedatetime:

.filter(community_follower::published.gt(published_since.naive_utc()))

@phiresky
Copy link
Collaborator

phiresky commented Sep 10, 2023

So frequent queries here is kinda expected with the current code. If there's 1000 instaences there's 1000 queues and each of them loads new followers every 60s, so that's ~15 per second. Since the instance_id is not part of the community_followers table it has to join to person which makes it a bit more expensive.

When I tested this it didn't seem to be much of an issue because each query was very cheap. Also this is kind of part of what I meant with "increases overhead for small instances".

On an active instance this shouldn't be too relevant because it replaces the previous follower queries that were always in the top lists of pg_stat_statements (where before it was O(n) wrt the activity count and now is O(1))

Now I'm running the query on a db and I am seeing 17ms exeution time so something is definitely wrong there, I'll check. My DB was missing the migration so now execution time is more like expected, 0.1ms execution (but 1.0 ms planning which is a fairly large amount). Can you do EXPLAIN ANALYZE on your DB? Set the published filter to 10s ago and use a random community_id.

The execution count could also be reduced, both by increasing the recheck delay and by adding more complicated logic or another table like create materialized view instance_followers as SELECT person.instance_id, "community"."id", min(community_follower.published) min_published, coalesce("person"."shared_inbox_url", "person"."inbox_url") as url FROM (("community_follower" INNER JOIN "community" ON ("community_follower"."community_id" = "community"."id")) INNER JOIN "person" ON ("community_follower"."person_id" = "person"."id")) WHERE ((("community"."local") AND NOT ("person"."local"))) group by person.instance_id, community.id, url;

The easiest improvement here would be to increase the delay. Something like 5min might be fine. This delay is only relevant the very first time a person from a different instance subscribes to a community.

@phiresky
Copy link
Collaborator

phiresky commented Sep 10, 2023

This is the query it replaces from 0.18.x SELECT DISTINCT coalesce("person"."shared_inbox_url", "person"."inbox_url") FROM ("community_follower" INNER JOIN "person" ON ("community_follower"."person_id" = "person"."id")) WHERE (("community_follower"."community_id" = $1) AND NOT ("person"."local")).

This query is(was) executed very frequently on an active instance and was in the top 3 most expensive queries if I remember correctly ( called once per user action)

@dessalines
Copy link
Member Author

Okay I've found one issue:

Screenshot_2023-09-12-15-41-33-360_com termux

Appears I need to add an index for create index on person (local desc, instance_id); which appears to fix it:

Screenshot_2023-09-12-15-45-03-788_com termux

I'll have a PR for that in a sec.

@dessalines
Copy link
Member Author

I'm not sure what the defaults should be, I'll let you comment on the PR for those.

phiresky added a commit that referenced this issue Sep 21, 2023
Fixing high CPU usage on federation worker recheck + fix federation tests. Fixes #3958
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: database area: federation support federation via activitypub bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants