Precalculate actions #4625
@EDsCODE I'm mainly replying to this comment from you on Slack:
> I'm excited for this – I think it's probably making the right tradeoff. What do you think reasonable refresh rates would be (let's use Cloud as an example)? One thing I'd add: for this to work well, I think it's critical that anywhere we're using one of these pre-computed values, it's clear to the user when it was last computed wherever it's being displayed (and ideally, there's a way to click refresh and have it update in real time).
Note this is also the piece that falls flat on its face the most, failing in inconvenient ways (e.g. not running, not having enough CPU, breaking between releases, etc.).
I assume this proposal as-is is talking about adding another table that contains an action_id <-> event_id mapping. I'd love good metrics around how much this would win us, but without them I'm a bit wary about this proposal. In broad terms, we are limited by the following factors: size-per-event, query performance, compute, and correctness. Everything that touches every event affects all 4 of these, potentially driving our cost up significantly.

**Size-per-event**

As we scale, this (along with eng cost) is going to be the largest line item on our bill. Adding a mapping table means that we add a relevant percentage to our … Note that person tables are small enough (rule of thumb: 100x smaller) that doubling that data, e.g. in cohort caching tables, is not an issue, and we can replicate the table onto each DB node.

**Query performance**

How do you shard this intermediary table? If it's doing a join over the network, it's going to be slower than any compute you're doing locally. We don't have the know-how to do colocated sharding on ClickHouse, and this table is too big to replicate onto all nodes (see size-per-event). Even an on-disk join is not ideal – it would be great to have measurements.

**Compute**

How do you update the actions table? This is bound by CPU and/or network I/O on either the ClickHouse or the Celery nodes. You can't be using … This is the piece that fell over the most on open-source Postgres because of how expensive it is. Note the plugin server is also doing some on-the-fly calculations now.

**Correctness**

What Kunal said above.

**Alternative proposal**

The root problem being solved here is not laid out in the ticket, but I assume it has to do with action predicates being expensive to match for larger clients. This is because they need to deserialize a whole bunch of JSON and then do regex matching on the values. Assuming that is correct, one potential solution which might work is: automatically add MATERIALIZED columns onto the events table for queries that need a speedup (see the sketch at the end of this comment).
The idea is similar to what Heap does with partial indexes.
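For comparison, a generic sketch of a Heap-style partial index in Postgres – the table name and the predicate here are made up for illustration, not Heap's or PostHog's actual schema:

```sql
-- Hypothetical sketch: a partial index covers only the rows matching one
-- action's predicate, so queries for that action scan a small index
-- instead of filtering the whole events table.
CREATE INDEX idx_action_signup_pageviews
ON posthog_event (timestamp)
WHERE event = '$pageview'
  AND properties ->> '$current_url' LIKE 'https://app.example.com/signup%';
```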
IMO it's not worth pulling every action out, since small companies don't need a speedup and each extra column adds a tiny bit of extra cost. Instead, consider this:
Since actions can get updated, naming the column something like … Not 100% sure this works, since I haven't done measurements, but thought it was worth writing down.
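To make the materialized-column idea concrete, here is a minimal sketch in ClickHouse SQL. It assumes an `events` table with a JSON `properties` column; the column name, the `_v1` version suffix, and the URL pattern are hypothetical, not PostHog's actual schema:

```sql
-- Hedged sketch: materialize the JSON extraction an action predicate needs,
-- so it is computed once at insert time instead of on every query. The _v1
-- suffix is a hypothetical way to handle action edits: an edited action
-- would get a fresh _v2 column rather than reusing stale values.
ALTER TABLE events
    ADD COLUMN mat_current_url_v1 String
    MATERIALIZED JSONExtractString(properties, '$current_url');

-- The action predicate can then regex-match the plain column instead of
-- deserializing JSON for every row.
SELECT count()
FROM events
WHERE match(mat_current_url_v1, '^https://app\\.example\\.com/signup');
```

One caveat worth noting: a MATERIALIZED column is populated for newly inserted rows; for pre-existing parts the expression is evaluated on read until those parts get rewritten, so the speedup only fully applies to data ingested after the column is added.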
If we wanted to save event-action relationships to ClickHouse too (as we do in the Postgres pipeline already), this will actually be perfectly possible right away when PostHog/plugin-server#436 gets merged. We're adding this feature to the plugin server so that it can do matching in memory for webhooks, REST hooks, … Though, like Karl, I'm worried about scalability here – these associative tables would likely end up huge row-count-wise, and correctness would be made more difficult, since when the action is changed, we'd have to invalidate all its dynamically found matches and recalculate everything, this time in SQL.
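For a sense of what that invalidate-and-recalculate step implies, here is a rough sketch in Postgres-style SQL; the table names, the action id, and the predicate are all illustrative assumptions:

```sql
-- Hypothetical sketch: when action 42's definition changes, every stored
-- match is stale, so the whole mapping must be rebuilt from scratch –
-- a full scan of the events table per edited action.
DELETE FROM posthog_action_event WHERE action_id = 42;

INSERT INTO posthog_action_event (action_id, event_id)
SELECT 42, e.id
FROM posthog_event e
WHERE e.event = '$pageview'
  AND e.properties ->> '$current_url' ~ '^https://app\.example\.com/';
```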
This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label.
This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.
**Is your feature request related to a problem? Please describe.**
Performance improvement. On Postgres, we precalculate action-event relationships so that at query time we do not need to compute which events are categorized by an action.
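As a rough illustration of that precalculation (the table names are assumptions for the example, not the exact PostHog schema), querying an action on Postgres becomes a simple join against stored matches rather than re-evaluating the action's predicates over every event:

```sql
-- With matches precalculated into a mapping table, fetching an action's
-- events is a join on indexed ids instead of JSON parsing + regex per row.
SELECT e.*
FROM posthog_event e
JOIN posthog_action_event ae ON ae.event_id = e.id
WHERE ae.action_id = 42;
```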
**Describe the solution you'd like**
We can now do the same on ClickHouse, rather than calculating actions at query time. After #4622, we've figured out how to use CollapsingMergeTree to maintain updates on a mutable ClickHouse table. This has allowed us to improve queries that use cohorts, because cohorts are now precalculated in the background.
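For reference, a minimal sketch of how a CollapsingMergeTree-backed action-event mapping could work; the table definition and names are assumptions for illustration, not the schema from #4622:

```sql
-- Hedged sketch (assumed names): an action <-> event mapping kept mutable
-- via CollapsingMergeTree. Rows are never updated in place; a row is
-- cancelled by inserting an identical row with sign = -1, and the engine
-- collapses matching +1/-1 pairs during background merges.
CREATE TABLE action_events
(
    action_id  UInt64,
    event_uuid UUID,
    sign       Int8    -- 1 = match exists, -1 = cancels a prior match
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (action_id, event_uuid);

-- Record a match; "delete" it later by re-inserting it with sign = -1.
INSERT INTO action_events VALUES (42, generateUUIDv4(), 1);

-- Reads must collapse not-yet-merged pairs themselves, e.g. by summing sign.
SELECT action_id, sum(sign) AS matched_events
FROM action_events
GROUP BY action_id
HAVING matched_events > 0;
```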
**Describe alternatives you've considered**
Thank you for your feature request – we love each and every one!