Precalculate actions #4625

EDsCODE · 2021-06-08T01:22:19Z

Is your feature request related to a problem?

performance improvement
Please describe.
On postgres, we precalculate action-event relationships so that at query time we do not need to compute the events that are categorized by an action

Describe the solution you'd like

We can now do the same on clickhouse rather than calculating actions at query time. After #4622, we've figured out how to use collapsingmergetree to maintain updates on a mutable clickhouse table. This has allowed us to improve queries that use cohorts because now cohorts are precalculated in the background.

Describe alternatives you've considered

keep as is

Thank you for your feature request – we love each and every one!

kpthatsme · 2021-06-08T02:04:40Z

@EDsCODE I'm mainly replying to this comment from you on Slack:

Our system will start relying on eventual consistency very heavily but this is a worthwhile tradeoff because we can boost resources on our end to precalculate more frequently and keep the user experience as smooth as possible. If eventual consistency still proves to be an issue because users want data up-to-date to the second, we can add queries that specifically query for ranges not covered by the precaclulated groupings

I'm excited for this I think it's probably making the right tradeoff – what do you think reasonable refresh rates would be (let's use cloud as an example)?

One thing I'd add is for this to work well I think it's critical that anywhere we're using one of these pre-computed values, it's clear to the user when it was last computed wherever it's being displayed (and ideally, a way to click refresh and have it update in RT).

macobo · 2021-06-08T06:07:48Z

On postgres, we precalculate action-event relationships so that at query time we do not need to compute the events that are categorized by an action

Note this is also the piece that falls over flat on it's face the most, failing in inconvenient ways (e.g. not running, not having enough CPU, breaking between releases, etc).

macobo · 2021-06-08T06:29:25Z

I assume this proposal as-is is talking about adding another table that contains action_id <-> event_id mapping.

I'd love good metrics around how much this would win us, but without them I'm a bit wary about this proposal.

On broad terms, we are limited by the following factors:

ingestion (handled by team extensibility)
size-per-event
query performance
compute (async)
correctness

Everything that touches every event affects all 4 of these, potentially driving our cost up significantly.

Size-per-event

As we scale, this (along with eng cost) is gonna be the largest lineitem on our bill. Adding a mapping table means that we add a relevant percentage to our

Note that person tables are small enough (rule-of-thumb 100x) that doubling that data e.g. in cohorts caching tables is not an issue and we can replicate the table onto each db node.

Query performance

How do you shard this intermediary table?

If it's doing a join over the network it's going to be slower than any compute you're doing locally. We don't have the know-how on how to do colocated sharding on clickhouse and this table is too big to be replicating on all nodes (see size-per-event).

If it's doing a on-disk join even that is not ideal - would be great to have measurements.

Compute

How do you update the actions table? This is bound by CPU and/or netio on either clickhouse or on celery nodes.

You can't be using INSERT INTO as with cohorts. Why: way way too slow, would likely slow down other queries due to disk usage/cpu on clickhouse nodes, would not react gracefully to e.g. clickhouse being down or other operations.

This is the piece that fell over the most on open-source postgres because of the expense of this.

Note plugin server is also doing some on-the-fly calculations now.

Correctness

What kunal said above

Alternative proposal

The root problem being solved here is not laid out in the ticket, but I assume it's to do with action predicates being expensive to match on larger clients. This is because they need to deserialize a whole bunch of json and then do regex matching on the values.

Assuming that is correct, one potential solution which might work is:

Automatically add MATERIALIZED columns onto events table for queries that need speedup

This:

Skips the expensive deserialization/matching of the json with action, hence speeding things up
No joins needed, sharded/managed alongside the events table
Work is done ingestion-time from kafka, so no need to muck with workers/separate compute
Increases ingestion db load/compute only by as much as the predicate themselves at query-time
Correctness should be there assuming action definitions changing is handled gracefully
Size-per-event stays similar, since non-matching columns containing NULLs are not stored by ch

The idea is similar to what Heap does with partial indexes:

IMO it's not worth pulling every action out since small companies don't need a speedup and each extra column adds a tiny bit of extra cost.

Instead, consider this:

We add a comment to each query with e.g. the company_id action_id etc.
We have a celery job which reads system.query_log, identifying queries/companies that need speedup and materializing columns as needed.
At query-time we look up the materialized columns from a cache

Since actions can get updated, naming the column something like action_{id}_{action_hash}, where action_hash is hash of the action definiton should help get around inconsistency issues.

Not 100% sure this works since haven't done measurements, but thought it was worth writing down.

Twixes · 2021-06-08T08:58:48Z

If we wanted to save event-actions relationships to ClickHouse too (as we do in the Postgres pipeline already), this will actually be perfectly possible right away when PostHog/plugin-server#436 gets merged. We're adding this feature to the plugin server, so that it can do matching in memory for webhooks, REST hooks, onAction plugin methods and possibly more.

Though, like Karl, I'm worried about scalability here – these associative tables would likely end up huge row count-wise, and correctness would be made more difficult, since when the action is changed, we'd have to invalidate all its dynamically found matches and recalculate everything, this time in SQL.

posthog-bot · 2024-03-22T07:31:39Z

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

posthog-bot · 2024-04-08T07:31:47Z

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

EDsCODE added enhancement New feature or request clickhouse labels Jun 8, 2021

EDsCODE self-assigned this Jun 8, 2021

EDsCODE mentioned this issue Jun 8, 2021

Release planning 1.27.0 1/2 #4582

Closed

macobo added the performance Has to do with performance. For PRs, runs the clickhouse query performance suite label Jun 8, 2021

posthog-bot added the stale label Aug 10, 2021

Twixes removed the stale label Aug 26, 2021

EDsCODE mentioned this issue Sep 13, 2021

Including actions in paths #5928

Closed

paolodamico added feature/actions Feature Tag: Actions and removed clickhouse labels Feb 22, 2022

posthog-bot added the stale label Mar 22, 2024

posthog-bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Precalculate actions #4625

Precalculate actions #4625

EDsCODE commented Jun 8, 2021

kpthatsme commented Jun 8, 2021

macobo commented Jun 8, 2021

macobo commented Jun 8, 2021 •

edited

Loading

Twixes commented Jun 8, 2021 •

edited

Loading

posthog-bot commented Mar 22, 2024

posthog-bot commented Apr 8, 2024

Precalculate actions #4625

Precalculate actions #4625

Comments

EDsCODE commented Jun 8, 2021

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Thank you for your feature request – we love each and every one!

kpthatsme commented Jun 8, 2021

macobo commented Jun 8, 2021

macobo commented Jun 8, 2021 • edited Loading

Size-per-event

Query performance

Compute

Correctness

Alternative proposal

Twixes commented Jun 8, 2021 • edited Loading

posthog-bot commented Mar 22, 2024

posthog-bot commented Apr 8, 2024

macobo commented Jun 8, 2021 •

edited

Loading

Twixes commented Jun 8, 2021 •

edited

Loading