Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precalculate actions #4625

Closed
EDsCODE opened this issue Jun 8, 2021 · 6 comments
Closed

Precalculate actions #4625

EDsCODE opened this issue Jun 8, 2021 · 6 comments
Assignees
Labels
enhancement New feature or request feature/actions Feature Tag: Actions performance Has to do with performance. For PRs, runs the clickhouse query performance suite stale

Comments

@EDsCODE
Copy link
Member

EDsCODE commented Jun 8, 2021

Is your feature request related to a problem?

performance improvement
Please describe.
On postgres, we precalculate action-event relationships so that at query time we do not need to compute the events that are categorized by an action

Describe the solution you'd like

We can now do the same on clickhouse rather than calculating actions at query time. After #4622, we've figured out how to use collapsingmergetree to maintain updates on a mutable clickhouse table. This has allowed us to improve queries that use cohorts because now cohorts are precalculated in the background.

Describe alternatives you've considered

  • keep as is

Thank you for your feature request – we love each and every one!

@EDsCODE EDsCODE added enhancement New feature or request clickhouse labels Jun 8, 2021
@EDsCODE EDsCODE self-assigned this Jun 8, 2021
@kpthatsme
Copy link
Contributor

@EDsCODE I'm mainly replying to this comment from you on Slack:

Our system will start relying on eventual consistency very heavily but this is a worthwhile tradeoff because we can boost resources on our end to precalculate more frequently and keep the user experience as smooth as possible. If eventual consistency still proves to be an issue because users want data up-to-date to the second, we can add queries that specifically query for ranges not covered by the precaclulated groupings

I'm excited for this I think it's probably making the right tradeoff – what do you think reasonable refresh rates would be (let's use cloud as an example)?

One thing I'd add is for this to work well I think it's critical that anywhere we're using one of these pre-computed values, it's clear to the user when it was last computed wherever it's being displayed (and ideally, a way to click refresh and have it update in RT).

@macobo
Copy link
Contributor

macobo commented Jun 8, 2021

On postgres, we precalculate action-event relationships so that at query time we do not need to compute the events that are categorized by an action

Note this is also the piece that falls over flat on it's face the most, failing in inconvenient ways (e.g. not running, not having enough CPU, breaking between releases, etc).

@macobo
Copy link
Contributor

macobo commented Jun 8, 2021

I assume this proposal as-is is talking about adding another table that contains action_id <-> event_id mapping.

I'd love good metrics around how much this would win us, but without them I'm a bit wary about this proposal.

On broad terms, we are limited by the following factors:

  • ingestion (handled by team extensibility)
  • size-per-event
  • query performance
  • compute (async)
  • correctness

Everything that touches every event affects all 4 of these, potentially driving our cost up significantly.

Size-per-event

As we scale, this (along with eng cost) is gonna be the largest lineitem on our bill. Adding a mapping table means that we add a relevant percentage to our

Note that person tables are small enough (rule-of-thumb 100x) that doubling that data e.g. in cohorts caching tables is not an issue and we can replicate the table onto each db node.

Query performance

How do you shard this intermediary table?

If it's doing a join over the network it's going to be slower than any compute you're doing locally. We don't have the know-how on how to do colocated sharding on clickhouse and this table is too big to be replicating on all nodes (see size-per-event).

If it's doing a on-disk join even that is not ideal - would be great to have measurements.

Compute

How do you update the actions table? This is bound by CPU and/or netio on either clickhouse or on celery nodes.

You can't be using INSERT INTO as with cohorts. Why: way way too slow, would likely slow down other queries due to disk usage/cpu on clickhouse nodes, would not react gracefully to e.g. clickhouse being down or other operations.

This is the piece that fell over the most on open-source postgres because of the expense of this.

Note plugin server is also doing some on-the-fly calculations now.

Correctness

What kunal said above


Alternative proposal

The root problem being solved here is not laid out in the ticket, but I assume it's to do with action predicates being expensive to match on larger clients. This is because they need to deserialize a whole bunch of json and then do regex matching on the values.

Assuming that is correct, one potential solution which might work is:

Automatically add MATERIALIZED columns onto events table for queries that need speedup

This:

  • Skips the expensive deserialization/matching of the json with action, hence speeding things up
  • No joins needed, sharded/managed alongside the events table
  • Work is done ingestion-time from kafka, so no need to muck with workers/separate compute
  • Increases ingestion db load/compute only by as much as the predicate themselves at query-time
  • Correctness should be there assuming action definitions changing is handled gracefully
  • Size-per-event stays similar, since non-matching columns containing NULLs are not stored by ch

The idea is similar to what Heap does with partial indexes:

IMO it's not worth pulling every action out since small companies don't need a speedup and each extra column adds a tiny bit of extra cost.

Instead, consider this:

  1. We add a comment to each query with e.g. the company_id action_id etc.
  2. We have a celery job which reads system.query_log, identifying queries/companies that need speedup and materializing columns as needed.
  3. At query-time we look up the materialized columns from a cache

Since actions can get updated, naming the column something like action_{id}_{action_hash}, where action_hash is hash of the action definiton should help get around inconsistency issues.

Not 100% sure this works since haven't done measurements, but thought it was worth writing down.

@macobo macobo added the performance Has to do with performance. For PRs, runs the clickhouse query performance suite label Jun 8, 2021
@Twixes
Copy link
Member

Twixes commented Jun 8, 2021

If we wanted to save event-actions relationships to ClickHouse too (as we do in the Postgres pipeline already), this will actually be perfectly possible right away when PostHog/plugin-server#436 gets merged. We're adding this feature to the plugin server, so that it can do matching in memory for webhooks, REST hooks, onAction plugin methods and possibly more.

Though, like Karl, I'm worried about scalability here – these associative tables would likely end up huge row count-wise, and correctness would be made more difficult, since when the action is changed, we'd have to invalidate all its dynamically found matches and recalculate everything, this time in SQL.

@Twixes Twixes removed the stale label Aug 26, 2021
@paolodamico paolodamico added feature/actions Feature Tag: Actions and removed clickhouse labels Feb 22, 2022
@posthog-bot
Copy link
Contributor

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

@posthog-bot
Copy link
Contributor

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

@posthog-bot posthog-bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request feature/actions Feature Tag: Actions performance Has to do with performance. For PRs, runs the clickhouse query performance suite stale
Projects
None yet
Development

No branches or pull requests

6 participants