Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Time on Page metric to Top Pages report #1007

Merged
merged 14 commits into from
May 18, 2021

Conversation

Vigasaurus
Copy link
Contributor

@Vigasaurus Vigasaurus commented May 8, 2021

Changes

Title.

Couple of things here to discuss though -

  1. Yes, adding this 4th column of data to the Top Pages report isn't great for the already rough mobile UI - but I figured I'd leave that to Mobile dashboard design / UX #972 to resolve instead of me causing unnecessary conflicts
  2. Currently I've just set up the naive approach to time on page (where we can't compute the time for the exit page of a session) - but if you want, I could add the better method in the form of a beacon and creating a new type of event, exit, such that it won't count as a pageview, and will always end a session. (There is another thing to discuss here on if you think using the visibility change method for ending sessions is appropriate - as this can have false positives with users swapping tabs) lmk what you think.
  3. This query isn't exactly super fast, but I'm not sure if that's just my amount of data vs the speed of my dev environment, so this may need some testing with some of the larger data sets. (I was getting about 1.5ish seconds to load the 100 top pages with time on page for a period with 2M pageviews)
  4. I have a couple of other code-based/style questions, but I'll put those in line-comments

Tests

  • Automated tests have been added

Changelog

  • Entry has been added to changelog

Documentation

  • Docs have been updated

Screenshots

image

This was due to Clickhouse setup not inserting the sessions with the exact same timestamp consistently and making the test inconsistent
Copy link
Contributor

@ukutaht ukutaht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

Currently I've just set up the naive approach

That's fine. We'll do some experiments with sendBeacon in the future but I think this is good enough for now.

This query isn't exactly super fast, but I'm not sure if that's just my amount of data vs the speed of my dev environment

I did some testing. With sampling enabled for our biggest client, it's taking about 4-5s on our production environment. I think that's acceptable for a detail view which is supposed to calculate more stuff. Some sort of sendBeacon approach would certainly be much faster because then we wouldn't need to calculate neighbours or anything like that, it would be stored in a table like any other metric.

In terms of performance, I would like to get the bounce_rate and time_on_page queries running in parallel so the user can get their page rendered quicker.

@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented May 14, 2021

As for improving performance with beacons, we may have some different ideas there - my intention was that beacons would only add a exit type of event, and that this same query could then look to also calculate the time delta between pageviews and exits (only for the exit page) to get the more accurate time on page - but we can discuss that implementation further down the line I guess.

@Vigasaurus Vigasaurus requested a review from ukutaht May 14, 2021 18:22
@ukutaht
Copy link
Contributor

ukutaht commented May 17, 2021

I noticed that filters that act on the sessions table do not affect time on page calculation.

For example when you click on a source from Top Sources or an entry page from the entry pages list, it should also filter down and show me time on page values for the subset of visitors included in the filter. Currently when using a filter like source= or entry_page=, the time on page values shown do not account for the filter.

@Vigasaurus
Copy link
Contributor Author

Ah yeah good call - I pretty much fully forgot that sessions could get filtered on haha

@Vigasaurus
Copy link
Contributor Author

@ukutaht The concerns on the session-based filtering are all sorted now - I did also now put in a fix where a page filter was completely breaking the calculations, since it was trying to only calculate neighbors with same-page transitions (whoops).

One other big thing I just added though, is some better timeout and error handling with the parallelized tasks, in the old method if either one failed or timed out - the entire request would exit, now, if one errors or times out (more likely to be the time on page, on slower hardware/larger data sets), it gracefully handles it and just returns no data for that column instead of no data for any of the columns. I'm not entirely certain that my Elixir here is great though, so lmk what you think there. I have the timeout set at 15 seconds which I think is a reasonable timeout where any data possible should be sent and the rest aborted, but if you want to increase that, it could be.

@ukutaht
Copy link
Contributor

ukutaht commented May 18, 2021

Everything looks good @Vigasaurus!

@ukutaht ukutaht merged commit 41e4690 into plausible:master May 18, 2021
@ukutaht
Copy link
Contributor

ukutaht commented May 18, 2021

As for improving performance with beacons, we may have some different ideas there - my intention was that beacons would only add a exit type of event, and that this same query could then look to also calculate the time delta between pageviews and exits (only for the exit page) to get the more accurate time on page - but we can discuss that implementation further down the line I guess.

I think we're mostly on the same page about how to design this thing. The way you've described it sounds good.

We already have an in-memory record of each session along with the latest page the person viewed in the exit_page field and the timestamp of that pageview in the timestamp field. So if we add a page_exit event or something like that, it would be trivial calculate the duration of the latest pageview during ingest. My suggestion was that the duration can be calculated already during ingestion time rather than at query time.

Adding a page_exit event or something like that is certainly the most straightforward way of going about it.

A potential solution could also just update the latest pageview record with a duration field when the page_exit event is received by the backend. Instead of storing it as a separate event, it could amend the existing pageview record. That way there's less stuff to join when querying.

@Vigasaurus
Copy link
Contributor Author

Yeah I think those are all really good ways to go about it. I honestly think that last method could work pretty well for maybe every type of pageview or exit - such that the closest previous event gets the duration set on ingest of its forward neighbor. Only potential concern I have there is how we'd control which query needs to happen - this version or the manual neighbor version from this PR; maybe something like the grandfathered plans to pick a date when the new query is usable and only use it if the date range ends before it.

@metmarkosaric
Copy link
Contributor

Thanks @Vigasaurus for this! Just testing it on staging. When I drill down and filter by a page, the top graph doesn't show time on page. Would be nice to display it next to the bounce rate there on top.

@Vigasaurus
Copy link
Contributor Author

Good idea - will do 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants