Import of historical data #1466
Do we need to show that difference at all? We should not allow any imported data for the period after your Plausible stats have started tracking. The graph and presentation can stay the same for both first-party and imported data.

The best way to show that there's some imported data in any specific view would be to introduce a little "i" icon, perhaps next to the download icon, that says something like "this date range includes some imported data". From there we can link to the docs that explain a bit more about the imported data and what it lacks compared to the first-party data. This "i" only shows up in views that feature imported data and is not there otherwise. Seems like a clean and minimal way of doing this. What do you think?

We just need to understand what the actual differences in usage will be between first-party and imported data. For instance, in the "month to date" view, will people be able to click on a day that has only imported data, like they can on a normal day that has Plausible data? If that feature is there, then at least for the top chart the third-party data has parity with the first-party data, so there's no real reason to visually differentiate between them.
Ah sorry, I should have clarified: currently data is only imported up to the date of the first Plausible data point. For testing I have disabled that so I could visualise things better, and that's what is in the image. The red and blue lines will never overlap in time.
Do you mean that they would form the same line (in the case of the visitors graph), and be counted together into the same counts in the other tables? I think this might work for some metrics (like the visitors graph), but for others it might not make sense from a statistics point of view. However, I can try to combine the stats for each panel as I add it to the feature, keeping this in mind, so for the time being we can assume they will be merged.
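To illustrate the cutoff described here, a minimal sketch in Python (not the project's own code). The function name `filter_imported_rows` and the `(date, visitors)` row layout are hypothetical, and treating the cutoff as "strictly before the first native date" is an assumption:

```python
from datetime import date


def filter_imported_rows(rows, first_native_date):
    """Keep only imported rows dated before the first native data point.

    `rows` is an iterable of (date, visitors) tuples. The name and layout
    are hypothetical, chosen only for illustration.
    """
    if first_native_date is None:
        # No native stats yet, so nothing can overlap: keep everything.
        return list(rows)
    # Assumption: the cutoff is exclusive, so imported and native data
    # never share a day.
    return [(day, visitors) for day, visitors in rows if day < first_native_date]
```

With a filter like this, the imported and native series cannot overlap in time.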
Yeah that's nice!
That would be ideal, and what I'm aiming for :) |
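As a rough illustration of what "counted together into the same counts" could look like for a breakdown panel, here is a small Python sketch. It assumes per-dimension counts are plain dictionaries and that imported and native periods never overlap. `merge_dimension_counts` is a hypothetical helper, not part of the codebase, and plain addition may not be statistically valid for every metric:

```python
from collections import Counter


def merge_dimension_counts(native, imported):
    """Sum native and imported counts per dimension value.

    Both arguments map a dimension value (e.g. a country code) to a count.
    Returns (value, count) pairs sorted by count descending, then by value.
    """
    merged = Counter(native)
    merged.update(imported)  # Counter.update adds counts rather than replacing them
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

For example, `merge_dimension_counts({"DE": 5, "FR": 2}, {"FR": 4, "US": 1})` puts `FR` first with a combined count of 6.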
That's great! And yeah, we keep imported data consistent visually with the native data. No differences at all in the visual presentation (same line, color, font etc). Then we describe any possible drawbacks with the third party data in the docs. Say "third party data cannot be aggregated with native data when we do the calculation for visit duration" or whatever the drawbacks end up being in the final implementation. |
@metmarkosaric @m-col - How does this work for events that are sent to GA (e.g. "viewed product", "added product to shopping cart", "made purchase of multiple items")? FYI - We've manually imported this historical data to ClickHouse via CSV imports. One pain point has been importing the event names as "goals", because the events are in ClickHouse and the goals table is in Postgres. As such, I've been exporting the goal list grouped by goal name to CSV and then importing the unique goal names to Postgres. My colleague asked why the goals table cannot show all "goals" directly from ClickHouse, grouped by goal name.
To update: in its current state, all of the required data is being imported from GA, and all of it is being merged into the Plausible dashboard in the corresponding panel and tab. There are still a few issues that need ironing out. I've tried to keep the checkbox list in the first post up to date, so they should be listed there.

There are a few questions whose answers would help to improve/fix the implementation.

Currently Plausible distinguishes between 4 device types in the dashboard: desktop, laptop, tablet, mobile. Google Analytics uses 3: desktop, tablet and mobile. However, I think the definitions are the same. An extract from the Plausible docs (link):

"Screen size" seems misleading here, because both the device screen resolution and the web page's viewport are valid and attainable metrics for a session, and "screen size" implies the former. GA exposes both screen size and browser size. The screen size thresholds that distinguish device types also seem odd. My low-end 5+ year old phone has a 1080x2160 screen size, and my laptop is 1920x1080. These would be considered a tablet and a desktop according to Plausible. It's possible that I've misunderstood how the calculations are done, or how the data is reported by the browser, though!

The relevance for this PR is that I'm wondering what the best way is to get the equivalent from GA. The options are the screen size or browser size, but also the device category directly. That latter one sounds like what the Plausible "screen size" is meant to be, so I wonder how they are deducing that information.

Regarding locations: with regions and cities on the way, it might make sense to also fetch these from GA at this point. They would likely have to be fetched by the time this PR is merged, even if cities and regions aren't yet tracked by Plausible, as the import is a one-shot "get everything then you no longer need your Google account" feature. Do you think I should work that into the ClickHouse table now and leave it out of the queries etc., or just leave it out altogether?

@ACPK The plan for our import is to consider each dimension individually, which makes import/export much easier, but that means that filtering and goals won't really be compatible. I tried importing everything into one table to enable this, but the data can't all be fetched together from Google (they limit what can be requested per query), leading to possibly duplicated visitor counts. Similarly, Fathom (like Plausible) exports data spread across individual CSVs, and this imported data also cannot be filtered.
@m-col Did you use Google Analytics 4? https://support.google.com/datastudio/answer/6370352?hl=en#zippy=%2Cin-this-article |
It's using Reporting API v4, which has equivalent methods and they have the same limits on dimensions etc. Is there an advantage to using Google Analytics Data API v1 (GA4) that I may have overlooked? |
About devices

Yes I think your criticism of the screen size thing in Plausible is correct. We are planning to stop using the viewport size and instead infer the device type from the User-Agent in the future. I believe this is what GA's deviceCategory means as well. So I think we should import We probably have to get rid of the 'laptop' category when we use the user-agent anyways, so it's OK if the import doesn't have any data for 'Laptop'.

Locations

Yeah, since the import cannot be run again in the future, it's best to get all of the data in one go. It would be great if GA could export the same identifiers that we use - @m-col how are you dealing with countries at the moment? Are we getting it as a name or as an identifier from GA? How is it merged with our own data?
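On the device question, a mapping from GA's three device categories onto the dashboard's labels could look roughly like the Python sketch below. The capitalised label strings and the name `map_device_category` are assumptions for illustration. 'Laptop' simply never appears for imported rows, as discussed above:

```python
# GA reports three device categories; the dashboard labels on the right are
# assumed spellings, not taken from the codebase.
GA_TO_DASHBOARD_DEVICE = {
    "desktop": "Desktop",
    "tablet": "Tablet",
    "mobile": "Mobile",
}


def map_device_category(ga_category):
    """Translate a GA device category into a dashboard device label.

    Returns None for anything unrecognised so the caller can decide whether
    to skip the row or file it under an "unknown" bucket.
    """
    if ga_category is None:
        return None
    return GA_TO_DASHBOARD_DEVICE.get(ga_category.strip().lower())
```

Returning `None` for unknown values keeps the decision about unexpected categories out of the mapping itself.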
Countries are being fetched as Regions can also be fetched in the format we want: "Users' region ISO code in ISO-3166-2 format, derived from their IP addresses or Geographical IDs." ( City is less clear: we have |
That's great for countries and regions! It does look like cities might be a pain. Let's see what we can do but even if they're missing from the import it's not a huge deal I think |
There are some updates I need to make in line with some of the new features, but in its current form it is working 100%! I'm first going to rebase and make those updates but will then be focussing on getting tests written for all of the changes. Any review/comments on its current form would be appreciated! |
The import is not yet run in the background. It's convenient for development for it to be synchronous so I'm leaving that until the end. |
Sweet! Sounds good @m-col |
Quick update: the PR is rebased and updated such that utm_term and utm_content are imported and merged, as are regions. Cities are added to the imported_locations table but are not imported from GA. As discussed on matrix, this is due to the GA city data not being compatible with the city data used by plausible, and so the data cannot be merged. The field is kept in the table for future imports from other sources. I am now continuing work on fixing current tests + adding new ones. |
Looks good overall.
I'm concerned about the big change from strings to atoms for some metrics. The state it's currently in feels very error-prone. Maybe the right solution is to move completely to using atoms.
Missing from my perspective:
- Tests
- Run the import job in the background
I think it would be useful to show an indicator to the user for what time period we will import. Currently I don't know if the user gets any feedback about that.
I'm getting some incompatibilities with the countries and regions from data imported from GA, which allegedly follows the ISO standard. Still investigating whether the issue is with GA or |
The ISO standard changes all the time. My local region had its ISO code change as recently as 2019. Maybe we need some mappings from older to newer ones. I like the patch; we had some cases on prod as well, so this is useful.
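A mapping from older to newer codes, as suggested above, might be as simple as the following Python sketch. The entries are deliberately fake placeholders (`XX-OLD`, `XX-NEW`). Real entries would have to come from the ISO 3166-2 change notices, and `normalize_region_code` is a hypothetical name:

```python
# Placeholder entries only: these are NOT real ISO 3166-2 codes.
LEGACY_REGION_CODES = {
    "XX-OLD": "XX-NEW",
}


def normalize_region_code(code, legacy_map=LEGACY_REGION_CODES):
    """Return the current code for a possibly outdated region code.

    Codes that are not in the legacy map are passed through unchanged
    (after trimming whitespace and upper-casing).
    """
    if code is None:
        return None
    cleaned = code.strip().upper()
    return legacy_map.get(cleaned, cleaned)
```

Passing unknown codes through unchanged means the mapping only needs to list codes that actually changed.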
Happy to submit it as a standalone PR if you'd like it sooner. |
@m-col Will importing CSVs be part of this PR or an additional PR? |
It won't be part of this PR. Importing via CSVs is something we have discussed and are open to adding, but there are no immediate plans to implement it.
I'm seeing all region values fetched from GA being |
That's disappointing but yeah, let's attempt to import in case it changes |
Also changes a conditional to be a bit nicer
GA has only a "source" dimension and no "UTM source" dimension. Instead it returns these combined. The logic herein to tease these apart is:
1. "(direct)" -> it's a direct source
2. if the source is a domain -> it's a source
3. "google" -> it's from adwords; let's make this a UTM source "adwords"
4. else -> just a UTM source
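The four rules above can be sketched in Python as follows. This is an illustration of the described logic, not the code from the commit. The domain check is a simple regex chosen for the example, and representing a direct source as `(None, None)` is an assumption:

```python
import re

# Rough "looks like a domain" check: dot-separated labels ending in a TLD.
# The actual domain test in the commit may differ.
_DOMAIN_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE)


def parse_ga_source(ga_source):
    """Split GA's combined "source" value into (source, utm_source)."""
    if ga_source == "(direct)":
        # Rule 1: direct traffic has neither a source nor a UTM source.
        return (None, None)
    if _DOMAIN_RE.fullmatch(ga_source):
        # Rule 2: a domain is treated as a regular source.
        return (ga_source, None)
    if ga_source == "google":
        # Rule 3: bare "google" is assumed to come from AdWords.
        return (None, "adwords")
    # Rule 4: anything else is kept as a UTM source.
    return (None, ga_source)
```

For example, `parse_ga_source("example.com")` yields a plain source, while `parse_ga_source("newsletter")` yields a UTM source.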
Rebased. What kind of timeline do you envisage for merging this feature? |
Thanks @m-col. The plan is to integrate it this week and start testing with customers next week. |
Thanks for all the work @m-col. We'll do some internal testing tomorrow and real user testing next week. |
This was completed in #1753 |
Changes
As a bit of an update, here is the current state of my work on importing data from Google Analytics (as a first step - generic CSV input to be added later).
My ongoing to do list:
* We must also consider CSV-imported data. Because it covers a number of tables, it will be imported from multiple CSVs, so a representation of what is available and the ability to delete individual components may be needed, to avoid having to import and validate a whole batch of CSVs simultaneously.
Tests
Changelog
Documentation