(Do Not Merge) GoogleSearchConsole - Keywords Missing Data #53617
Conversation
The first approach worked locally (sending solo API requests without syncs), but I didn't test reading the streams. Instead of nullifying these fields, I'm now testing complete removal to replicate the request exactly as it appears in the docs. Testing reads locally did produce records, so hopefully this solves the issue.
Shows no lost or added records. I will test again on the user's connection to see if there is any change.
@@ -407,6 +407,7 @@ def stream_slices(
         for keyword in keywords:
             filters = {"dimension": "searchAppearance", "operator": "equals", "expression": keyword}
             stream_slice["dimensionFilterGroups"] = [{"groupType": "and", "filters": filters}]
+            stream_slice["dimensions"] = self.dimensions
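A minimal, self-contained sketch of what this hunk does. The helper name `build_keyword_slices`, its parameters, and the base-slice contents are assumptions for illustration, not actual connector code:

```python
def build_keyword_slices(base_slice: dict, keywords: list, dimensions: list) -> list:
    """For each searchAppearance keyword, emit a slice that carries both
    the equals-filter and the stream's own top-level "dimensions"."""
    slices = []
    for keyword in keywords:
        stream_slice = dict(base_slice)  # copy so slices don't share mutable state
        filters = {"dimension": "searchAppearance", "operator": "equals", "expression": keyword}
        stream_slice["dimensionFilterGroups"] = [{"groupType": "and", "filters": filters}]
        # the fix in this PR: include the dimensions at the top level of the slice
        stream_slice["dimensions"] = dimensions
        slices.append(stream_slice)
    return slices
```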
Based on the GSC docs, the second request should be formatted like this:
{
"startDate": "2018-05-01",
"endDate": "2018-05-31",
"type": "web",
"dimensions": [
"device" // and/or page, country, ...
],
"dimensionFilterGroups": [
{
"filters": [
{
"dimension": "searchAppearance",
"operator": "equals",
"expression": "AMP_BLUE_LINK"
}
]
}
]
}
But in our case, we were sending the following:
{
"site_url": "https%3A%2F%2Fairbyte.io%2F",
"search_type": "web",
"start_date": "2024-09-05",
"end_date": "2024-09-07",
"data_state": "final",
"dimensionFilterGroups": [
{
"groupType": "and",
"filters": {
"dimension": "searchAppearance",
"operator": "equals",
"expression": "PRODUCT_SNIPPETS"
}
}
]
}
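To make the intended change concrete, here is a hypothetical helper that builds the second (per-keyword) request body in the docs' format, with `dimensions` at the top level plus the searchAppearance equals-filter. The helper name and parameters are assumptions, not connector code:

```python
def build_request_body(start_date: str, end_date: str, keyword: str, dimensions: list) -> dict:
    """Build a searchanalytics query body matching the GSC docs example:
    top-level dimensions plus a searchAppearance equals-filter."""
    return {
        "startDate": start_date,
        "endDate": end_date,
        "type": "web",
        "dimensions": dimensions,  # e.g. ["device"], and/or page, country, ...
        "dimensionFilterGroups": [
            {
                "filters": [
                    {
                        "dimension": "searchAppearance",
                        "operator": "equals",
                        "expression": keyword,
                    }
                ]
            }
        ],
    }
```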
I'm hoping that by adding the missing dimensions defined for these streams at the top level of the stream slice, we will get the missing records reported by the user.
What
Users are reporting missing data on the `keywords` streams: https://github.com/airbytehq/oncall/issues/7045
Users have checked raw tables and found no data, and have attempted full refreshes but still see missing data.
The issue seems to be related to how we get records with the GSC API.
GSC Keyword analytics docs: https://developers.google.com/webmaster-tools/v1/how-tos/all-your-data#search-appearance-data
How
For these streams to sync all the data, we first need to get all search appearance types. This returns a list of `keys` that we then use in subsequent requests, together with all the dimensions, to get all the records.
I suspect the issue occurs in the first request, when we attempt to obtain the list of keys. I noticed that sometimes the GSC API does not return any keys even though some should exist. In those cases, we just get a response like the one below:
In my local tests, I replicated the request & started to nullify/remove these fields: `aggregationType`, `startRow`, `rowLimit`, & `dataState`. Below is the original request:
The idea is to replicate the exact request format used in the example in the GSC docs & hope this will return the keys from the API. (These fields are omitted in the example in the docs: `aggregationType`, `startRow`, `rowLimit`, & `dataState`.) Once I removed those fields, I started to get records.
I tried to test each individual field but was getting different results each time. For now, I've decided to keep `dataState` and `rowLimit` and nullify the rest. For `dataState`, I don't think it matters in this request, but I'm not sure & don't want to risk losing records. For `rowLimit`, it is a global variable & I didn't want to risk breaking other things that rely on it by changing it.
As I was typing this out, I ran another test locally where I kept all the fields currently in production, and strangely, this time I was able to get records.
I checked on GSC docs & found this:
Due to this, I suspect the GSC API sometimes limits the data it returns based on our request body parameters. With this PR, we hope to run regression tests and see whether we get more records after removing the fields that are unnecessary for this API request.
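A minimal sketch of the field-stripping being tested, assuming the first searchAppearance request body is a plain dict. Per the discussion above, `dataState` and `rowLimit` are kept and only `aggregationType` and `startRow` are dropped; the helper name is hypothetical:

```python
# Fields the GSC docs example omits and that we are removing for the
# first searchAppearance request; dataState & rowLimit are kept for now.
FIELDS_TO_DROP = ("aggregationType", "startRow")

def slim_request_body(body: dict) -> dict:
    """Return a copy of the request body without the suspect fields."""
    return {key: value for key, value in body.items() if key not in FIELDS_TO_DROP}
```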
Review guide
User Impact
Can this PR be safely reverted and rolled back?