(Do Not Merge) GoogleSearchConsole - Keywords Missing Data #53617

Open · wants to merge 4 commits into master

Conversation

agarctfi (Contributor)

What

Users are reporting missing data in the keywords streams (https://github.com/airbytehq/oncall/issues/7045).
They have checked the raw tables, found no data, and attempted full refreshes, but data is still missing.
The issue appears to be related to how we fetch records from the GSC API.
GSC Keyword analytics docs: https://developers.google.com/webmaster-tools/v1/how-tos/all-your-data#search-appearance-data

How

For these streams to sync all the data, we first need to fetch all search appearance types. This returns a list of keys; we then make a subsequent request per key, with all the dimensions, to get all the records.
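The two-phase flow described above can be sketched as follows. This is an illustrative sketch, not the connector's actual code; the function names and the exact dimension list are assumptions.

```python
def phase_one_body(start_date: str, end_date: str) -> dict:
    # Phase 1: request only the searchAppearance dimension to list the keys
    # (e.g. "PRODUCT_SNIPPETS", "AMP_BLUE_LINK").
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": ["searchAppearance"],
        "type": "web",
    }

def phase_two_body(start_date: str, end_date: str, keyword: str, dimensions: list[str]) -> dict:
    # Phase 2: request all the stream's dimensions, filtered to one
    # searchAppearance key returned by phase 1.
    return {
        "startDate": start_date,
        "endDate": end_date,
        "type": "web",
        "dimensions": dimensions,
        "dimensionFilterGroups": [{
            "filters": [{
                "dimension": "searchAppearance",
                "operator": "equals",
                "expression": keyword,
            }]
        }],
    }
```

Phase 2 is repeated once per key from phase 1, so an empty phase-1 response means no phase-2 requests and no records at all.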

I suspect the issue occurs in the first request, when we attempt to obtain the list of keys. I noticed that the GSC API sometimes returns no keys even though some should exist. In those cases, we get a response like the one below:

{'responseAggregationType': 'byPage'}

In my local tests, I replicated the request and started to nullify/remove fields: aggregationType, startRow, rowLimit, and dataState.

Below is the original request:

URL: "https://www.googleapis.com/webmasters/v3/sites/https%3A%2F%2Fairbyte.io%2F/searchAnalytics/query"
BODY:
{
    "startDate": "2024-09-05",
    "endDate": "2024-09-07",
    "dimensions": ["searchAppearance"],
    "type": "web",
    "aggregationType": "auto",
    "startRow": 0,
    "rowLimit": 25000,
    "dataState": "final"
}

My idea is to replicate the exact request format used in the example in the GSC docs and hope this returns the keys from the API. (These fields are omitted in the docs example: aggregationType, startRow, rowLimit, and dataState.) Once I removed those fields, I started to get records:

{'rows': [{'keys': ['PRODUCT_SNIPPETS'], 'clicks': 0, 'impressions': 4, 'ctr': 0, 'position': 7}], 'responseAggregationType': 'byPage'}
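Extracting the keys from the two responses above makes the failure mode concrete: when the API drops the "rows" field entirely, no keys are produced, so no downstream requests are made. The helper below is a hypothetical sketch, not the connector's code.

```python
def extract_search_appearance_keys(response: dict) -> list[str]:
    # The first entry of each row's "keys" is a search-appearance type.
    # A response without "rows" (the empty case above) yields no keys,
    # which cascades into zero records for the whole stream.
    return [row["keys"][0] for row in response.get("rows", [])]
```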

I tried to test each individual field but was getting different results each time. However, for now, I've decided to keep dataState and rowLimit and nullify the rest. For dataState, I don't think it matters in this request, but I'm not sure and don't want to risk losing records. rowLimit is a global variable, and I didn't want to risk breaking other code that relies on it by changing it.

As I was typing this out, I ran another test locally where I kept all the fields currently in production, and strangely, this time I was able to get records.
I checked the GSC docs and found this:

When you group by page and/or query, our system may drop some data in order to be able to calculate results in a reasonable time using a reasonable amount of computing resources.

Due to this, I suspect the GSC API sometimes limits the data it returns based on our request body parameters. With this PR, we hope to run regression tests and see whether we get an increased number of records by removing fields that are unnecessary for this API request.

Review guide

User Impact

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

agarctfi commented Feb 11, 2025

The first approach worked locally (sending standalone API requests without syncs), but I hadn't tested reading the streams. After nullifying start_row, next_page_token failed, causing errors and missing records.

Instead of nullifying these fields, I'm now testing complete removal to replicate the request exactly as it appears in the docs. Local read tests did produce records, so hopefully this solves the issue.
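The pagination breakage mentioned above comes from GSC's offset-based paging, which advances startRow by rowLimit each page. The sketch below is a hypothetical illustration of that mechanism (not the connector's next_page_token implementation): with startRow nullified, the offset arithmetic fails.

```python
from typing import Optional

def next_page_token(prev_body: dict, rows_returned: int) -> Optional[dict]:
    # GSC pages with startRow/rowLimit: a full page means there may be more
    # rows, so advance the offset; a short page means we are done.
    row_limit = prev_body.get("rowLimit", 25000)
    if rows_returned < row_limit:
        return None
    # If startRow was nullified (None), this addition raises TypeError,
    # which matches the failure described above.
    return {"startRow": prev_body["startRow"] + row_limit}
```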

agarctfi commented on the diff:

@@ -407,6 +407,7 @@ def stream_slices(
         for keyword in keywords:
             filters = {"dimension": "searchAppearance", "operator": "equals", "expression": keyword}
             stream_slice["dimensionFilterGroups"] = [{"groupType": "and", "filters": filters}]
+            stream_slice["dimensions"] = self.dimensions
Based on the GSC docs, the second request should be formatted like this:

{
  "startDate": "2018-05-01",
  "endDate": "2018-05-31",
  "type": "web",
  "dimensions": [
    "device" // and/or page, country, ...
  ],
  "dimensionFilterGroups": [
    {
      "filters": [
        {
          "dimension": "searchAppearance",
          "operator": "equals",
          "expression": "AMP_BLUE_LINK"
        }
      ]
    }
  ]
}

But in our case, we were sending the following:

{
    "site_url": "https%3A%2F%2Fairbyte.io%2F",
    "search_type": "web",
    "start_date": "2024-09-05",
    "end_date": "2024-09-07",
    "data_state": "final",
    "dimensionFilterGroups": [
        {
            "groupType": "and",
            "filters": {
                "dimension": "searchAppearance",
                "operator": "equals",
                "expression": "PRODUCT_SNIPPETS"
            }
        }
    ]
}

I'm hoping that by adding the dimensions defined for these streams at the top level of the stream slice, we will get the missing records reported by the users.
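The effect of the one-line diff above can be sketched as a pure function over a stream slice. This is an illustrative sketch with assumed names (build_second_request_slice is not in the connector); the filter shape mirrors the diff, where "filters" holds a single dict rather than the list shown in the docs example.

```python
def build_second_request_slice(stream_slice: dict, keyword: str, dimensions: list[str]) -> dict:
    # Mirrors the fixed stream_slices body: attach the filter group for one
    # searchAppearance key AND the stream's dimensions at the top level of
    # the slice, matching the second-request format from the GSC docs.
    filters = {"dimension": "searchAppearance", "operator": "equals", "expression": keyword}
    out = dict(stream_slice)
    out["dimensionFilterGroups"] = [{"groupType": "and", "filters": filters}]
    out["dimensions"] = dimensions  # the line the diff adds
    return out
```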

Labels: area/connectors · area/documentation · connectors/source/google-search-console