QA engine: add adoption metrics to the QA report #21917

Merged
alafanechere merged 10 commits into master from augustin/qa-engine/adoption-metrics-in-qa-report on Jan 27, 2023

Conversation

@alafanechere alafanechere commented Jan 26, 2023

What

Closes #21721
We want to have adoption metrics per connector in the QA report:

  • number_of_connections
  • number_of_users
  • sync_success_rate
  • total_syncs_count
  • failed_syncs_count
  • succeeded_syncs_count

We implemented the logic to fetch this data in a previous PR.
This PR adds this data to the QA report dataframe but does not persist it to GCS as we consider this data private.
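To make the relationship between these fields concrete, here is a minimal sketch. It assumes that total_syncs_count is the sum of succeeded and failed syncs and that sync_success_rate is the share of succeeded syncs; the PR does not state these definitions, and the AdoptionMetrics class is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdoptionMetrics:
    """Hypothetical per-connector record; field names follow the list above."""

    number_of_connections: int
    number_of_users: int
    succeeded_syncs_count: int
    failed_syncs_count: int

    @property
    def total_syncs_count(self) -> int:
        # Assumption: every sync either succeeded or failed.
        return self.succeeded_syncs_count + self.failed_syncs_count

    @property
    def sync_success_rate(self) -> Optional[float]:
        # Assumption: rate = succeeded / total; undefined when nothing was synced.
        if self.total_syncs_count == 0:
            return None
        return self.succeeded_syncs_count / self.total_syncs_count


metrics = AdoptionMetrics(
    number_of_connections=4,
    number_of_users=2,
    succeeded_syncs_count=9,
    failed_syncs_count=1,
)
```

Returning None for a connector with no syncs avoids a division by zero and keeps "never synced" distinct from "always failing".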

@alafanechere alafanechere force-pushed the augustin/qa-engine/adoption-metrics-in-qa-report branch from cff4b3b to 9a4225e Compare January 26, 2023 14:57
@@ -50,6 +49,3 @@ def fetch_adoption_metrics_per_connector_version() -> pd.DataFrame:
 "total_syncs_count",
 "sync_success_rate",
 ]]
-
-CLOUD_CATALOG = fetch_remote_catalog(CLOUD_CATALOG_URL)
Contributor Author

I removed these constants from the module because they caused an unnecessary network call on module import. I preferred to use dependency injection and test fixtures to expose these values (through direct calls to the fetch_remote_catalog function).
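A minimal sketch of the idea, not the actual module: the fetch function is passed in as a parameter, so importing the module performs no network call and tests can substitute canned data. OSS_CATALOG_URL, the example.com URLs, summarize_catalogs and fake_fetch_catalog are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

Catalog = List[Dict[str, str]]

# Pattern being removed (kept as a comment on purpose): this would run at
# import time, so merely importing the module triggers an HTTP request.
# CLOUD_CATALOG = fetch_remote_catalog(CLOUD_CATALOG_URL)

OSS_CATALOG_URL = "https://example.com/oss_catalog.json"      # placeholder URL
CLOUD_CATALOG_URL = "https://example.com/cloud_catalog.json"  # placeholder URL


def summarize_catalogs(fetch_catalog: Callable[[str], Catalog]) -> Dict[str, int]:
    """Fetch both catalogs through the injected callable and count their entries."""
    oss_catalog = fetch_catalog(OSS_CATALOG_URL)
    cloud_catalog = fetch_catalog(CLOUD_CATALOG_URL)
    return {"oss": len(oss_catalog), "cloud": len(cloud_catalog)}


def fake_fetch_catalog(url: str) -> Catalog:
    """Test double returning canned data instead of hitting the network."""
    canned = {
        OSS_CATALOG_URL: [{"name": "source-a"}, {"name": "source-b"}],
        CLOUD_CATALOG_URL: [{"name": "source-a"}],
    }
    return canned[url]


summary = summarize_catalogs(fake_fetch_catalog)
```

In production code the real fetch function would be passed in the same way; in a test suite the fake could be exposed through a fixture.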

@alafanechere alafanechere force-pushed the augustin/qa-engine/adoption-metrics-in-qa-report branch from 9a4225e to 15a056b Compare January 26, 2023 15:02
@alafanechere alafanechere marked this pull request as ready for review January 26, 2023 15:02
@alafanechere alafanechere requested a review from a team January 26, 2023 15:03
github-actions bot commented Jan 26, 2023

Airbyte Code Coverage

There is no coverage information present for the Files changed

Total Project Coverage 24%

@alafanechere
Contributor Author

Switching this to draft because I'm considering changing the storage backend to BigQuery.

@alafanechere alafanechere marked this pull request as draft January 26, 2023 19:18
Contributor

@bnchrch bnchrch left a comment

Seems straightforward to me.

I'm going to hold off approving for the first week or so though 😛

But for my own understanding: what is the purpose of the QA report dataframe?

@@ -20,7 +19,7 @@ def url_is_reachable(url: str) -> bool:
 def is_appropriate_for_cloud_use(definition_id: str) -> bool:
     return definition_id not in INAPPROPRIATE_FOR_CLOUD_USE_CONNECTORS

-def get_qa_report(enriched_catalog: pd.DataFrame) -> pd.DataFrame:
+def get_qa_report(enriched_catalog: pd.DataFrame, oss_catalog_length: int) -> pd.DataFrame:
Contributor

Always a fan of bringing only the data you need 👍

    )
    main.validations.get_qa_report.assert_called_with(
        main.enrichments.get_enriched_catalog.return_value,
        3  # len of the "oss" string...
Contributor

Could you do this instead?

Suggested change
-        3  # len of the "oss" string...
+        len(main.enrichments.get_enriched_catalog.return_value)
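The general idea behind this kind of suggestion, deriving the expected argument from the test input instead of hard-coding a magic number, can be sketched as follows. This is an illustration only: run and the three injected callables are hypothetical stand-ins, not the QA engine's real main module, and the expected length here is taken from the fake OSS catalog.

```python
from unittest import mock


def run(fetch_catalog, get_enriched_catalog, get_qa_report):
    """Hypothetical stand-in for the QA engine entry point."""
    oss_catalog = fetch_catalog("oss")
    cloud_catalog = fetch_catalog("cloud")
    enriched_catalog = get_enriched_catalog(oss_catalog, cloud_catalog)
    return get_qa_report(enriched_catalog, len(oss_catalog))


def check_run_passes_catalog_length() -> bool:
    fake_oss_catalog = ["connector-a", "connector-b", "connector-c"]
    fetch_catalog = mock.Mock(return_value=fake_oss_catalog)
    get_enriched_catalog = mock.Mock()
    get_qa_report = mock.Mock()

    run(fetch_catalog, get_enriched_catalog, get_qa_report)

    # The expected length is computed from the fake input, so the assertion
    # stays correct if the fake catalog changes.
    get_qa_report.assert_called_with(
        get_enriched_catalog.return_value,
        len(fake_oss_catalog),
    )
    return True
```

assert_called_with raises AssertionError on a mismatch, so the helper only returns True when the call arguments match.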

@alafanechere alafanechere force-pushed the augustin/qa-engine/adoption-metrics-in-qa-report branch from f337ef5 to 95dd606 Compare January 27, 2023 09:52
@alafanechere
Contributor Author

After discussion with @evantahler the adoption metrics should be considered private.
I made the required changes not to persist these fields in our public GCS bucket and disabled GCS persistence for extra safety.

The storage persistence logic is going to change soon, following discussions with @airbytehq/airbyte-analytics:

  • We might store the QA report without adoption metrics in GCS and ingest this data into BigQuery.
  • The adoption metrics will be computed by a dbt model rather than by the QA engine.
  • The QA report consumers will query BigQuery to get the QA report with adoption metrics instead of consuming a JSON file.

In the meantime we can continue the work planned on this project:

  • The QA report with adoption metrics can be consumed in memory. The Cloud Availability Updater will call get_qa_report instead of reading a file from GCS.

@alafanechere alafanechere marked this pull request as ready for review January 27, 2023 10:05
@alafanechere
Contributor Author

What is the purpose of the QA report dataframe?

@bnchrch this dataframe will be used to spot which connectors that are not yet on Airbyte Cloud are eligible for it, according to the QA flags we computed.

from .models import ConnectorQAReport

def persist_qa_report(qa_report: pd.DataFrame, path: str, public_fields_only: bool = True):
Contributor Author

@alafanechere alafanechere Jan 27, 2023

This is not called as I disabled GCS persistence in main for safety.

Contributor

👍🏻 I assume we'll usually persist with all of the info, but can use the public-only if we plan to use it in the public repo anywhere

Comment on lines 38 to 40
    enriched_catalog.columns = enriched_catalog.columns.str.replace(
        "(?<=[a-z])(?=[A-Z])", "_", regex=True
    ).str.lower()  # column names to snake case
Contributor

Nit: it might be nice to move this into a function instead of needing a comment to explain what it's doing:

enriched_catalog.columns = enriched_catalog.columns.to_snake_case()
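A pandas Index has no built-in to_snake_case method, so the nit would need a small helper. One possible sketch, reusing the same regular expression as the diff; the function name to_snake_case is an assumption borrowed from the suggestion above:

```python
import re

# Zero-width boundary between a lowercase letter and the uppercase letter
# that follows it, e.g. between "e" and "D" in "sourceDefinitionId".
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Convert a camelCase column name to snake_case."""
    return _CAMEL_BOUNDARY.sub("_", name).lower()


column_names = ["sourceDefinitionId", "dockerImageTag", "name"]
snake_case_names = [to_snake_case(c) for c in column_names]
```

With a dataframe, the call site could then read enriched_catalog.columns = [to_snake_case(c) for c in enriched_catalog.columns], which no longer needs an explanatory comment.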

@@ -33,10 +38,13 @@ def get_enriched_catalog(oss_catalog: pd.DataFrame, cloud_catalog: pd.DataFrame)
    enriched_catalog.columns = enriched_catalog.columns.str.replace(
        "(?<=[a-z])(?=[A-Z])", "_", regex=True
    ).str.lower()  # column names to snake case
    enriched_catalog = enriched_catalog[[c for c in enriched_catalog.columns if "_del" not in c]]
Contributor

Optional: it took me a minute to understand why we were removing these. Maybe using the suffix _cloud would make it clear that when merging the two, we prefer the OSS column and drop the cloud one.

Comment on lines +21 to +22
PUBLIC_FIELD = Field(..., is_public=True)
PRIVATE_FIELD = Field(..., is_public=False)
Contributor

Nice!
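How a per-field is_public flag can drive the public_fields_only switch is sketched below. To stay dependency-free this sketch swaps in standard-library dataclasses with field metadata in place of the Field(...) declarations shown in the diff; ConnectorReportRow, public_field_names and to_record are hypothetical names, not the PR's ConnectorQAReport model.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Dict, List

PUBLIC = {"is_public": True}
PRIVATE = {"is_public": False}


@dataclass
class ConnectorReportRow:
    """Hypothetical report row; adoption metrics are marked private."""

    connector_name: str = field(metadata=PUBLIC)
    is_on_cloud: bool = field(metadata=PUBLIC)
    number_of_users: int = field(metadata=PRIVATE)
    sync_success_rate: float = field(metadata=PRIVATE)


def public_field_names(model: type) -> List[str]:
    """Names of the fields flagged as public; unflagged fields count as private."""
    return [f.name for f in fields(model) if f.metadata.get("is_public", False)]


def to_record(row: ConnectorReportRow, public_fields_only: bool = True) -> Dict[str, Any]:
    """Serialize a row, dropping private fields unless explicitly requested."""
    record = asdict(row)
    if public_fields_only:
        allowed = set(public_field_names(type(row)))
        record = {key: value for key, value in record.items() if key in allowed}
    return record


row = ConnectorReportRow("source-a", True, 12, 0.97)
public_record = to_record(row)
```

Treating unflagged fields as private and defaulting public_fields_only to True means a newly added field is not exposed until someone marks it public.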

@alafanechere alafanechere force-pushed the augustin/qa-engine/adoption-metrics-in-qa-report branch from c53fb6f to dd33060 Compare January 27, 2023 15:57
@alafanechere alafanechere force-pushed the augustin/qa-engine/adoption-metrics-in-qa-report branch from 46f8662 to a687435 Compare January 27, 2023 16:02
@alafanechere alafanechere enabled auto-merge (squash) January 27, 2023 16:29
@alafanechere alafanechere merged commit 3a81ffc into master Jan 27, 2023
@alafanechere alafanechere deleted the augustin/qa-engine/adoption-metrics-in-qa-report branch January 27, 2023 16:46
Development

Successfully merging this pull request may close these issues.

QA engine: Get adoption and sync success rate for connectors from our Datawarehouse
4 participants