-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
QA engine: add adoption metrics to the QA report #21917
QA engine: add adoption metrics to the QA report #21917
Conversation
cff4b3b
to
9a4225e
Compare
@@ -50,6 +49,3 @@ def fetch_adoption_metrics_per_connector_version() -> pd.DataFrame: | |||
"total_syncs_count", | |||
"sync_success_rate", | |||
]] | |||
|
|||
CLOUD_CATALOG = fetch_remote_catalog(CLOUD_CATALOG_URL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed these constants from the module because they cause unnecessary network call on module import. I preferred to use dependency injections and test fixtures to expose these (direct calls to the fetch_remote_catalog
function.
9a4225e
to
15a056b
Compare
Airbyte Code Coverage
|
Switching it to draft because I'm considering changing to storage backend to bigquery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems strainght forward to me.
Im going to hold off approving for the first week or so though 😛
But for my own understanding. What is the purpose of the
QA report dataframe
?
@@ -20,7 +19,7 @@ def url_is_reachable(url: str) -> bool: | |||
def is_appropriate_for_cloud_use(definition_id: str) -> bool: | |||
return definition_id not in INAPPROPRIATE_FOR_CLOUD_USE_CONNECTORS | |||
|
|||
def get_qa_report(enriched_catalog: pd.DataFrame) -> pd.DataFrame: | |||
def get_qa_report(enriched_catalog: pd.DataFrame, oss_catalog_length: int) -> pd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always a fan of bring only the data you need 👍
) | ||
main.validations.get_qa_report.assert_called_with( | ||
main.enrichments.get_enriched_catalog.return_value, | ||
3 # len of the "oss" string... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you do this instead?
3 # len of the "oss" string... | |
len(main.enrichments.get_enriched_catalog.return_value) |
f337ef5
to
95dd606
Compare
After discussion with @evantahler the adoption metrics should be considered private. The storage persistence logic is going to change soon following discussions with @airbytehq/airbyte-analytics :
In the meantime we can continue the work planned on this project:
|
@bechurch this dataframe will be used to spot which connectors not on Airbyte cloud are eligible for it according to the QA flags we computed. |
|
||
from .models import ConnectorQAReport | ||
|
||
def persist_qa_report(qa_report: pd.DataFrame, path: str, public_fields_only: bool =True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not called as I disabled GCS persistence in main
for safety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 I assume we'll usually persist with all of the info, but can use the public-only if we plan to use it in the public repo anywhere
enriched_catalog.columns = enriched_catalog.columns.str.replace( | ||
"(?<=[a-z])(?=[A-Z])", "_", regex=True | ||
).str.lower() # column names to snake case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: might be nice to move this into a function instead of needing comment to explain what its doing:
enriched_catalog.columns = enriched_catalog.columns.to_snake_case()
@@ -33,10 +38,13 @@ def get_enriched_catalog(oss_catalog: pd.DataFrame, cloud_catalog: pd.DataFrame) | |||
enriched_catalog.columns = enriched_catalog.columns.str.replace( | |||
"(?<=[a-z])(?=[A-Z])", "_", regex=True | |||
).str.lower() # column names to snake case | |||
enriched_catalog = enriched_catalog[[c for c in enriched_catalog.columns if "_del" not in c]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: I took a minute to understand why we were removing these - maybe using the suffix _cloud
would make it clear that when merging the two, we prefer the OSS column and drop the cloud one
PUBLIC_FIELD = Field(..., is_public=True) | ||
PRIVATE_FIELD = Field(..., is_public=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
|
||
from .models import ConnectorQAReport | ||
|
||
def persist_qa_report(qa_report: pd.DataFrame, path: str, public_fields_only: bool =True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 I assume we'll usually persist with all of the info, but can use the public-only if we plan to use it in the public repo anywhere
c53fb6f
to
dd33060
Compare
46f8662
to
a687435
Compare
What
Closes #21721
We want to have adoption metrics per connector in the QA report:
We implemented the logic to fetch this data in a previous PR.
This PR adds this data to the QA report dataframe but does not persist it to GCS as we consider this data private.