
Destination BigQuery (normalization): Investigate "pickling client objects is not supported" error #17327

Closed
edgao opened this issue Sep 28, 2022 · 23 comments
Labels
team/destinations Destinations team's backlog type/bug Something isn't working

Comments

@edgao
Contributor

edgao commented Sep 28, 2022

see https://github.com/airbytehq/oncall/issues/417

Normalization fails with this error:

2022-09-20 18:26:17 normalization > Unhandled error while executing model.<redacted>.<redacted>
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
2022-09-20 18:26:17 normalization > 2 of 2 ERROR creating incremental model <redacted>.<redacted>............................................... [ERROR in 0.64s]

Enabling GCS staging on the bigquery destination config apparently helps, but it's unclear why that would change anything. Possibly actually a permissions problem, since they also switched the dataset ID.


my original theorizing: (https://github.com/airbytehq/oncall/issues/417#issuecomment-1226588642)

This error is pretty weird - I suspect it's actually hiding some other error message, which we don't have a great way to recover. That would explain why resetting streams resolved it. (ref: similar-looking issue on a completely unrelated github repo PrefectHQ/prefect#2178; the tl;dr is that Google's bigquery library returns an error object that behaves weirdly, so it gets logged in a useless way)
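To illustrate the suspected mechanism: google-cloud client objects refuse to be pickled by design, so if dbt (or anything else) pickles or deep-copies an exception or result that holds a reference to a bigquery.Client, the original failure gets swallowed and this complaint surfaces instead. A minimal sketch of the symptom, not Airbyte or dbt code; the project name and anonymous credentials are placeholders:

    import pickle

    from google.auth.credentials import AnonymousCredentials
    from google.cloud import bigquery

    # Hypothetical client; anonymous credentials just avoid needing a real key file.
    client = bigquery.Client(project="example-project", credentials=AnonymousCredentials())

    try:
        # Anything that ends up pickling an object carrying a client reference
        # reproduces the exact error seen in the normalization logs above.
        pickle.dumps(client)
    except pickle.PicklingError as err:
        print(err)
        # Pickling client objects is explicitly not supported.
        # Clients have non-trivial state that is local and unpickleable.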

@edgao edgao added type/bug Something isn't working needs-triage team/destinations Destinations team's backlog labels Sep 28, 2022
@noviceanalyst

Hi,

I am getting this exact error while trying to connect HubSpot to BigQuery for the first time. Specifically, I get errors that look like this for any table I try to transfer, whether contacts, campaigns, emails, etc.

Unhandled error while executing model.airbyte_utils.contacts
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
1 of 2 ERROR creating table model Marketing.contacts.................................................................... [ERROR in 0.43s]
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
Unhandled error while executing model.airbyte_utils.contacts
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
1 of 2 ERROR creating table model Marketing.contacts.................................................................... [ERROR in 0.43s]
Pickling client objects is explicitly not supported.

When we sync with raw JSON instead of tabular normalization, the records do transfer to BigQuery as JSON entries in the resulting table. I am new to Airbyte, so any suggestions in the right direction are welcome.

@edgao
Contributor Author

edgao commented Oct 14, 2022

can you post the full log file from the sync? I was looking into this recently and it looked like the source actually failed, which led to some _raw tables not being created (https://github.com/airbytehq/airbyte-internal-issues/issues/1038) - would be great to confirm/deny that theory.

@noviceanalyst

noviceanalyst commented Oct 14, 2022

Hi @edgao, please find attached the error log.

huspot_bq_errorlog.txt
I notice that I am able to successfully move data from the source to the destination if it is in JSON format. Using the normalized tabular data format in Airbyte leads to an unsuccessful data transfer. Not sure if this helps.

@edgao
Contributor Author

edgao commented Oct 18, 2022

interesting - so it sounds like e.g. Marketing._airbyte_raw_contact_lists exists and is non-empty? (the sync in your logs produced 0 records, I'm guessing there was a previous sync that did actually process data?)

That's certainly different behavior from what I ran into. Doesn't seem like a dataset ID mismatch either (destination and normalization both referenced the Marketing dataset) so presumably normalization is looking for raw tables in the right location.

@killthekitten
Contributor

We're getting the same error on two very different connections: PostgreSQL CDC -> BigQuery and Stripe -> BigQuery.

What's interesting is that identical connections run just fine when not using the GCS staging area. Previously we hadn't used the staging area at all, so it might be related to the staging bucket or service account configuration.

While I wouldn't be comfortable sharing the entire log, I can share the parts of it that are relevant to the investigation if needed. For example, I've seen this happening a lot in the log (but not with all streams):

2022-10-27 12:33:52 destination > For more details see Big Query Error collection: BigQueryError{reason=accessDenied, location=null, message=Access Denied: File gs://<staging_area_name_redacted>/<name_redacted>/<name_redacted>_checkout_sessions/2022/10/27/03/c4a35eda-05aa-4703-b03f-6ab1d223649d/1.avro: Access Denied}:
2022-10-27 12:33:52 destination > 2022-10-27 12:33:52 INFO i.a.i.d.b.BigQueryGcsOperations(cleanUpStage):152 - Deleting staging files for stream checkout_sessions (dataset <name_redacted>): [1.avro]
2022-10-27 12:33:52 destination > 2022-10-27 12:33:52 INFO i.a.i.d.s.S3StorageOperations(cleanUpBucketObject):290 - Storage bucket <name_redacted>/<name_redacted>_checkout_sessions has been cleaned-up (0 objects were deleted)...
2022-10-27 12:33:52 destination > 2022-10-27 12:33:52 ERROR i.a.i.d.b.BufferedStreamConsumer(close):199 - Close failed.
2022-10-27 12:33:52 destination > java.lang.RuntimeException: Failed to upload data from stage <name_redacted>/<name_redacted>_checkout_sessions/2022/10/27/03/c4a35eda-05aa-4703-b03f-6ab1d223649d/

@killthekitten
Contributor

killthekitten commented Oct 27, 2022

^ Update, it was indeed a misconfiguration. There were two service accounts involved:

  • Default service account; can access BigQuery, can't access the staging bucket
  • Custom staging area service account; can't access BigQuery, can access the bucket

The destination was configured to use the default account (no JSON uploaded) and the HMAC key of the bucket account. I figure this means the data could be written to the bucket, but couldn't be loaded to BigQuery.

I fixed this by adding the bucket account JSON to the destination and updating the permissions of that account to also have access to BigQuery.

So, in this case the documentation for the connector was a bit ambiguous and should probably be more explicit about generating HMAC keys in different scenarios. My gut feeling was not to add more permissions to the default account, so I created a new one.

@edgao
Contributor Author

edgao commented Oct 27, 2022

@killthekitten nice debugging, and thanks for the detailed report! There's also an issue open to improve check behavior, because the connector should catch that sort of thing when you hit the save+retest button (#16992 (comment))

just to be super explicit - once you solved the permissions problems, your syncs were able to run successfully?

@killthekitten
Contributor

@edgao that's right, the sync completed correctly and the errors are not present in the log.

Thanks for referencing the check discussion; it actually helped me diagnose the issue, I just forgot to mention it.

@edgao
Contributor Author

edgao commented Oct 28, 2022

got it. I think the general statement is that the service account needs to have both BigQuery write and GCS read permissions - if GCP didn't require GCS read permissions, then malicious actors could circumvent GCS permissioning by running a BigQuery load operation. I'll update our docs to make that more explicit.
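For anyone hitting the accessDenied variant above, a quick way to sanity-check a key outside of Airbyte is to try both halves of the load with the same credentials. This is only an illustrative sketch, not part of the connector; the key file, bucket, and dataset names are placeholders:

    from google.cloud import bigquery, storage
    from google.oauth2 import service_account

    # Placeholders: substitute the key file, bucket, and dataset from your own setup.
    creds = service_account.Credentials.from_service_account_file("service-account.json")
    bucket_name, dataset = "my-staging-bucket", "Marketing"

    # 1. Stage a tiny probe file. Done here with the same credentials for simplicity;
    #    in Airbyte the upload itself goes through the HMAC key instead.
    gcs = storage.Client(credentials=creds, project=creds.project_id)
    gcs.bucket(bucket_name).blob("airbyte_probe.csv").upload_from_string("id\n1\n")

    # 2. Load that file from GCS into BigQuery using the account from the JSON key.
    #    This is the step that fails with reason=accessDenied when the account
    #    has BigQuery write access but no GCS read access.
    bq = bigquery.Client(credentials=creds, project=creds.project_id)
    load_job = bq.load_table_from_uri(
        f"gs://{bucket_name}/airbyte_probe.csv",
        f"{dataset}.airbyte_permissions_probe",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
        ),
    )
    load_job.result()  # raises if either GCS read or BigQuery write is missing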

@killthekitten
Contributor

killthekitten commented Oct 28, 2022

@edgao agreed! The confusing part is that service accounts are mentioned in multiple places, in different contexts.

In the pre-requisites section:

(Required for Airbyte Cloud; Optional for Airbyte Open Source) A Google Cloud Service Account with the BigQuery User and BigQuery Data Editor roles and the Service Account Key in JSON format.

In the storage bucket guide:

  1. Create an HMAC key and access ID.
  2. Grant the Storage Object Admin role to the Google Cloud Service Account.

And in the connector setup guide:

For Service Account Key JSON (Required for cloud, optional for open-source), enter the Google Cloud Service Account Key in JSON format.

The bit from the storage bucket guide is especially confusing (and probably dangerous), because when I followed the Create HMAC key and access ID guide, I actually created a second service account at the end of the GCS interoperability settings flow. It didn't occur to me that I could specify the service account already used by Airbyte (we run our instance on GCP):

[screenshot of the GCS interoperability settings flow]

I'd say the doc needs a bit of restructuring and deserves a separate section on service accounts and their possible combinations. The reader might have either one or two service accounts configured, and whether or not they've attached the JSON in the destination settings determines how that combination behaves.

@noviceanalyst

interesting - so it sounds like e.g. Marketing._airbyte_raw_contact_lists exists and is non-empty? (the sync in your logs produced 0 records, I'm guessing there was a previous sync that did actually process data?)

That's certainly different behavior from what I ran into. Doesn't seem like a dataset ID mismatch either (destination and normalization both referenced the Marketing dataset) so presumably normalization is looking for raw tables in the right location.

@edgao I did create the dataset ID, but not the tables. I assumed that Airbyte would create the required tables from the JSON files. Is this assumption wrong?

@ryjabe

ryjabe commented Nov 14, 2022

Just wanted to add another data point here. Was working with a user that ran into this issue as well. Their logs are attached.

Logs:
first_attempt.txt.zip

@noviceanalyst

@edgao I was able to resolve my issue with the tabular normalization today. I am not sure what really changed, but I created a new destination connection to BigQuery and made sure that the location I set up in BQ and in Airbyte were the same (for example, both set to us-central1); just picking US in Airbyte was throwing a destination error. The GCS bucket name and bucket path were also confusing for me in the beginning, but this time I did not create a folder in the bucket, so the bucket name and the bucket path are the same. Finally, I got a new service account key JSON for the bucket project I created.

I had done all of this in the past, so I cannot point out specifically what solved the issue. I started out with just 2 streams (raw JSON), and when that sync succeeded I tried the normalization and it worked. If I figure out what made the sync work this time around, I will be sure to return and share my findings. Thank you @edgao for taking the time to read the error log.

Regards,
Noviceanalyst

@bloodyburger

I get the same error, "Pickling client objects is explicitly not supported.", when moving data from PostgreSQL to BigQuery in Standard mode.

@haithem-souala
Contributor

+1
MSSQL -> BQ (GCS staging)

@goldengrisha

Using Raw data (JSON) in the transformation tab solved the issue.

@edgao
Contributor Author

edgao commented Nov 13, 2023

we've released bigquery under destinations v2, which replaces normalization. DV2 doesn't use dbt under the hood, so we'll never see this error again. Closing this issue.

@edgao edgao closed this as not planned Nov 13, 2023
@PaxonF

PaxonF commented Nov 20, 2023

we've released bigquery under destinations v2, which replaces normalization. DV2 doesn't use dbt under the hood, so we'll never see this error again. Closing this issue.

@edgao I'm on the latest BigQuery Destination Connector (2.3.16) and still get the "pickling client objects is not supported" error on quite a few of my source connectors...

Is it still in the process of being released, or does it only work with some sources? (I'm using Airbyte Open Source, btw)

@edgao
Contributor Author

edgao commented Nov 20, 2023

that's surprising - normalization shouldn't launch at all with bigquery 2.x. What version of platform are you using? DV2 destinations require platform version 0.50.24+

@PaxonF

PaxonF commented Nov 20, 2023

I just upgraded to 0.50.34

@PaxonF

PaxonF commented Nov 20, 2023

@edgao Here are some example logs from Postgres to BigQuery, both on the latest versions of their connectors too:
4f5991c5_c655_4d64_b680_583b3ae5e02e_job_22846_attempt_5_txt.txt

@PaxonF

PaxonF commented Nov 21, 2023

I realize now that this was failing because the "Normalized Tabular Data" option was checked. Switching to "Raw data (JSON)" fixes the issue AND still results in normalized tables as needed, as long as I delete the schema and do a hard reset (even though the settings are a bit confusing). So, user error on my part.

@edgao
Contributor Author

edgao commented Nov 21, 2023

interesting - I'd thought we turned off the ability to use normalization with bigquery 2.x. Glad you figured it out, but I've got some more investigation to do now :)

(so... user error sort of, but ideally we wouldn't give you the option at all)
