
Destination BigQuery (normalization): Investigate "pickling client objects is not supported" error #17327

Closed
edgao opened this issue Sep 28, 2022 · 23 comments
Labels
team/destinations Destinations team's backlog type/bug Something isn't working

Comments

@edgao
Contributor

edgao commented Sep 28, 2022

see https://github.com/airbytehq/oncall/issues/417

Normalization fails with this error:

2022-09-20 18:26:17 normalization > Unhandled error while executing model.<redacted>.<redacted>
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
2022-09-20 18:26:17 normalization > 2 of 2 ERROR creating incremental model <redacted>.<redacted>............................................... [ERROR in 0.64s]

Enabling GCS staging on the bigquery destination config apparently helps, but it's unclear why that would change anything. Possibly actually a permissions problem, since they also switched the dataset ID.


my original theorizing: (https://github.com/airbytehq/oncall/issues/417#issuecomment-1226588642)

This error is pretty weird - I suspect it's actually hiding some other error message, which we don't have a great way to recover. That would explain why resetting streams resolved it. (ref: similar-looking issue on a completely unrelated github repo PrefectHQ/prefect#2178; the tl;dr is that Google's bigquery library returns an error object that behaves weirdly, so it gets logged in a useless way)
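To illustrate the suspected mechanism: google-cloud client objects refuse to be pickled by design, so if dbt (or anything else) pickles or deep-copies an exception or result that holds a reference to a bigquery.Client, the original failure gets swallowed and this complaint surfaces instead. A minimal sketch of the symptom, not Airbyte or dbt code; the project name and anonymous credentials are placeholders:

    import pickle

    from google.auth.credentials import AnonymousCredentials
    from google.cloud import bigquery

    # Hypothetical client; anonymous credentials just avoid needing a real key file.
    client = bigquery.Client(project="example-project", credentials=AnonymousCredentials())

    try:
        # Anything that ends up pickling an object carrying a client reference
        # reproduces the exact error seen in the normalization logs above.
        pickle.dumps(client)
    except pickle.PicklingError as err:
        print(err)
        # Pickling client objects is explicitly not supported.
        # Clients have non-trivial state that is local and unpickleable.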

@edgao edgao added type/bug Something isn't working needs-triage team/destinations Destinations team's backlog labels Sep 28, 2022
@noviceanalyst

Hi,

I am getting this exact error while trying to connect HubSpot to BigQuery for the first time. Specifically, I get errors that look like this for any table I try to transfer, whether contacts, campaigns, emails, etc.

Unhandled error while executing model.airbyte_utils.contacts
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
1 of 2 ERROR creating table model Marketing.contacts.................................................................... [ERROR in 0.43s]
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
Unhandled error while executing model.airbyte_utils.contacts
Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
1 of 2 ERROR creating table model Marketing.contacts.................................................................... [ERROR in 0.43s]
Pickling client objects is explicitly not supported.

When we sync with raw JSON instead of tabular normalization, the records do transfer to BigQuery as JSON entries in the resulting table. I am new to Airbyte, so any suggestions in the right direction are welcome.

@edgao
Contributor Author

edgao commented Oct 14, 2022

can you post the full log file from the sync? I was looking into this recently and it looked like the source actually failed, which led to some _raw tables not being created (https://github.com/airbytehq/airbyte-internal-issues/issues/1038) - would be great to confirm/deny that theory.

@noviceanalyst

noviceanalyst commented Oct 14, 2022

Hi @edgao, please find attached the error log.

huspot_bq_errorlog.txt
I notice that I am able to successfully move data from the source to the destination if it is in JSON format. Using the normalized tabular data format in Airbyte leads to an unsuccessful data transfer. Not sure if this helps.

@edgao
Contributor Author

edgao commented Oct 18, 2022

interesting - so it sounds like e.g. Marketing._airbyte_raw_contact_lists exists and is non-empty? (the sync in your logs produced 0 records, I'm guessing there was a previous sync that did actually process data?)

That's certainly different behavior from what I ran into. Doesn't seem like a dataset ID mismatch either (destination and normalization both referenced the Marketing dataset) so presumably normalization is looking for raw tables in the right location.

@killthekitten
Contributor

We're getting the same error on two very different connections: PostgreSQL CDC -> BigQuery and Stripe -> BigQuery.

What's interesting is that identical connections run just fine when not using the GCS staging area. Previously we hadn't used the staging area at all, so it might be related to the staging bucket or service account configuration.

While I wouldn't be comfortable sharing the entire log, I can share the parts of it that are relevant to the investigation if needed. For example, I've seen this happening a lot in the log (but not with all streams):

2022-10-27 12:33:52 destination > For more details see Big Query Error collection: BigQueryError{reason=accessDenied, location=null, message=Access Denied: File gs://<staging_area_name_redacted>/<name_redacted>/<name_redacted>_checkout_sessions/2022/10/27/03/c4a35eda-05aa-4703-b03f-6ab1d223649d/1.avro: Access Denied}:
2022-10-27 12:33:52 destination > 2022-10-27 12:33:52 INFO i.a.i.d.b.BigQueryGcsOperations(cleanUpStage):152 - Deleting staging files for stream checkout_sessions (dataset <name_redacted>): [1.avro]
2022-10-27 12:33:52 destination > 2022-10-27 12:33:52 INFO i.a.i.d.s.S3StorageOperations(cleanUpBucketObject):290 - Storage bucket <name_redacted>/<name_redacted>_checkout_sessions has been cleaned-up (0 objects were deleted)...
2022-10-27 12:33:52 destination > 2022-10-27 12:33:52 ERROR i.a.i.d.b.BufferedStreamConsumer(close):199 - Close failed.
2022-10-27 12:33:52 destination > java.lang.RuntimeException: Failed to upload data from stage <name_redacted>/<name_redacted>_checkout_sessions/2022/10/27/03/c4a35eda-05aa-4703-b03f-6ab1d223649d/

@killthekitten
Contributor

killthekitten commented Oct 27, 2022

^ Update, it was indeed a misconfiguration. There were two service accounts involved:

  • Default service account; can access BigQuery, can't access the staging bucket
  • Custom staging area service account; can't access BigQuery, can access the bucket

The destination was configured to use the default account (no JSON uploaded) and the HMAC key of the bucket account. I figure this means the data could be written to the bucket, but couldn't be loaded to BigQuery.

I fixed this by adding the bucket account JSON to the destination and updating the permissions of that account to also have access to BigQuery.

So, in this case the documentation for the connector was a bit ambiguous and should probably be more explicit about generating HMAC keys in different scenarios. My gut feeling was not to add more permissions to the default account, so I created a new one.

@edgao
Contributor Author

edgao commented Oct 27, 2022

@killthekitten nice debugging, and thanks for the detailed report! There's also an issue open to improve check behavior, because the connector should catch that sort of thing when you hit the save+retest button (#16992 (comment))

just to be super explicit - once you solved the permissions problems, your syncs were able to run successfully?

@killthekitten
Contributor

@edgao that's right, the sync completed correctly and the errors are not present in the log.

Thanks for referencing the check discussion; it actually helped me diagnose the issue, I just forgot to mention it.

@edgao
Contributor Author

edgao commented Oct 28, 2022

got it. I think the general statement is that the service account needs to have both BigQuery write and GCS read permissions - if GCP didn't require GCS read permissions, then malicious actors could circumvent GCS permissioning by running a BigQuery load operation. I'll update our docs to make that more explicit.
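For anyone hitting the accessDenied variant above, a quick way to sanity-check a key outside of Airbyte is to try both halves of the load with the same credentials. This is only an illustrative sketch, not part of the connector; the key file, bucket, and dataset names are placeholders:

    from google.cloud import bigquery, storage
    from google.oauth2 import service_account

    # Placeholders: substitute the key file, bucket, and dataset from your own setup.
    creds = service_account.Credentials.from_service_account_file("service-account.json")
    bucket_name, dataset = "my-staging-bucket", "Marketing"

    # 1. Stage a tiny probe file. Done here with the same credentials for simplicity;
    #    in Airbyte the upload itself goes through the HMAC key instead.
    gcs = storage.Client(credentials=creds, project=creds.project_id)
    gcs.bucket(bucket_name).blob("airbyte_probe.csv").upload_from_string("id\n1\n")

    # 2. Load that file from GCS into BigQuery using the account from the JSON key.
    #    This is the step that fails with reason=accessDenied when the account
    #    has BigQuery write access but no GCS read access.
    bq = bigquery.Client(credentials=creds, project=creds.project_id)
    load_job = bq.load_table_from_uri(
        f"gs://{bucket_name}/airbyte_probe.csv",
        f"{dataset}.airbyte_permissions_probe",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
        ),
    )
    load_job.result()  # raises if either GCS read or BigQuery write is missing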

@killthekitten
Contributor

killthekitten commented Oct 28, 2022

@edgao agreed! The confusing part is that service accounts are mentioned in multiple places, in different contexts.

In the pre-requisites section:

(Required for Airbyte Cloud; Optional for Airbyte Open Source) A Google Cloud Service Account with the BigQuery User and BigQuery Data Editor roles and the Service Account Key in JSON format.

In the storage bucket guide:

  1. Create an HMAC key and access ID.
  2. Grant the Storage Object Admin role to the Google Cloud Service Account.

And in the connector setup guide:

For Service Account Key JSON (Required for cloud, optional for open-source), enter the Google Cloud Service Account Key in JSON format.

The bit from the storage bucket guide is especially confusing (and probably dangerous), because when I followed the Create HMAC key and access ID guide, I actually created a second service account at the end of the GCS interoperability settings flow. It didn't occur to me that I could specify the service account already used by Airbyte (we run our instance on GCP):

[screenshot of the GCS interoperability settings flow]

I'd say the doc needs a bit of restructuring and deserves a separate section on service accounts and their possible combinations. The reader might have either one or two service accounts configured, and whether or not they've attached the JSON in the destination settings determines how that combination behaves.

@noviceanalyst

interesting - so it sounds like e.g. Marketing._airbyte_raw_contact_lists exists and is non-empty? (the sync in your logs produced 0 records, I'm guessing there was a previous sync that did actually process data?)

That's certainly different behavior from what I ran into. Doesn't seem like a dataset ID mismatch either (destination and normalization both referenced the Marketing dataset) so presumably normalization is looking for raw tables in the right location.

@edgao I did create the dataset ID, but not the tables. I assumed that Airbyte would create the required tables from the JSON files. Is this assumption wrong?

@ryjabe

ryjabe commented Nov 14, 2022

Just wanted to add another data point here. Was working with a user that ran into this issue as well. Their logs are attached.

Logs:
first_attempt.txt.zip

@noviceanalyst

@edgao I was able to resolve my issue with the tabular normalization today. I am not sure what really changed, but I created a new destination connection to BigQuery and made sure that the location I set up in BQ and in Airbyte were the same (for example, both set to us-central1); just picking US in Airbyte was throwing a destination error. The GCS bucket name and bucket path were also confusing for me in the beginning, but this time I did not create a folder in the bucket, so the bucket name and the bucket path are the same. Finally, I got a new service account key JSON for the bucket project I created.

I had done all of this in the past, so I cannot point out specifically what solved the issue. I started out with just 2 streams (raw JSON), and when that sync succeeded I tried the normalization and it worked. If I figure out what made the sync work this time around, I will be sure to return and share my findings. Thank you @edgao for taking the time to read the error log.

Regards,
Noviceanalyst

@bloodyburger

I get the same error, "Pickling client objects is explicitly not supported.", when moving data from PostgreSQL to BigQuery in Standard mode.

@haithem-souala
Contributor

+1
MSSQL -> BQ (GCS staging)

@goldengrisha

Using Raw data (JSON) in the transformation tab solved the issue.

@edgao
Contributor Author

edgao commented Nov 13, 2023

we've released bigquery under destinations v2, which replaces normalization. DV2 doesn't use dbt under the hood, so we'll never see this error again. Closing this issue.

@edgao edgao closed this as not planned Nov 13, 2023
@PaxonF

PaxonF commented Nov 20, 2023

we've released bigquery under destinations v2, which replaces normalization. DV2 doesn't use dbt under the hood, so we'll never see this error again. Closing this issue.

@edgao I'm on the latest BigQuery Destination Connector (2.3.16) and still get the "pickling client objects is not supported" error on quite a few of my source connectors...

Is it still in the process of being released, or does it only work with some sources? (I'm using Airbyte Open Source, btw)

@edgao
Contributor Author

edgao commented Nov 20, 2023

that's surprising - normalization shouldn't launch at all with bigquery 2.x. What version of platform are you using? DV2 destinations require platform version 0.50.24+

@PaxonF

PaxonF commented Nov 20, 2023

I just upgraded to 0.50.34

@PaxonF

PaxonF commented Nov 20, 2023

@edgao Here are some example logs from Postgres to BigQuery, both on the latest versions of their connectors too:
4f5991c5_c655_4d64_b680_583b3ae5e02e_job_22846_attempt_5_txt.txt

@PaxonF

PaxonF commented Nov 21, 2023

I realize now that this was failing because the "Normalized Tabular Data" option was checked. Switching to "Raw data (JSON)" fixes the issue AND still results in normalized tables as needed, as long as I delete the schema and do a hard reset (even though the settings are a bit confusing). So, user error on my part.

@edgao
Contributor Author

edgao commented Nov 21, 2023

interesting - I'd thought we turned off the ability to use normalization with bigquery 2.x. Glad you figured it out, but I've got some more investigation to do now :)

(so... user error sort of, but ideally we wouldn't give you the option at all)
