
load_table_from_json interpolates string as int #1228

Closed
ben-marengo-msmg opened this issue Apr 22, 2022 · 12 comments · Fixed by #1804
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@ben-marengo-msmg

Issue

Using load_table_from_json() to load data into a STRING column fails when a value can be interpreted as an int (e.g. "123").

See the code snippet below, which shows the inconsistency with the other data-loading APIs.

Environment details

  • OS type and version: macos 12.2.1
  • Python version: 3.9.5 (also errors on 3.8)
  • pip version: pip --version
  • google-cloud-bigquery version: 2.31.0 & 3.0.1

Code example

import io
import json

from google.api_core.exceptions import BadRequest
from google.cloud.bigquery import Client, Table, SchemaField

data = [{"col_1": "123"}]

project = 'my-project'
bq = Client(project)

table = Table(f'{project}.aaa.data_load_test', schema=[SchemaField("col_1", "STRING"), ])
bq.delete_table(table, not_found_ok=True)
bq.create_table(table)


# WORKS: bq load from byte stream
with io.StringIO() as fp:
    json.dump(data, fp)
    fp.seek(0)
    bq.load_table_from_file(fp, table).result()


# WORKS: streaming api
bq.insert_rows_json(table, data)


try:
    # FAILS: bq load from python dict.
    # "123" is interpreted as an int, even though it is supplied as a string
    bq.load_table_from_json(data, table).result()
except BadRequest as e:
    print("ERROR")
    print(e)


# WORKS: bq load from python dict, because "123d" cannot be interpreted as an int
bq.load_table_from_json([{"col_1": "123d"}], table).result()

Stack trace

Traceback (most recent call last):
  File "/Users/ben.marengo/code/scratch/bq_load.py", line 28, in <module>
    bq.load_table_from_json(data, table).result()
  File "/Users/ben.marengo/code/pyenvs/scratch/lib/python3.9/site-packages/google/cloud/bigquery/job/base.py", line 728, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/Users/ben.marengo/code/pyenvs/scratch/lib/python3.9/site-packages/google/api_core/future/polling.py", line 137, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table my-project:aaa.data_load_test. Field col_1 has changed type from STRING to INTEGER
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Apr 22, 2022
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Apr 23, 2022
@steffnay steffnay added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Apr 24, 2022
@steffnay steffnay self-assigned this May 2, 2022
@tswast
Contributor

tswast commented May 2, 2022

This is a tough one. One possibility would be to omit any local schema interpolation (if there is any, I forget) and use an autodetect schema, instead.

I suspect you should be able to work around this issue by manually supplying the correct schema in a job configuration object.

@tswast
Contributor

tswast commented May 6, 2022

New theory: I see we don't have any client-side schema generation. Instead, the client adds autodetect = True if no schema is supplied.

if job_config.schema is None:
    job_config.autodetect = True

Perhaps omitting this flag when the table already exists will help?

I suspect that the backend first attempts to generate a schema from the provided data and then compares that generated schema with the existing table. If this is the case, we should also file a backend issue (possibly in addition to adding a workaround in the client if the table already exists).

@miguel-graindata

I have been coming across the same issue when trying to load json files into BigQuery. I thought the problem was already fixed (#1248) but the issue is still around.

How can we request that it is solved? This is the code that I am using to load data.

job_config = LoadJobConfig(
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_APPEND",
    autodetect=True,
)

bq_table = f"{project_id}.{dataset_id}.{table_id}"

job = bqclient.load_table_from_json(json_data, bq_table, job_config=job_config)

@meredithslota
Contributor

#1248 was never merged, so perhaps that fix is still relevant.

Re: #1228 (comment) — if this turns out to be a backend issue we should absolutely file a bug with the correct team (and I'd propose not adding a client-side workaround since we'd have to undo it later and that might cause even more problems for folks).

@Linchin
Contributor

Linchin commented Jan 25, 2024

I think this is less of a backend bug, since we are sending conflicting information to the backend: autodetect shouldn't be true when the table already exists. The problem is that the client does not know whether the table exists without making an API call to verify it. Adding this call may make the method take longer, but it will reduce user confusion and time spent on errors like this. We could also update the documentation to ask people to make sure the local schema aligns with the existing table. WDYT? @tswast

@tswast
Contributor

tswast commented Jan 25, 2024

Adding this call may make the method take longer, but will reduce user confusion and time spent on errors like this.

We already do this in some other methods, such as list_rows:

# No schema, but no selected_fields. Assume the developer wants all
# columns, so get the table resource for them rather than failing.
elif len(schema) == 0:
    table = self.get_table(table.reference, retry=retry, timeout=timeout)
    schema = table.schema

Note: We should only make this call to get_table if all of the following are true:

  • The job config specifies "APPEND" mode (or is not set, as this is the default, I believe).
  • The job config does not already have a schema or autodetect manually set.
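
Those two conditions could be sketched as a small predicate (a hypothetical helper, not the actual client code; it assumes the LoadJobConfig attributes shown, which default to None):

```python
from types import SimpleNamespace


def should_fetch_table_schema(job_config):
    """True when load_table_from_json should call get_table first."""
    append_mode = job_config.write_disposition in (None, "WRITE_APPEND")
    nothing_set = job_config.schema is None and job_config.autodetect is None
    return append_mode and nothing_set


# Stand-in for a LoadJobConfig with no user-set fields: fetch the schema.
cfg = SimpleNamespace(write_disposition=None, schema=None, autodetect=None)
print(should_fetch_table_schema(cfg))  # True

# WRITE_TRUNCATE replaces the table, so there is no schema worth fetching.
cfg = SimpleNamespace(write_disposition="WRITE_TRUNCATE", schema=None, autodetect=None)
print(should_fetch_table_schema(cfg))  # False
```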

@Linchin
Contributor

Linchin commented Jan 25, 2024

Awesome, I'll go fix this issue.

@Linchin
Contributor

Linchin commented Jan 25, 2024

Truth table: the conditions under which the client should check table existence, and the autodetect value the client should set for the user.

| Schema provided (user) | Autodetect (user) | writeDisposition | Check table existence (client) | Autodetect (client) |
|---|---|---|---|---|
| T | T/F/not provided | * | F | same as user value |
| F | T/F | * | F | same as user value |
| F | not provided | WRITE_APPEND / not provided | T | F if table exists, otherwise T |
| F | not provided | WRITE_TRUNCATE / WRITE_EMPTY | F | T |
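
The truth table above can be expressed as a pure function (a sketch; table_exists stands in for the result of the proposed get_table call):

```python
def resolve_autodetect(schema_provided, autodetect, write_disposition, table_exists):
    """Return (check_table_existence, effective_autodetect) per the truth table."""
    if schema_provided or autodetect is not None:
        # Rows 1-2: the user made a choice; respect it, no existence check.
        return False, autodetect
    if write_disposition in (None, "WRITE_APPEND"):
        # Row 3: appending with nothing set; autodetect only for a new table.
        return True, not table_exists
    # Row 4: WRITE_TRUNCATE / WRITE_EMPTY overwrite the table anyway.
    return False, True


# Appending to an existing table with nothing set: check existence, don't autodetect.
print(resolve_autodetect(False, None, None, table_exists=True))  # (True, False)
```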

@Linchin
Contributor

Linchin commented Jan 25, 2024

After playing around, I'm starting to feel like we don't have to set autodetect = True in the first place. For example, when I use load_table_from_json with neither schema nor autodetect set, the backend is still able to detect the schema. As these two lines of code were added five years ago, I guess maybe we had to do it in the past, but the backend has been updated since? @tswast I wanted to double-check the reason we added it back then.

I think my test sample was wrong, and the backend still throws an error if no schema is provided and autodetect = False/None.

@tswast
Contributor

tswast commented Jan 25, 2024

If a table doesn't already exist, it used to be the case that either you had to specify a schema or you had to specify autodetect = True. With neither specified, it would error. It could be that now BigQuery automatically autodetects the schema in this case?

@tswast
Contributor

tswast commented Jan 25, 2024

If the table does exist, you don't have to set anything.

@Linchin
Contributor

Linchin commented Jan 25, 2024

It could be that now BigQuery automatically autodetects the schema in this case?

I think so. I'll just remove these two lines then.
