
load_table_from_json interpolates string as int #1228

Closed
ben-marengo-msmg opened this issue Apr 22, 2022 · 12 comments · Fixed by #1804
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@ben-marengo-msmg

Issue

Using load_table_from_json() to load data into a STRING column fails when a value can be interpreted as an int (e.g. "123").

See the code snippet below, which shows the inconsistency with the other data-loading APIs.

Environment details

  • OS type and version: macos 12.2.1
  • Python version: 3.9.5 (also errors on 3.8)
  • pip version: pip --version
  • google-cloud-bigquery version: 2.31.0 & 3.0.1

Code example

import io
import json

from google.api_core.exceptions import BadRequest
from google.cloud.bigquery import Client, Table, SchemaField

data = [{"col_1": "123"}]

project = 'my-project'
bq = Client(project)

table = Table(f'{project}.aaa.data_load_test', schema=[SchemaField("col_1", "STRING"), ])
bq.delete_table(table, not_found_ok=True)
bq.create_table(table)


# WORKS: bq load from byte stream
with io.StringIO() as fp:
    json.dump(data, fp)
    fp.seek(0)
    bq.load_table_from_file(fp, table).result()


# WORKS: streaming api
bq.insert_rows_json(table, data)


try:
    # FAILS: bq load from python dict.
    # "123" is interpreted as an int, even though it is supplied as a string
    bq.load_table_from_json(data, table).result()
except BadRequest as e:
    print("ERROR")
    print(e)


# WORKS: bq load from python dict, because "123d" cannot be interpreted as an int
bq.load_table_from_json([{"col_1": "123d"}], table).result()

Stack trace

Traceback (most recent call last):
  File "/Users/ben.marengo/code/scratch/bq_load.py", line 28, in <module>
    bq.load_table_from_json(data, table).result()
  File "/Users/ben.marengo/code/pyenvs/scratch/lib/python3.9/site-packages/google/cloud/bigquery/job/base.py", line 728, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/Users/ben.marengo/code/pyenvs/scratch/lib/python3.9/site-packages/google/api_core/future/polling.py", line 137, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table my-project:aaa.data_load_test. Field col_1 has changed type from STRING to INTEGER
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Apr 22, 2022
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Apr 23, 2022
@steffnay steffnay added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Apr 24, 2022
@steffnay steffnay self-assigned this May 2, 2022
@tswast
Contributor

tswast commented May 2, 2022

This is a tough one. One possibility would be to omit any local schema interpolation (if there is any, I forget) and use an autodetect schema, instead.

I suspect you should be able to work around this issue by manually supplying the correct schema in a job configuration object.

@tswast
Contributor

tswast commented May 6, 2022

New theory: I see we don't have any client-side schema generation. Instead, the client adds autodetect = True if no schema is supplied.

if job_config.schema is None:
    job_config.autodetect = True

Perhaps omitting this flag when the table already exists will help?

I suspect that the backend first attempts to generate a schema from the provided data and then compares that generated schema with the existing table. If this is the case, we should also file a backend issue (possibly in addition to adding a workaround in the client if the table already exists).

@miguel-graindata

I have been coming across the same issue when trying to load json files into BigQuery. I thought the problem was already fixed (#1248) but the issue is still around.

How can we request that it is solved? This is the code that I am using to load data.

job_config = LoadJobConfig(
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_APPEND",
    autodetect=True,
)

bq_table = f"{project_id}.{dataset_id}.{table_id}"

job = bqclient.load_table_from_json(json_data, bq_table, job_config=job_config)

@meredithslota
Contributor

#1248 was never merged, so perhaps that fix is still relevant.

Re: #1228 (comment) — if this turns out to be a backend issue we should absolutely file a bug with the correct team (and I'd propose not adding a client-side workaround since we'd have to undo it later and that might cause even more problems for folks).

@Linchin
Contributor

Linchin commented Jan 25, 2024

I think this is less of a backend bug, since we are sending conflicting information to the backend: autodetect shouldn't be true when the table already exists. The problem is that the client does not know whether the table exists without making an API call to verify it. Adding this call may make the method take longer, but it will reduce user confusion and time spent on errors like this. We could also update the documentation to ask people to make sure the local schema aligns with the existing table. WDYT? @tswast

@tswast
Contributor

tswast commented Jan 25, 2024

Adding this call may make the method take longer, but will reduce user confusion and time spent on errors like this.

We already do this in some other methods, such as list_rows:

# No schema, but no selected_fields. Assume the developer wants all
# columns, so get the table resource for them rather than failing.
elif len(schema) == 0:
    table = self.get_table(table.reference, retry=retry, timeout=timeout)
    schema = table.schema

Note: We should only make this call to get_table if all of the following are true:

  • The job config specifies "APPEND" mode (or is not set, as this is the default, I believe).
  • The job config does not already have a schema or autodetect manually set.
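
Those two conditions could be sketched as a small predicate (a hypothetical helper, not the actual client code; it assumes the LoadJobConfig attributes shown, which default to None):

```python
from types import SimpleNamespace


def should_fetch_table_schema(job_config):
    """True when load_table_from_json should call get_table first."""
    append_mode = job_config.write_disposition in (None, "WRITE_APPEND")
    nothing_set = job_config.schema is None and job_config.autodetect is None
    return append_mode and nothing_set


# Stand-in for a LoadJobConfig with no user-set fields: fetch the schema.
cfg = SimpleNamespace(write_disposition=None, schema=None, autodetect=None)
print(should_fetch_table_schema(cfg))  # True

# WRITE_TRUNCATE replaces the table, so there is no schema worth fetching.
cfg = SimpleNamespace(write_disposition="WRITE_TRUNCATE", schema=None, autodetect=None)
print(should_fetch_table_schema(cfg))  # False
```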

@Linchin
Contributor

Linchin commented Jan 25, 2024

Awesome, I'll go fix this issue.

@Linchin
Contributor

Linchin commented Jan 25, 2024

Truth table: the conditions under which the client should check table existence, and the autodetect value the client should set for the user.

| Schema provided (user) | Autodetect (user) | writeDisposition | Check table existence (client) | Autodetect (client) |
|---|---|---|---|---|
| T | T/F/not provided | * | F | same as user value |
| F | T/F | * | F | same as user value |
| F | not provided | WRITE_APPEND / not provided | T | F if table exists, otherwise T |
| F | not provided | WRITE_TRUNCATE / WRITE_EMPTY | F | T |
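
The truth table above can be expressed as a pure function (a sketch; table_exists stands in for the result of the proposed get_table call):

```python
def resolve_autodetect(schema_provided, autodetect, write_disposition, table_exists):
    """Return (check_table_existence, effective_autodetect) per the truth table."""
    if schema_provided or autodetect is not None:
        # Rows 1-2: the user made a choice; respect it, no existence check.
        return False, autodetect
    if write_disposition in (None, "WRITE_APPEND"):
        # Row 3: appending with nothing set; autodetect only for a new table.
        return True, not table_exists
    # Row 4: WRITE_TRUNCATE / WRITE_EMPTY overwrite the table anyway.
    return False, True


# Appending to an existing table with nothing set: check existence, don't autodetect.
print(resolve_autodetect(False, None, None, table_exists=True))  # (True, False)
```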

@Linchin
Contributor

Linchin commented Jan 25, 2024

After playing around, I'm starting to feel like we don't have to set autodetect = True in the first place. For example, when I use load_table_from_json with neither schema nor autodetect set, the backend is still able to detect the schema. As these two lines of code were added five years ago, I guess maybe we had to do it in the past, but the backend has been updated since? @tswast I wanted to double-check the reason we added it back then.

I think my test sample was wrong, and the backend still throws an error if no schema is provided and autodetect = False/None.

@tswast
Contributor

tswast commented Jan 25, 2024

If a table doesn't already exist, it used to be the case that either you had to specify a schema or you had to specify autodetect = True. With neither specified, it would error. It could be that now BigQuery automatically autodetects the schema in this case?

@tswast
Contributor

tswast commented Jan 25, 2024

If the table does exist, you don't have to set anything.

@Linchin
Contributor

Linchin commented Jan 25, 2024

It could be that now BigQuery automatically autodetects the schema in this case?

I think so. I'll just remove these two lines then.
