As mentioned in #74, around July 11th pandas-gbq builds started failing this test: test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace.
I reviewed the test failure and my initial thought is that a change was made in the BigQuery backend recently that triggered this. The issue is related to deleting and recreating a table with a different schema. Currently we force a delay of 2 minutes when a table with a modified schema is recreated. This delay is suggested in this StackOverflow post and this entry in the BigQuery issue tracker. Based on my limited testing, it seems that in addition to waiting 2 minutes, you also need to upload the data twice in order to see the data in BigQuery. During the first upload, StreamingInsertError is raised. The second upload is successful.
You can easily confirm this by running the test locally. The test failure no longer appears when I change the code at https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056 so that the data is uploaded a second time.
Based on this behaviour, I believe you now need to upload the data twice after changing the schema. This could be a regression on the BigQuery side, since re-uploading the data wasn't required before.
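For illustration, the workaround at the pandas-gbq level looks roughly like the sketch below (my own minimal sketch, not the failing test itself; the gbq.to_gbq arguments and the gbq.StreamingInsertError class are assumed to match the current pandas_gbq/gbq.py):

import pandas as pd
from pandas_gbq import gbq

df = pd.DataFrame({'name': ['Phred Phlyntstone'], 'age': ['32']})

# First upload: 'replace' drops and re-creates the table with the new schema;
# pandas-gbq already forces the 2 minute delay when the schema has changed.
try:
    gbq.to_gbq(df, 'test_dataset.test_table', project_id='<your_project_id>',
               if_exists='replace')
except gbq.StreamingInsertError:
    # The first streaming insert after the schema change fails, so upload the
    # same data a second time into the (now existing) table.
    gbq.to_gbq(df, 'test_dataset.test_table', project_id='<your_project_id>',
               if_exists='append')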
I was also able to reproduce this issue with the google-cloud-bigquery package using the following code:
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
import time

client = bigquery.Client(project='<your_project_id>')  # replace with your project id

dataset = client.dataset('test_dataset')
if not dataset.exists():
    dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Start from a clean slate: drop the table if it already exists.
if table.exists():
    try:
        table.delete()
    except Exception:
        pass
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', 32),
    (u'Wylma Phlyntstone', 29),
]
table.insert_data(ROWS_TO_INSERT)

# Now change the schema.
SCHEMA = [
    SchemaField('name', 'STRING', mode='required'),
    SchemaField('age', 'STRING', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Delete the table, wait 2 minutes and re-create it with the new schema.
table.delete()
time.sleep(120)
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', '32'),
    (u'Wylma Phlyntstone', '29'),
]

# Retry the streaming insert until it reports no errors (up to 5 attempts).
for _ in range(5):
    insert_errors = table.insert_data(ROWS_TO_INSERT)
    if len(insert_errors):
        print(insert_errors)
        print('Retrying')
    else:
        break
The output was:
>>[{'index': 0, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}, {'index': 1, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}]
>>Retrying
However, prior to July 11th (or so), the retry wasn't required.
@tswast Would you be able to provide feedback on the above findings, and on whether you think this is a regression in the BigQuery backend? The solution in https://issuetracker.google.com/issues/35905247, which is to delay 120 seconds, no longer appears to work on its own: you now have to upload the data twice.
As far as I know, the 2-minute waiting time to stream to a new table still applies.
This issue will be fixed by #25, which updates this library to use a load job instead of streaming inserts to add data to a table. Load jobs have better guarantees on data consistency.
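For comparison, adding the same rows with a load job instead of insert_data looks roughly like the sketch below. This is a minimal sketch against the newer google-cloud-bigquery client API (LoadJobConfig and load_table_from_file), not the actual code from #25:

import io

from google.cloud import bigquery

client = bigquery.Client(project='<your_project_id>')
table_ref = client.dataset('test_dataset').table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.source_format = 'CSV'
job_config.schema = [
    bigquery.SchemaField('name', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('age', 'STRING', mode='REQUIRED'),
]

# The rows travel as a CSV payload inside a load job rather than a streaming insert.
csv_rows = io.BytesIO(b'Phred Phlyntstone,32\nWylma Phlyntstone,29\n')
job = client.load_table_from_file(csv_rows, table_ref, job_config=job_config)
job.result()  # Blocks until the load job completes; raises if the job failed.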
One thing that google-cloud-bigquery does is return the streaming insert errors rather than raising StreamingInsertError like we do in pandas-gbq; see https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/table.py#L826. We could follow a similar behaviour and have to_gbq return the streaming insert errors rather than raise StreamingInsertError, leaving it up to the user to check for streaming insert errors and retry if needed: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056
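If we went that route, caller-side code could retry along the lines of the hypothetical sketch below (this return value does not exist today; the sketch only illustrates the proposed behaviour):

import pandas as pd
import pandas_gbq

df = pd.DataFrame({'name': ['Phred Phlyntstone'], 'age': ['32']})

# Hypothetical: to_gbq returns the list of streaming insert errors instead of
# raising StreamingInsertError.
insert_errors = pandas_gbq.to_gbq(df, 'test_dataset.test_table',
                                  project_id='<your_project_id>',
                                  if_exists='replace')
if insert_errors:
    print(insert_errors)
    # The caller decides whether to retry; here we simply upload once more.
    pandas_gbq.to_gbq(df, 'test_dataset.test_table',
                      project_id='<your_project_id>',
                      if_exists='append')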