As mentioned in #74, around July 11th pandas-gbq builds started failing this test: test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace.
I reviewed the test failure and my initial thought is that a change was made in the BigQuery backend recently that triggered this. The issue is related to deleting and recreating a table with a different schema. Currently we force a delay of 2 minutes when a table with a modified schema is recreated. This delay is suggested in this StackOverflow post and this entry in the BigQuery issue tracker. Based on my limited testing, it seems that in addition to waiting 2 minutes, you also need to upload the data twice in order to see the data in BigQuery. During the first upload, StreamingInsertError is raised. The second upload is successful.
You can easily confirm this by running the test locally. The test failure no longer appears when I change the code at https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056 so that the data is uploaded a second time.
Based on this behaviour, I believe you now need to upload the data twice after changing the schema. This could be a regression on the BigQuery side, since re-uploading the data wasn't required before.
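For illustration, the workaround at the pandas-gbq level looks roughly like the sketch below (my own minimal sketch, not the failing test itself; the gbq.to_gbq arguments and the gbq.StreamingInsertError class are assumed to match the current pandas_gbq/gbq.py):

import pandas as pd
from pandas_gbq import gbq

df = pd.DataFrame({'name': ['Phred Phlyntstone'], 'age': ['32']})

# First upload: 'replace' drops and re-creates the table with the new schema;
# pandas-gbq already forces the 2 minute delay when the schema has changed.
try:
    gbq.to_gbq(df, 'test_dataset.test_table', project_id='<your_project_id>',
               if_exists='replace')
except gbq.StreamingInsertError:
    # The first streaming insert after the schema change fails, so upload the
    # same data a second time into the (now existing) table.
    gbq.to_gbq(df, 'test_dataset.test_table', project_id='<your_project_id>',
               if_exists='append')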
I was also able to reproduce this issue with the google-cloud-bigquery package using the following code:
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
import time

client = bigquery.Client(project='<your_project_id>')  # replace with your project id

dataset = client.dataset('test_dataset')
if not dataset.exists():
    dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Start from a clean slate: drop the table if it already exists.
if table.exists():
    try:
        table.delete()
    except Exception:
        pass
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', 32),
    (u'Wylma Phlyntstone', 29),
]
table.insert_data(ROWS_TO_INSERT)

# Now change the schema.
SCHEMA = [
    SchemaField('name', 'STRING', mode='required'),
    SchemaField('age', 'STRING', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Delete the table, wait 2 minutes and re-create it with the new schema.
table.delete()
time.sleep(120)
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', '32'),
    (u'Wylma Phlyntstone', '29'),
]

# Retry the streaming insert until it reports no errors (up to 5 attempts).
for _ in range(5):
    insert_errors = table.insert_data(ROWS_TO_INSERT)
    if len(insert_errors):
        print(insert_errors)
        print('Retrying')
    else:
        break
The output was:
>>[{'index': 0, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}, {'index': 1, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}]
>>Retrying
However, prior to July 11th (or so), the retry wasn't required.
@tswast Would you be able to provide feedback on the above findings, and on whether you think this is a regression in the BigQuery backend? The solution in https://issuetracker.google.com/issues/35905247, which is to delay 120 seconds, no longer appears to work on its own: you now have to upload the data twice.
As far as I know, the 2-minute waiting time to stream to a new table still applies.
This issue will be fixed by #25, which updates this library to use a load job instead of streaming inserts to add data to a table. Load jobs have better guarantees on data consistency.
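For comparison, adding the same rows with a load job instead of insert_data looks roughly like the sketch below. This is a minimal sketch against the newer google-cloud-bigquery client API (LoadJobConfig and load_table_from_file), not the actual code from #25:

import io

from google.cloud import bigquery

client = bigquery.Client(project='<your_project_id>')
table_ref = client.dataset('test_dataset').table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.source_format = 'CSV'
job_config.schema = [
    bigquery.SchemaField('name', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('age', 'STRING', mode='REQUIRED'),
]

# The rows travel as a CSV payload inside a load job rather than a streaming insert.
csv_rows = io.BytesIO(b'Phred Phlyntstone,32\nWylma Phlyntstone,29\n')
job = client.load_table_from_file(csv_rows, table_ref, job_config=job_config)
job.result()  # Blocks until the load job completes; raises if the job failed.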
One thing that google-cloud-bigquery does is return the streaming insert errors rather than raising StreamingInsertError like we do in pandas-gbq; see https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/table.py#L826. We could follow a similar behaviour and have to_gbq return the streaming insert errors rather than raise StreamingInsertError, leaving it up to the user to check for streaming insert errors and retry if needed: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056
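If we went that route, caller-side code could retry along the lines of the hypothetical sketch below (this return value does not exist today; the sketch only illustrates the proposed behaviour):

import pandas as pd
import pandas_gbq

df = pd.DataFrame({'name': ['Phred Phlyntstone'], 'age': ['32']})

# Hypothetical: to_gbq returns the list of streaming insert errors instead of
# raising StreamingInsertError.
insert_errors = pandas_gbq.to_gbq(df, 'test_dataset.test_table',
                                  project_id='<your_project_id>',
                                  if_exists='replace')
if insert_errors:
    print(insert_errors)
    # The caller decides whether to retry; here we simply upload once more.
    pandas_gbq.to_gbq(df, 'test_dataset.test_table',
                      project_id='<your_project_id>',
                      if_exists='append')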