Add functions equivalent to create_rows and create_rows_json that create a table for you using a load job #4553

Closed
nanodan opened this issue Dec 7, 2017 · 15 comments · Fixed by #9076

nanodan commented Dec 7, 2017

  1. BigQuery API
  2. Linux, Ubuntu 16.04.3 LTS running Google Cloud Datalab 1.0.1
  3. Python 2.7.12
  4. Google cloud v 0.31.0
  5. //
  6. Create a dataframe or dict list and try to send it to an existing table. The first request raises no errors, but nothing is appended to the table. Running the same command again, the rows are added.

This will not add rows:

from google.cloud import bigquery as bq
import time

client = bq.Client(project='project-name')
dataset = bq.DatasetReference('project-name', 'dataset-name')
tableref = bq.table.TableReference(dataset, 'users')
schema = [
    bq.SchemaField('email', 'STRING'),
    bq.SchemaField('id', 'STRING'),
    bq.SchemaField('added', 'TIMESTAMP'),
]
table = bq.Table(tableref, schema=schema)
create = client.create_table(table)

public_members = [{'email': '[email protected]', 'id': '1234', 'added': '2017-12-01 13:13:13 UTC'}]
# Returns without error, but no row ever appears in the table.
client.create_rows_json(table, public_members)

This will add exactly one row:

from google.cloud import bigquery as bq
import time

client = bq.Client(project='project-name')
dataset = bq.DatasetReference('project-name', 'dataset-name')
tableref = bq.table.TableReference(dataset, 'users')
schema = [
    bq.SchemaField('email', 'STRING'),
    bq.SchemaField('id', 'STRING'),
    bq.SchemaField('added', 'TIMESTAMP'),
]
table = bq.Table(tableref, schema=schema)
create = client.create_table(table)

public_members = [{'email': '[email protected]', 'id': '1234', 'added': '2017-12-01 13:13:13 UTC'}]
# First request, made right after table creation: silently dropped.
client.create_rows_json(table, public_members)
time.sleep(15)
# Second request after a short wait: this one adds the row.
client.create_rows_json(table, public_members)
@theacodes theacodes added the api: bigquery and type: bug labels Dec 7, 2017
theacodes commented:

@tswast any ideas on this one?


tswast commented Dec 7, 2017

I believe this is a backend issue: the streaming buffer takes a little while to become available after the table create call.

According to https://stackoverflow.com/a/41446002/101923

If you delete or create a table, you must wait at least 2 minutes to start streaming data on it.

This may be due to the fact that the streaming buffer has a cache of table information, which gets refreshed every 60 seconds.

Closing since this isn't a client library issue.

@tswast tswast closed this as completed Dec 7, 2017
theacodes commented:

@tswast is the recommended alternative to just create tables ahead of time?


tswast commented Dec 7, 2017

If you know what rows you want to insert at table creation time, I recommend using the Client.load_table_from_file() method to insert the rows using a StringIO object as the file.

For example:

from google.cloud.bigquery import LoadJobConfig
from six import StringIO

destination_table = client.dataset(dataset_id).table(table_id)
job_config = LoadJobConfig()
job_config.write_disposition = 'WRITE_APPEND'
job_config.source_format = 'NEWLINE_DELIMITED_JSON'
rows = []

# Serialize each DataFrame row to a JSON line.
for _, row in maybe_a_dataframe.iterrows():
    row_json = row.to_json(force_ascii=False, date_unit='s', date_format='iso')
    rows.append(row_json)

body = StringIO('{}\n'.format('\n'.join(rows)))

client.load_table_from_file(
    body,
    destination_table,
    job_config=job_config).result()

@tswast tswast reopened this Dec 7, 2017

tswast commented Dec 7, 2017

Reopening, because we could probably help out in this case by adding something like create_rows that also creates the table for you, using a load job like the example above.
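
For illustration, a rough sketch of what such a helper could look like, built on load_table_from_file(); the function name and signature here are hypothetical, not part of the library:

import json

from google.cloud import bigquery
from six import StringIO


def insert_rows_json_via_load(client, destination, rows, schema=None):
    # Hypothetical helper: append JSON rows with a load job, creating the table if needed.
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'NEWLINE_DELIMITED_JSON'
    job_config.write_disposition = 'WRITE_APPEND'
    if schema is not None:
        # The load job creates the destination table from this schema if it does not exist.
        job_config.schema = schema
    else:
        # Otherwise let BigQuery infer the schema from the JSON rows.
        job_config.autodetect = True
    body = StringIO('\n'.join(json.dumps(row) for row in rows) + '\n')
    load_job = client.load_table_from_file(body, destination, job_config=job_config)
    return load_job.result()  # Block until the load job finishes.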

@tswast tswast changed the title from "create_rows and create_rows_json only work if I send the request twice with a delay" to "Add functions equivalent to create_rows and create_rows_json that create a table for you using a load job" Dec 7, 2017
@tswast tswast added the type: feature request and priority: p2 labels and removed the type: bug label Dec 7, 2017

tswast commented Dec 7, 2017

If we do implement such a method, we should probably encode as Avro, since Avro is supposedly 10x faster to load than JSON or CSV.
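
For reference, a rough sketch of what an Avro-encoded load might look like, using the third-party fastavro package (an assumption here, not something this library requires) and string columns only; client and destination_table are assumed to be set up as in the earlier example:

import io

from fastavro import parse_schema, writer
from google.cloud import bigquery

# Avro schema for a simple two-column table (illustrative only).
avro_schema = parse_schema({
    'type': 'record',
    'name': 'User',
    'fields': [
        {'name': 'email', 'type': 'string'},
        {'name': 'id', 'type': 'string'},
    ],
})
records = [{'email': 'user@example.com', 'id': '1234'}]

# Serialize the rows to an in-memory Avro file.
body = io.BytesIO()
writer(body, avro_schema, records)
body.seek(0)

job_config = bigquery.LoadJobConfig()
job_config.source_format = 'AVRO'
job_config.write_disposition = 'WRITE_APPEND'

client.load_table_from_file(body, destination_table, job_config=job_config).result()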


nanodan commented Dec 7, 2017

So I dropped the immediate push-after-create step to see whether the "error" would reproduce after a wait, as suggested.

I created the table (and verified that it was indeed created), then slept for 4 minutes. After that I sent the request and waited another 4 minutes before checking the table; there was a row in the streaming buffer, so it seems you are correct that this is not a bug.

This would still be a nice feature, though, as I sometimes have to programmatically create tables on the fly from large JSON files (user lists and the like). But now that I know I have to wait, I can simply add a delay in the code.
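
For anyone else landing here, a minimal sketch of that delay workaround, using the create_rows_json API from this version of the client; the 2-minute wait follows the StackOverflow answer quoted above, and the project, dataset, and row values are placeholders:

import time

from google.cloud import bigquery as bq

client = bq.Client(project='project-name')
dataset = bq.DatasetReference('project-name', 'dataset-name')
table = bq.Table(
    bq.table.TableReference(dataset, 'users'),
    schema=[
        bq.SchemaField('email', 'STRING'),
        bq.SchemaField('id', 'STRING'),
        bq.SchemaField('added', 'TIMESTAMP'),
    ],
)
client.create_table(table)

# Give the new table's streaming buffer time to become available.
time.sleep(120)

rows = [{'email': 'user@example.com', 'id': '1234', 'added': '2017-12-01 13:13:13 UTC'}]
errors = client.create_rows_json(table, rows)
assert not errors  # create_rows_json returns a list of per-row insert errors.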


tswast commented Dec 7, 2017

It's a little clunky, but if you use client.load_table_from_file() with a StringIO object, you can avoid having to wait.


tseaver commented Jun 19, 2018

@tswast Does this issue need to remain open?


tswast commented Jun 19, 2018

Yeah, we haven't completed this feature request. It is on @alixhami's list of OKRs to tackle.


tswast commented Jun 19, 2018

Actually, @alixhami implemented load_table_from_dataframe(), which covers a similar use case, but it uploads a pandas DataFrame rather than JSON or JSON-like rows.
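
For completeness, a rough usage sketch of that method; it assumes pandas (and, depending on the client version, pyarrow) is installed, and that client and destination_table are defined as in the earlier example:

import pandas

# Placeholder data; in practice this comes from whatever produced your rows.
dataframe = pandas.DataFrame([{'email': 'user@example.com', 'id': '1234'}])

# Runs a load job; CREATE_IF_NEEDED is the default, so the destination table
# is created automatically if it does not already exist.
load_job = client.load_table_from_dataframe(dataframe, destination_table)
load_job.result()  # Wait for the load job to finish.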

@tseaver tseaver removed the "🚨 This issue needs some love." and priority: p2 labels Jul 20, 2018

tseaver commented Oct 11, 2018

@tswast Should this item remain open here, or are we tracking it somewhere in a feature backlog?


tswast commented Oct 11, 2018

@tseaver We can add it to a feature request backlog.


plamut commented Aug 15, 2019

Posting for better visibility - when this is implemented, use it instead of _add_rows(), a similar helper method in system tests - #8992 (comment)

smarquezs commented:

(quoting tswast's load_table_from_file() example above)

It worked fine for me, thanks :)
