
User-provided schema as to_gbq parameter #44

Closed
ghost opened this issue May 31, 2017 · 13 comments
Labels
type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

ghost commented May 31, 2017

The gbq.to_gbq function currently derives the schema from a given DataFrame's dtypes attribute, using a dtype -> BQ data type map. The reflected schema is then passed as the fields value in the BQ API call.

I'd like to propose making it possible to pass a schema as an argument to the to_gbq call. This would save users from painful, unnecessary dtype conversion, as the BQ API does the same conversion again on the create operation and is not given a schema at all on the append loadAll operation.

@jreback what do you think? I have started on an implementation in my fork's feature branch.
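For context, the reflection described above amounts to something like the following minimal sketch; the real logic lives in pandas-gbq's _generate_bq_schema and may differ in detail:

# Minimal sketch of the dtype -> BQ data type reflection described above.
# Illustrative only; the actual pandas-gbq map may differ.
import pandas as pd

DTYPE_KIND_TO_BQ = {
    'i': 'INTEGER',    # int64 and friends
    'u': 'INTEGER',    # unsigned ints
    'b': 'BOOLEAN',
    'f': 'FLOAT',
    'M': 'TIMESTAMP',  # datetime64[ns]
    'O': 'STRING',     # object columns fall back to STRING
}

def generate_bq_schema(df: pd.DataFrame) -> dict:
    """Reflect a DataFrame's dtypes into the 'fields' value for the BQ API."""
    fields = [
        {'name': col, 'type': DTYPE_KIND_TO_BQ.get(df[col].dtype.kind, 'STRING')}
        for col in df.columns
    ]
    return {'fields': fields}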

jreback (Contributor) commented May 31, 2017

This would save users from painful, unnecessary dtype conversion, as the BQ API does the same conversion again on

Not sure what you mean. How is this painful and unnecessary? What exactly are you proposing?

ghost (Author) commented May 31, 2017

to_gbq would gain a table_schema parameter, a list of dicts in the format:

[
  {'name': 'col1', 'type': 'TIMESTAMP'},
  {'name': 'col2', 'type': 'STRING'},
  ...
]

If this schema is provided as an argument, to_gbq would skip _generate_bq_schema(dataframe) and use the user-provided schema in the BQ API call.
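As a usage sketch of the proposed parameter (hedged: the call below follows the proposal, not a settled API, and the table and project names are placeholders):

# Hypothetical usage of the proposed table_schema parameter.
import pandas as pd
from pandas_gbq import to_gbq

df = pd.DataFrame({'col1': ['2017-05-31 12:00:00'], 'col2': ['foo']})

to_gbq(df, 'my_dataset.my_table', project_id='my-project',
       table_schema=[
           {'name': 'col1', 'type': 'TIMESTAMP'},
           {'name': 'col2', 'type': 'STRING'},
       ])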

What I mean by painful, unnecessary dtype conversion: if you have a DataFrame whose Series are np.dtype('O') (object) but some of them actually hold timestamps or numeric values, it's not always easy to cast the columns (e.g. via df.astype). This data wrangling currently has to be done even though all of it could be avoided by providing the schema as I'm proposing.

Does the idea make sense? Please see the code in the branch I linked to see how this could be done with minimal effort.

jreback (Contributor) commented May 31, 2017

Again, not sure why this is necessary. We already know the types and simply need to map them. Can you show an example where you would not want to do this?

max-sixty (Contributor)

An option to provide the schema sounds reasonable - e.g. sometimes you want to specify DATE rather than DATETIME.

ghost (Author) commented May 31, 2017

Let's take a practical example: a CSV file read into a DataFrame via pandas.read_csv(...). There is an id column, and in this case the id values happen to be pure integers, so read_csv gives that particular Series an integer dtype. The column then needs to be cast to string so that the BQ == df schema validation succeeds.

Or the source may contain an integer or string field (1/0 or true/false) that should be a BOOLEAN field in the destination table; additional casting operations are required here again. The same applies to bytes, or date and time.
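A sketch of the casting boilerplate this forces today, assuming a toy CSV and a destination table with STRING and BOOLEAN columns (column names are made up):

# Toy example of the wrangling described above.
import io
import pandas as pd

csv = io.StringIO("id,active\n1001,1\n1002,0\n")
df = pd.read_csv(csv)   # read_csv infers int64 for both columns

# Without a user-provided schema, the caller must cast to match the table:
df['id'] = df['id'].astype(str)           # STRING column in BQ
df['active'] = df['active'].astype(bool)  # BOOLEAN column in BQ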

The alternative would be to to_gbq the DataFrame as it is and not infer the column types (as they are in the BigQuery table) from the DataFrame, but instead provide a schema, a contract, that describes the tabular data to be sent.

At least I'm working with integrations that have a solid contract about their outputs' types and that conform to a standard. Right now this requires unnecessary boilerplate in situations where the transform part of the ETL equation could be dropped altogether.

I could continue with the Pythonic argument that explicit (providing the schema) is better than implicit (inferring the schema), but I think the previous points are already pretty compelling.

This wouldn't risk breaking anything, as the parameter would default to None with a simple conditional on whether the argument was supplied. The connector would probably also need some try/except handling to raise a reasonable exception when the user's schema doesn't cast the data correctly on the Google Cloud end.
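A minimal sketch of that non-breaking change, reusing the _generate_bq_schema name mentioned above (signature abbreviated, other kwargs elided):

# Sketch only: fall back to dtype inference when no schema is supplied.
def to_gbq(dataframe, destination_table, project_id=None,
           table_schema=None, **kwargs):
    if table_schema is None:
        # current behavior: reflect the DataFrame's dtypes
        table_schema = _generate_bq_schema(dataframe)
    # ...pass table_schema as the 'fields' value in the BQ API call...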

jreback (Contributor) commented Jun 1, 2017

@mremes I don't have a problem with providing a schema; this just puts the burden on BQ conversions rather than on the pandas side. So as an optional argument this would be OK. If it's provided, the df is passed as-is (no conversions done), so the burden moves entirely to the caller.

@jreback jreback added this to the 0.2.0 milestone Jul 5, 2017
@jreback jreback added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Jul 5, 2017
@jreback jreback modified the milestones: 0.2.0, 0.3.0 Jul 22, 2017
robfraz commented Sep 20, 2017

Just wanted to add a note of support for this feature. I'm having a hard time using to_gbq to upload data into a table whose schema has a column of BigQuery Standard SQL type 'DATE'. Being able to be explicit about the schema sounds like it might help with my issue, although I'm not sure whether it's more to do with pandas-gbq having a general lack of understanding of the 'DATE' type?
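For the DATE case specifically, the proposed parameter would let the caller pin the type directly; a hedged sketch, with placeholder table, project, and column names:

# Hypothetical: pin a column to BigQuery's DATE type instead of an
# inferred TIMESTAMP, assuming the proposed table_schema parameter.
import pandas as pd
from pandas_gbq import to_gbq

df = pd.DataFrame({'event_date': ['2017-09-20']})
to_gbq(df, 'my_dataset.events', project_id='my-project',
       table_schema=[{'name': 'event_date', 'type': 'DATE'}])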

DanielWFrancis
Wishing this existed at the moment because of the DATE type issue.

max-sixty (Contributor)

With a PR... your wish can be answered!

It's Aladdin's genie, but you get as many wishes as you want.

mremes (Contributor) commented Jan 24, 2018

PR’s been ready for months... #46

max-sixty (Contributor)

PR’s been ready for months... #46

OK, sorry for both the delay and the ignorance of the response above. I will ping the maintainer to try and get this merged (it needs a rebase first, though).

max-sixty (Contributor) commented Jan 24, 2018

The author seems to have deleted his account. Unless he responds, would anyone like to take this up?

I will commit to quarterbacking this through if someone can make the required changes.

DanielWFrancis commented Jan 24, 2018 via email
