
User-provided schema as to_gbq parameter #44

Closed
ghost opened this issue May 31, 2017 · 13 comments
Labels
type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

ghost commented May 31, 2017

The gbq.to_gbq function currently derives the schema from a given DataFrame's dtypes attribute, using a dtype -> BQ data type map. The reflected schema is then passed as the fields value in the BQ API call.

I'd like to propose making it possible to pass a schema as an argument to the to_gbq call. This would save users from painful, unnecessary dtype conversion, as the BQ API does the same conversion again on the create operation and is not given a schema at all on the append loadAll operation.

@jreback what do you think? I have started on an implementation in my fork's feature branch.
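For context, the reflection described above amounts to something like the following minimal sketch; the real logic lives in pandas-gbq's _generate_bq_schema and may differ in detail:

# Minimal sketch of the dtype -> BQ data type reflection described above.
# Illustrative only; the actual pandas-gbq map may differ.
import pandas as pd

DTYPE_KIND_TO_BQ = {
    'i': 'INTEGER',    # int64 and friends
    'u': 'INTEGER',    # unsigned ints
    'b': 'BOOLEAN',
    'f': 'FLOAT',
    'M': 'TIMESTAMP',  # datetime64[ns]
    'O': 'STRING',     # object columns fall back to STRING
}

def generate_bq_schema(df: pd.DataFrame) -> dict:
    """Reflect a DataFrame's dtypes into the 'fields' value for the BQ API."""
    fields = [
        {'name': col, 'type': DTYPE_KIND_TO_BQ.get(df[col].dtype.kind, 'STRING')}
        for col in df.columns
    ]
    return {'fields': fields}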

jreback (Contributor) commented May 31, 2017

This would save users from painful, unnecessary dtype conversion, as the BQ API does the same conversion again on

Not sure what you mean. How is this painful and unnecessary? What exactly are you proposing?

ghost (Author) commented May 31, 2017

to_gbq would gain a table_schema parameter, a list of dicts in the format:

[
  {'name': 'col1', 'type': 'TIMESTAMP'},
  {'name': 'col2', 'type': 'STRING'},
  ...
]

If this schema is provided as an argument, to_gbq would skip _generate_bq_schema(dataframe) and use the user-provided schema in the BQ API call.
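As a usage sketch of the proposed parameter (hedged: the call below follows the proposal, not a settled API, and the table and project names are placeholders):

# Hypothetical usage of the proposed table_schema parameter.
import pandas as pd
from pandas_gbq import to_gbq

df = pd.DataFrame({'col1': ['2017-05-31 12:00:00'], 'col2': ['foo']})

to_gbq(df, 'my_dataset.my_table', project_id='my-project',
       table_schema=[
           {'name': 'col1', 'type': 'TIMESTAMP'},
           {'name': 'col2', 'type': 'STRING'},
       ])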

What I mean by painful, unnecessary dtype conversion: if you have a DataFrame whose Series are np.dtype('O') (object) but some of them actually hold timestamps or numeric values, it's not always easy to cast the columns (e.g. via df.astype). This data wrangling currently has to be done even though all of it could be avoided by providing the schema as I'm proposing.

Does the idea make sense? Please see the code in the branch I linked to see how this could be done with minimal effort.

jreback (Contributor) commented May 31, 2017

Again, not sure why this is necessary. We already know the types and simply need to map them. Can you show an example where you would not want to do this?

max-sixty (Contributor)

An option to provide the schema sounds reasonable - e.g. sometimes you want to specify DATE rather than DATETIME.

ghost (Author) commented May 31, 2017

Let's take a practical example: a CSV file read into a DataFrame via pandas.read_csv(...). There is an id column, and in this case the id values happen to be pure integers, so read_csv gives that particular Series an integer dtype. The column then needs to be cast to string so that the BQ == df schema validation succeeds.

Or the source may contain an integer or string field (1/0 or true/false) that should be a BOOLEAN field in the destination table; additional casting operations are required here again. The same applies to bytes, or date and time.
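A sketch of the casting boilerplate this forces today, assuming a toy CSV and a destination table with STRING and BOOLEAN columns (column names are made up):

# Toy example of the wrangling described above.
import io
import pandas as pd

csv = io.StringIO("id,active\n1001,1\n1002,0\n")
df = pd.read_csv(csv)   # read_csv infers int64 for both columns

# Without a user-provided schema, the caller must cast to match the table:
df['id'] = df['id'].astype(str)           # STRING column in BQ
df['active'] = df['active'].astype(bool)  # BOOLEAN column in BQ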

The alternative would be to to_gbq the DataFrame as it is and not infer the column types (as they are in the BigQuery table) from the DataFrame, but instead provide a schema, a contract, that describes the tabular data to be sent.

At least I'm working with integrations that have a solid contract about their outputs' types and that conform to a standard. Right now this requires unnecessary boilerplate in situations where the transform part of the ETL equation could be dropped altogether.

I could continue with the Pythonic argument that explicit (providing the schema) is better than implicit (inferring the schema), but I think the previous points are already pretty compelling.

This wouldn't risk breaking anything, as the parameter would default to None with a simple conditional on whether the argument was supplied. The connector would probably also need some try/except handling to raise a reasonable exception when the user's schema doesn't cast the data correctly on the Google Cloud end.
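A minimal sketch of that non-breaking change, reusing the _generate_bq_schema name mentioned above (signature abbreviated, other kwargs elided):

# Sketch only: fall back to dtype inference when no schema is supplied.
def to_gbq(dataframe, destination_table, project_id=None,
           table_schema=None, **kwargs):
    if table_schema is None:
        # current behavior: reflect the DataFrame's dtypes
        table_schema = _generate_bq_schema(dataframe)
    # ...pass table_schema as the 'fields' value in the BQ API call...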

jreback (Contributor) commented Jun 1, 2017

@mremes I don't have a problem with providing a schema; this just puts the burden on BQ conversions rather than on the pandas side. So as an optional argument this would be OK. If it's provided, the df is passed as-is (no conversions done), so the burden moves entirely to the caller.

@jreback jreback added this to the 0.2.0 milestone Jul 5, 2017
@jreback jreback added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Jul 5, 2017
@jreback jreback modified the milestones: 0.2.0, 0.3.0 Jul 22, 2017
robfraz commented Sep 20, 2017

Just wanted to add a note of support for this feature. I'm having a hard time using to_gbq to upload data into a table whose schema has a column of BigQuery Standard SQL type 'DATE'. Being able to be explicit about the schema sounds like it might help with my issue, although I'm not sure whether it's more to do with pandas-gbq having a general lack of understanding of the 'DATE' type?
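For the DATE case specifically, the proposed parameter would let the caller pin the type directly; a hedged sketch, with placeholder table, project, and column names:

# Hypothetical: pin a column to BigQuery's DATE type instead of an
# inferred TIMESTAMP, assuming the proposed table_schema parameter.
import pandas as pd
from pandas_gbq import to_gbq

df = pd.DataFrame({'event_date': ['2017-09-20']})
to_gbq(df, 'my_dataset.events', project_id='my-project',
       table_schema=[{'name': 'event_date', 'type': 'DATE'}])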

DanielWFrancis
Wishing this existed at the moment because of the DATE type issue.

max-sixty (Contributor)

With a PR... your wish can be answered!

It's Aladdin's genie, but you get as many wishes as you want.

mremes (Contributor) commented Jan 24, 2018

PR’s been ready for months... #46

max-sixty (Contributor)

PR’s been ready for months... #46

OK, sorry for both the delay and the ignorance of the response above. I will ping the maintainer to try and get this merged (it needs a rebase first, though).

max-sixty (Contributor) commented Jan 24, 2018

The author seems to have deleted his account. Unless he responds, would anyone like to take this up?

I will commit to quarterbacking this through if someone can make the required changes.

DanielWFrancis commented Jan 24, 2018 via email
