-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add table_schema parameter for user-defined BigQuery schema #46
Conversation
Codecov Report
@@ Coverage Diff @@
## master #46 +/- ##
==========================================
- Coverage 28.49% 27.54% -0.96%
==========================================
Files 4 4
Lines 1537 1561 +24
==========================================
- Hits 438 430 -8
- Misses 1099 1131 +32
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. can you add an entry in doc/source/changelog.rst for 0.2.0
pandas_gbq/gbq.py
Outdated
List of BigQuery table fields to which according DataFrame columns | ||
conform to, e.g. `[{'name': 'col1', 'type': 'STRING'},...]`. If | ||
schema is not provided, it will be generated according to dtypes | ||
of DataFrame columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag. are there required names in the schema (e.g. name, type)? list what is required.
if not table_schema: | ||
table_schema = _generate_bq_schema(dataframe) | ||
else: | ||
table_schema = dict(fields=table_schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assume validation is now up to BQ. can you test this though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current implementation will throw StreamingInsetError
after a chunk is done (see tests) along printing error trace from the BQ API. Which is OK.
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1258,6 +1258,34 @@ def test_verify_schema_ignores_field_mode(self): | |||
assert self.sut.verify_schema( | |||
self.dataset_prefix + "1", TABLE_ID + test_id, test_schema_2) | |||
|
|||
def test_upload_data_with_valid_user_schema(self): | |||
df = tm.makeMixedDataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment
dict(fields=test_schema)) | ||
|
||
def test_upload_data_with_invalid_user_schema_raises_error(self): | ||
df = tm.makeMixedDataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also tests with missing keys in the schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test added.
cc @parthea |
once this is merged, will need a PR to pandas itself to modify the signature of |
1c89ef2
to
fc948bb
Compare
This can be re-reviewed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add an entry in changelog (0.2.0). minor comments. ping when pushed / green.
pandas_gbq/gbq.py
Outdated
@@ -815,6 +816,13 @@ def to_gbq(dataframe, destination_table, project_id, chunksize=10000, | |||
Service account private key in JSON format. Can be file path | |||
or string contents. This is useful for remote server | |||
authentication (eg. jupyter iPython notebook on remote host) | |||
table_schema : list of dicts | |||
.. versionadded:: 0.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the versionadded should go after the expl (with a blank line before ), otherwise doesn't render.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also needs to line up with the indent for the expl.
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1258,6 +1258,49 @@ def test_verify_schema_ignores_field_mode(self): | |||
assert self.sut.verify_schema( | |||
self.dataset_prefix + "1", TABLE_ID + test_id, test_schema_2) | |||
|
|||
# Issue #46; tests test scenarios with user-provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the comment inside the test
cc @parthea |
c0e3cd8
to
4e9a6cd
Compare
@jreback this is green now. |
can you rebase |
@jreback done |
@mremes you need to rebase on master so that there are no conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny comment. ping on green.
docs/source/changelog.rst
Outdated
@@ -6,6 +6,7 @@ Changelog | |||
|
|||
- Resolve issue where the optional ``--noauth_local_webserver`` command line argument would not be propagated during the authentication process. (:issue:`35`) | |||
- Drop support for Python 3.4 (:issue:`40`) | |||
- Add support for users to provide a table schema in ``to_gbq`` call instead of letting module to infer the schema from ``DataFrame.dtypes`` (:issue:`46`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add support for a passed schema in :func:to_gbq
instead inferring the schema from the passed DataFrame
with DataFrame.dtypes
pls rebase |
sorry...just saw your comment about after #47 |
Ghost - did you delete your account? This needs a small rebase and then I am happy to quarterback it to get it approved and merged |
7345159
to
6792076
Compare
Ghost here 👻 This is rebased now. |
6792076
to
8791c49
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you setup Travis CI on your fork according to https://pandas-gbq.readthedocs.io/en/latest/contributing.html#running-google-bigquery-integration-tests and link to the build logs?
pandas_gbq/tests/test_gbq.py
Outdated
{'name': 'C', 'type': 'FLOAT'}, | ||
{'name': 'D', 'type': 'FLOAT'}] | ||
destination_table = self.destination_table + test_id | ||
with tm.assertRaises(gbq.StreamingInsertError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StreamingInsertError was removed in 0.3.0. to_gbq
now creates a load job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced with the generic error. Errors need a better hierarchy in this module though.
pandas_gbq/tests/test_gbq.py
Outdated
# schemas | ||
df = tm.makeMixedDataFrame() | ||
test_id = "15" | ||
test_schema = [{'name': 'A', 'type': 'FLOAT'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a chance this might fail with version 0.29.0 of google-cloud-bigquery
due to googleapis/google-cloud-python#4456
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generated schemas do not include the mode
property in the fields either. So this should be fine.
088038e
to
6740634
Compare
@tswast Comments addressed / responded. I cleaned BigQuery for any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mremes ! Most of the test failures look like flakes (BigQuery is having some troubles right now) but there's one real one. I've left a comment.
pandas_gbq/tests/test_gbq.py
Outdated
{'name': 'C', 'type': 'FLOAT'}, | ||
{'name': 'D', 'type': 'FLOAT'}] | ||
destination_table = self.destination_table + test_id | ||
with tm.assertRaises(gbq.GenericGBQException): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tm.assertRaises()
doesn't exist: https://travis-ci.org/mremes/pandas-gbq/jobs/333441824#L2850
Since we are using PyTest as our test runner, I think you can use pytest.raises(...)
https://docs.pytest.org/en/latest/assert.html#assertions-about-expected-exceptions
Thanks for fixing it to use |
4fc475d
to
4cfc00b
Compare
Done. |
Thanks a lot @mremes ! |
Is this integrated in the latest |
@faresbessrour It hasn't been released yet. |
But you can use with |
@faresbessrour Just released! |
Closes #44