When appending to a table, load if the dataframe contains a subset of the existing schema #24
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #24       +/-   ##
===========================================
- Coverage   74.18%   28.93%   -45.26%
===========================================
  Files           4        4
  Lines        1511     1552       +41
===========================================
- Hits         1121      449      -672
- Misses        390     1103      +713
Continue to review full report at Codecov.
pandas_gbq/tests/test_gbq.py
Outdated
PROJECT_ID = None
PRIVATE_KEY_JSON_PATH = None
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('PROJECT_ID')
I think the Google library pulls from GOOGLE_CLOUD_PROJECT (more confident about the newer library, which this package doesn't use yet). So you could either replicate that variable, or pass None through and the Google library should pick it up.
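For illustration, a minimal sketch of how the test configuration could combine both suggestions (the fallback order here is an assumption, not what the PR currently does):

```python
import os

# Hypothetical sketch: prefer the package-specific variables and fall back to
# the standard Google ones that the client libraries already understand.
PROJECT_ID = os.getenv('PROJECT_ID') or os.getenv('GOOGLE_CLOUD_PROJECT')
PRIVATE_KEY_JSON_PATH = (os.getenv('PRIVATE_KEY_JSON_PATH')
                         or os.getenv('GOOGLE_APPLICATION_CREDENTIALS'))
# There is no standard Google variable for inline key contents (see the note
# below), so that one stays project-specific.
PRIVATE_KEY_JSON_CONTENTS = os.getenv('PRIVATE_KEY_JSON_CONTENTS')
```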
pandas_gbq/tests/test_gbq.py
Outdated
PRIVATE_KEY_JSON_PATH = None
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('PROJECT_ID')
PRIVATE_KEY_JSON_PATH = os.getenv('PRIVATE_KEY_JSON_PATH')
Same as above for GOOGLE_APPLICATION_CREDENTIALS
pandas_gbq/tests/test_gbq.py
Outdated
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('PROJECT_ID')
PRIVATE_KEY_JSON_PATH = os.getenv('PRIVATE_KEY_JSON_PATH')
PRIVATE_KEY_JSON_CONTENTS = os.getenv('PRIVATE_KEY_JSON_CONTENTS')
(Contents doesn't have a Google version, so we do need to choose our own for that)
            self.destination_table + test_id, _get_project_id(),
            if_exists='append', private_key=_get_private_key_path())

        sleep(30)  # <- Curses Google!!!
We have a function wait_for_job which we copied from the Google docs; I'll PR it in down the road (it relies on google.cloud.bigquery, so we need to move some other code onto that).
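For reference, the helper from the Google docs looks roughly like this (a sketch assuming the older google.cloud.bigquery job API, where job objects expose reload(), state, error_result and errors):

```python
import time

def wait_for_job(job):
    """Poll a BigQuery job until it finishes, raising on failure."""
    while True:
        job.reload()  # refresh the job state via a GET request
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
```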
pls update as @MaximilianR suggests (look at other tests to see how this is done)
@MaximilianR hmm, wait_for_job must be old...
pandas_gbq/tests/test_gbq.py
Outdated
PROJECT_ID = None
PRIVATE_KEY_JSON_PATH = None
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('GOOGLE_CLOUD_PROJECT')
https://pandas-gbq.readthedocs.io/en/latest/contributing.html are the docs for how to test this
I am ok with these environment variables (they are just for local testing). I agree with @jreback that it is also necessary to follow the steps in the contributing docs to configure Travis-CI to run on your pandas-gbq fork. You only need to configure Travis-CI once. After that, unit tests run automatically for your pandas-gbq fork after each push.
is there any reason to change …
pandas_gbq/gbq.py
Outdated
@@ -558,7 +558,9 @@ def load_data(self, dataframe, dataset_id, table_id, chunksize):

        self._print("\n")

    def verify_schema(self, dataset_id, table_id, schema):
    def schema(self, dataset_id, table_id):
        """Retrieve the schema of the table"""
can you add a doc-string
pandas_gbq/gbq.py
Outdated
        except HttpError as ex:
            self.process_http_error(ex)

    def verify_schema(self, dataset_id, table_id, schema):
        fields_remote = set([json.dumps(field)
same here
pandas_gbq/gbq.py
Outdated
        return fields_remote == fields_local

    def schema_is_subset(self, dataset_id, table_id, schema):
        fields_remote = set([json.dumps(field)
and here
pandas_gbq/gbq.py
Outdated
                             for field in self.schema(dataset_id, table_id)])
        fields_local = set(json.dumps(field_local)
                           for field_local in schema['fields'])
these routines should prob just take the result of self.schema(dataset_id, table_id), yes?
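For illustration, a sketch of that refactor (hypothetical signature: the comparison helper receives the already-fetched remote schema instead of dataset_id/table_id):

```python
import json

def schema_is_subset(schema_remote, schema_local):
    # Hypothetical sketch: schema_remote is the list of field dicts returned
    # by self.schema(dataset_id, table_id); schema_local is the dict built
    # from the dataframe. Returns True when every local field exists remotely.
    fields_remote = set(json.dumps(field, sort_keys=True)
                        for field in schema_remote)
    fields_local = set(json.dumps(field, sort_keys=True)
                       for field in schema_local['fields'])
    return fields_remote >= fields_local
```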
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1080,6 +1082,29 @@ def test_upload_data_if_table_exists_append(self):
                    _get_project_id(), if_exists='append',
                    private_key=_get_private_key_path())

    def test_upload_subset_columns_if_table_exists_append(self):
        test_id = "16"
can you add a comment with the issue ref number (this PR number is ok)
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1275,6 +1300,18 @@ def test_verify_schema_ignores_field_mode(self):
            self.dataset_prefix + "1", TABLE_ID + test_id, test_schema_2),
            'Expected schema to match')

    def test_retrieve_schema(self):
        test_id = "15"
add a comment with the issue number
        actual = self.sut.schema(self.dataset_prefix + "1", TABLE_ID + test_id)
        expected = test_schema['fields']
        assert expected == actual, 'Expected schema used to create table'
can you add some tests that exercise schema_is_subset
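For illustration, a sketch of two such tests (field names are made up, and the self.table.create / self.sut fixtures are assumed from the neighbouring schema tests):

```python
def test_schema_is_subset_passes_if_subset(self):
    # A table created with fields A and B should accept a schema that only
    # mentions A.
    test_id = '16'
    table_name = TABLE_ID + test_id
    dataset = self.dataset_prefix + '1'
    table_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'},
                               {'name': 'B', 'type': 'FLOAT'}]}
    tested_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}]}
    self.table.create(table_name, table_schema)
    assert self.sut.schema_is_subset(dataset, table_name, tested_schema)

def test_schema_is_subset_fails_if_not_subset(self):
    # A field that does not exist in the remote table should be rejected.
    test_id = '17'
    table_name = TABLE_ID + test_id
    dataset = self.dataset_prefix + '1'
    table_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}]}
    tested_schema = {'fields': [{'name': 'C', 'type': 'FLOAT'}]}
    self.table.create(table_name, table_schema)
    assert not self.sut.schema_is_subset(dataset, table_name, tested_schema)
```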
@jreback I apologize if I missed the context for this comment. The default behaviour of the …
Thanks @mr-mcox! All unit tests passed locally. Code worked as expected in my limited testing.
Could you also update https://github.com/pydata/pandas-gbq/blob/master/docs/source/writing.rst and https://github.com/pydata/pandas-gbq/blob/master/docs/source/changelog.rst to document the improvements in this PR?
Specifically in writing.rst, the following text could be improved: "The dataframe must match the destination table in structure and data types."
Otherwise this LGTM (assuming that we also address the review comments from others, including additional tests for schema_is_subset).
What I mean is: should this be a warning? You could have a 4th (maybe default) state, like …
My initial thought is that we don't need to add another …
@parthea Regarding changing the … @jreback has a good point in making the behavior of …
Force-pushed from 54af80f to 3e693b0
ping, can you rebase and update per the comments.
@jreback Sorry - I didn't realize that you were waiting on me for something. Can you be a little more specific about what you'd like me to do? Are you asking for me to take the five commits on this PR and rebase them to one, is there a comment that I neglected to address, or something else entirely?
well needs a rebase, yes squashing is nice too (though not super necessary at this stage)
Force-pushed from f336d4a to f612f95
@jreback Alright - I've done a rebase.
@mr-mcox Thanks for the rebase. I'm really sorry for the delay in reviewing this pull request. I'm hoping to complete reviews for open pull requests this weekend.
pandas_gbq/gbq.py
Outdated
        Obtain from BigQuery the field names and field types
        for the table defined by the parameters

        :param str dataset_id: Name of the BigQuery dataset for the table
pls use the numpy-doc style (see all other doc-strings)
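For illustration, the same docstring rewritten in numpydoc style (the exact wording is a sketch, not the final text):

```python
def schema(self, dataset_id, table_id):
    """Retrieve the schema of the table.

    Obtain from BigQuery the field names and field types for the table
    defined by the parameters.

    Parameters
    ----------
    dataset_id : str
        Name of the BigQuery dataset for the table
    table_id : str
        Name of the BigQuery table

    Returns
    -------
    list of dicts
        Fields representing the schema
    """
```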
docs/source/changelog.rst
Outdated
@@ -9,6 +9,10 @@ Changelog

- All gbq errors will simply be subclasses of ``ValueError`` and no longer inherit from the deprecated ``PandasError``.

0.1.5 / 2017-04-20
move to latest (0.1.7)
the issue reference is misformatted, see the other entries
pandas_gbq/gbq.py
Outdated
        the schema passed in and indicate whether all fields in the former
        are present in the latter. Order is not considered.

        :param str dataset_id: Name of the BigQuery dataset for the table
same
pandas_gbq/gbq.py
Outdated
        the schema passed in and indicate whether a subset of the fields in
        the former are present in the latter. Order is not considered.

        :param str dataset_id: Name of the BigQuery dataset for the table
same
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1071,6 +1071,30 @@ def test_upload_data_if_table_exists_append(self):
                    _get_project_id(), if_exists='append',
                    private_key=_get_private_key_path())

    def test_upload_subset_columns_if_table_exists_append(self):
        # For pull request #24
be descriptive
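For illustration, one possible, more descriptive wording of that comment (hypothetical):

```python
def test_upload_subset_columns_if_table_exists_append(self):
    # Issue 24: attempt to append a dataframe that contains only a subset of
    # the columns in the existing BigQuery table schema.
    ...
```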
            self.destination_table + test_id, _get_project_id(),
            if_exists='append', private_key=_get_private_key_path())

        sleep(30)  # <- Curses Google!!!
pls update as @MaximilianR suggests (look at other tests to see how this is done)
Force-pushed from f612f95 to ad45010. Commit messages:
- Accidentally left a duplicate test in
- Correcting change to schema made by auto-rebase
- Fixing missing assertTrue and reversion to not checking subset on append (both from rebase)
- Replacing AssertEqual
- Shortening line to pass flake
@jreback Sorry this took so long. I did another rebase, updated the doc strings, added some description to test cases and adjusted the change log per your requested changes. The only thing I didn't do was to change …
lgtm. some minor comments.
@parthea if you'd have a look
docs/source/changelog.rst
Outdated
@@ -6,13 +6,15 @@ Changelog

- Resolve issue where the optional ``--noauth_local_webserver`` command line argument would not be propagated during the authentication process. (:issue:`35`)
- Drop support for Python 3.4 (:issue:`40`)
- When using ```to_gbq``` if ```if_exists``` is set to ```append```, dataframe needs to contain only a subset of the fields in the BigQuery schema. (:issue: `24`)
no space after :issue:`24`, and say more like:
.to_gbq(...., if_exists='append') ...the rest of what you were saying
docs/source/changelog.rst
Outdated
0.1.6 / 2017-05-03
------------------

- All gbq errors will simply be subclasses of ``ValueError`` and no longer inherit from the deprecated ``PandasError``.

0.1.4 / 2017-03-17
did you change this? there isn't a 1.5 release
@@ -557,7 +557,25 @@ def load_data(self, dataframe, dataset_id, table_id, chunksize):

        self._print("\n")

    def verify_schema(self, dataset_id, table_id, schema):
    def schema(self, dataset_id, table_id):
can you update the to_gbq doc-string as well (with an explanation of the new behavior of if_exists)?
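For illustration, a sketch of how the if_exists entry of the to_gbq docstring could describe the new behaviour (the exact wording is an assumption):

```python
def to_gbq(dataframe, destination_table, project_id, if_exists='fail', **kwargs):
    """Hypothetical docstring excerpt showing an updated ``if_exists`` entry.

    Parameters
    ----------
    if_exists : {'fail', 'replace', 'append'}, default 'fail'
        'fail': If table exists, do nothing.
        'replace': If table exists, drop it, recreate it, and insert data.
        'append': If table exists, insert data. The dataframe only needs to
        contain a subset of the fields in the existing table's schema.
    """
```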
            self.destination_table + test_id, _get_project_id(),
            if_exists='append', private_key=_get_private_key_path())

        sleep(30)  # <- Curses Google!!!
@MaximilianR hmm, wait_for_job must be old...
This is …
@MaximilianR thanks! @mr-mcox if you want to add that as a utility function, ok with that. pls run tests according to instructions and show the branch where they are passing. (also ok to do this in another PR)
@jreback Good ideas on the updates to the changelog. You are correct that I removed 0.1.5 - this PR was the only change under that release and since there isn't 0.1.5 in the changelog on the master branch, I figured that it never happened and removed it entirely. I added …
Wait a sec - another build shows that I messed up calling …
This reverts commit 8726a01.
@jreback I'm reverting that …
Right, that's a good point: …
@parthea if you'd have a look
LGTM. Thanks @mr-mcox!
I just have minor editorial changes before merging.
Purpose
Current behavior of to_gbq is to fail if the schema of the new data is not equivalent to the current schema. However, this means that the load fails if the new data is missing columns that are present in the current schema. For instance, this may occur when the data source I am using to construct the dataframe only provides non-empty values. Rather than determining the current schema of the GBQ table and adding empty columns to my dataframe, I would like to_gbq to load my data if the columns in the dataframe are a subset of the current schema.

Primary changes made

- Factored a schema function out of verify_schema so that it supports both verify_schema and schema_is_subset
- schema_is_subset determines whether local_schema is a subset of remote_schema
- When appending, schema_is_subset is used rather than verify_schema to determine if the data can be loaded

Auxiliary changes made

- PROJECT_ID etc. are retrieved from an environment variable to facilitate local testing
- Ran test_gbq through autopep8, which added a row after two class names
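For illustration, a minimal usage sketch of the behaviour this PR enables (table, dataset and project names are hypothetical):

```python
import pandas as pd
import pandas_gbq

# Suppose my_dataset.my_table already exists with columns A (FLOAT) and
# B (STRING). With this change, a dataframe that only provides column A can
# still be appended; B is left NULL for the new rows.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0]})

pandas_gbq.to_gbq(df, 'my_dataset.my_table', project_id='my-project',
                  if_exists='append')
```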