When appending to a table, load if the dataframe contains a subset of the existing schema #24
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #24       +/-   ##
===========================================
- Coverage   74.18%   28.93%   -45.26%
===========================================
  Files           4        4
  Lines        1511     1552       +41
===========================================
- Hits         1121      449      -672
- Misses        390     1103      +713
Continue to review full report at Codecov.
pandas_gbq/tests/test_gbq.py
Outdated
PROJECT_ID = None
PRIVATE_KEY_JSON_PATH = None
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('PROJECT_ID')
I think the Google library pulls from GOOGLE_CLOUD_PROJECT (more confident about the newer library, which this package doesn't use yet). So you could either replicate that variable, or pass None through and the Google library should pick it up.
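For illustration, a minimal sketch of how the test configuration could combine both suggestions (the fallback order here is an assumption, not what the PR currently does):

```python
import os

# Hypothetical sketch: prefer the package-specific variables and fall back to
# the standard Google ones that the client libraries already understand.
PROJECT_ID = os.getenv('PROJECT_ID') or os.getenv('GOOGLE_CLOUD_PROJECT')
PRIVATE_KEY_JSON_PATH = (os.getenv('PRIVATE_KEY_JSON_PATH')
                         or os.getenv('GOOGLE_APPLICATION_CREDENTIALS'))
# There is no standard Google variable for inline key contents (see the note
# below), so that one stays project-specific.
PRIVATE_KEY_JSON_CONTENTS = os.getenv('PRIVATE_KEY_JSON_CONTENTS')
```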
pandas_gbq/tests/test_gbq.py
Outdated
PRIVATE_KEY_JSON_PATH = None
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('PROJECT_ID')
PRIVATE_KEY_JSON_PATH = os.getenv('PRIVATE_KEY_JSON_PATH')
Same as above for GOOGLE_APPLICATION_CREDENTIALS
pandas_gbq/tests/test_gbq.py
Outdated
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('PROJECT_ID')
PRIVATE_KEY_JSON_PATH = os.getenv('PRIVATE_KEY_JSON_PATH')
PRIVATE_KEY_JSON_CONTENTS = os.getenv('PRIVATE_KEY_JSON_CONTENTS')
(Contents doesn't have a Google version, so we do need to choose our own for that)
            self.destination_table + test_id, _get_project_id(),
            if_exists='append', private_key=_get_private_key_path())

        sleep(30)  # <- Curses Google!!!
We have a function wait_for_job which we copied from the Google docs; I'll PR it in down the road (it relies on google.cloud.bigquery, so we need to move some other code onto that).
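For reference, the helper from the Google docs looks roughly like this (a sketch assuming the older google.cloud.bigquery job API, where job objects expose reload(), state, error_result and errors):

```python
import time

def wait_for_job(job):
    """Poll a BigQuery job until it finishes, raising on failure."""
    while True:
        job.reload()  # refresh the job state via a GET request
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
```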
pls update as @MaximilianR suggests (look at other tests to see how this is done)
@MaximilianR hmm, wait_for_job must be old...
pandas_gbq/tests/test_gbq.py
Outdated
PROJECT_ID = None
PRIVATE_KEY_JSON_PATH = None
PRIVATE_KEY_JSON_CONTENTS = None
PROJECT_ID = os.getenv('GOOGLE_CLOUD_PROJECT')
https://pandas-gbq.readthedocs.io/en/latest/contributing.html are the docs for how to test this
I am ok with these environment variables (they are just for local testing). I agree with @jreback that it is also necessary to follow the steps in the contributing docs to configure Travis-CI to run on your pandas-gbq fork. You only need to configure Travis-CI once. After that, unit tests run automatically for your pandas-gbq fork after each push.
is there any reason to change …
pandas_gbq/gbq.py
Outdated
@@ -558,7 +558,9 @@ def load_data(self, dataframe, dataset_id, table_id, chunksize):

        self._print("\n")

    def verify_schema(self, dataset_id, table_id, schema):
    def schema(self, dataset_id, table_id):
        """Retrieve the schema of the table"""
can you add a doc-string
pandas_gbq/gbq.py
Outdated
        except HttpError as ex:
            self.process_http_error(ex)

    def verify_schema(self, dataset_id, table_id, schema):
        fields_remote = set([json.dumps(field)
same here
pandas_gbq/gbq.py
Outdated
        return fields_remote == fields_local

    def schema_is_subset(self, dataset_id, table_id, schema):
        fields_remote = set([json.dumps(field)
and here
pandas_gbq/gbq.py
Outdated
                             for field in self.schema(dataset_id, table_id)])
        fields_local = set(json.dumps(field_local)
                           for field_local in schema['fields'])
these routines should prob just take the result of self.schema(dataset_id, table_id), yes?
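For illustration, a sketch of that refactor (hypothetical signature: the comparison helper receives the already-fetched remote schema instead of dataset_id/table_id):

```python
import json

def schema_is_subset(schema_remote, schema_local):
    # Hypothetical sketch: schema_remote is the list of field dicts returned
    # by self.schema(dataset_id, table_id); schema_local is the dict built
    # from the dataframe. Returns True when every local field exists remotely.
    fields_remote = set(json.dumps(field, sort_keys=True)
                        for field in schema_remote)
    fields_local = set(json.dumps(field, sort_keys=True)
                       for field in schema_local['fields'])
    return fields_remote >= fields_local
```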
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1080,6 +1082,29 @@ def test_upload_data_if_table_exists_append(self):
                    _get_project_id(), if_exists='append',
                    private_key=_get_private_key_path())

    def test_upload_subset_columns_if_table_exists_append(self):
        test_id = "16"
can you add a comment with the issue ref number (this PR number is ok)
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1275,6 +1300,18 @@ def test_verify_schema_ignores_field_mode(self):
            self.dataset_prefix + "1", TABLE_ID + test_id, test_schema_2),
            'Expected schema to match')

    def test_retrieve_schema(self):
        test_id = "15"
add a comment with the issue number
        actual = self.sut.schema(self.dataset_prefix + "1", TABLE_ID + test_id)
        expected = test_schema['fields']
        assert expected == actual, 'Expected schema used to create table'
can you add some tests that exercise schema_is_subset
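For illustration, a sketch of two such tests (field names are made up, and the self.table.create / self.sut fixtures are assumed from the neighbouring schema tests):

```python
def test_schema_is_subset_passes_if_subset(self):
    # A table created with fields A and B should accept a schema that only
    # mentions A.
    test_id = '16'
    table_name = TABLE_ID + test_id
    dataset = self.dataset_prefix + '1'
    table_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'},
                               {'name': 'B', 'type': 'FLOAT'}]}
    tested_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}]}
    self.table.create(table_name, table_schema)
    assert self.sut.schema_is_subset(dataset, table_name, tested_schema)

def test_schema_is_subset_fails_if_not_subset(self):
    # A field that does not exist in the remote table should be rejected.
    test_id = '17'
    table_name = TABLE_ID + test_id
    dataset = self.dataset_prefix + '1'
    table_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}]}
    tested_schema = {'fields': [{'name': 'C', 'type': 'FLOAT'}]}
    self.table.create(table_name, table_schema)
    assert not self.sut.schema_is_subset(dataset, table_name, tested_schema)
```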
@jreback I apologize if I missed the context for this comment. The default behaviour of the …
Thanks @mr-mcox! All unit tests passed locally. Code worked as expected in my limited testing.
Could you also update https://github.com/pydata/pandas-gbq/blob/master/docs/source/writing.rst and https://github.com/pydata/pandas-gbq/blob/master/docs/source/changelog.rst to document the improvements in this PR?
Specifically in writing.rst, the following text could be improved: "The dataframe must match the destination table in structure and data types."
Otherwise this LGTM (assuming that we also address the review comments from others, including additional tests for schema_is_subset).
What I mean is: should this be a warning? You could have a 4th (maybe default) state, like …
My initial thought is that we don't need to add another …
@parthea Regarding changing the … @jreback has a good point in making the behavior of …
Force-pushed from 54af80f to 3e693b0
ping, can you rebase and update per the comments.
@jreback Sorry - I didn't realize that you were waiting on me for something. Can you be a little more specific about what you'd like me to do? Are you asking for me to take the five commits on this PR and rebase them to one, is there a comment that I neglected to address, or something else entirely?
well needs a rebase, yes squashing is nice too (though not super necessary at this stage)
Force-pushed from f336d4a to f612f95
@jreback Alright - I've done a rebase.
@mr-mcox Thanks for the rebase. I'm really sorry for the delay in reviewing this pull request. I'm hoping to complete reviews for open pull requests this weekend.
pandas_gbq/gbq.py
Outdated
        Obtain from BigQuery the field names and field types
        for the table defined by the parameters

        :param str dataset_id: Name of the BigQuery dataset for the table
pls use the numpy-doc style (see all other doc-strings)
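For illustration, the same docstring rewritten in numpydoc style (the exact wording is a sketch, not the final text):

```python
def schema(self, dataset_id, table_id):
    """Retrieve the schema of the table.

    Obtain from BigQuery the field names and field types for the table
    defined by the parameters.

    Parameters
    ----------
    dataset_id : str
        Name of the BigQuery dataset for the table
    table_id : str
        Name of the BigQuery table

    Returns
    -------
    list of dicts
        Fields representing the schema
    """
```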
docs/source/changelog.rst
Outdated
@@ -9,6 +9,10 @@ Changelog

- All gbq errors will simply be subclasses of ``ValueError`` and no longer inherit from the deprecated ``PandasError``.

0.1.5 / 2017-04-20
move to latest (0.1.7)
the issue reference is misformatted, see the other entries
pandas_gbq/gbq.py
Outdated
        the schema passed in and indicate whether all fields in the former
        are present in the latter. Order is not considered.

        :param str dataset_id: Name of the BigQuery dataset for the table
same
pandas_gbq/gbq.py
Outdated
        the schema passed in and indicate whether a subset of the fields in
        the former are present in the latter. Order is not considered.

        :param str dataset_id: Name of the BigQuery dataset for the table
same
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1071,6 +1071,30 @@ def test_upload_data_if_table_exists_append(self):
                    _get_project_id(), if_exists='append',
                    private_key=_get_private_key_path())

    def test_upload_subset_columns_if_table_exists_append(self):
        # For pull request #24
be descriptive
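For illustration, one possible, more descriptive wording of that comment (hypothetical):

```python
def test_upload_subset_columns_if_table_exists_append(self):
    # Issue 24: attempt to append a dataframe that contains only a subset of
    # the columns in the existing BigQuery table schema.
    ...
```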
            self.destination_table + test_id, _get_project_id(),
            if_exists='append', private_key=_get_private_key_path())

        sleep(30)  # <- Curses Google!!!
pls update as @MaximilianR suggests (look at other tests to see how this is done)
Force-pushed from f612f95 to ad45010. Commit messages:
- Accidentally left a duplicate test in
- Correcting change to schema made by auto-rebase
- Fixing missing assertTrue and reversion to not checking subset on append (both from rebase)
- Replacing AssertEqual
- Shortening line to pass flake
@jreback Sorry this took so long. I did another rebase, updated the doc strings, added some description to test cases and adjusted the change log per your requested changes. The only thing I didn't do was to change …
lgtm. some minor comments.
@parthea if you'd have a look
docs/source/changelog.rst
Outdated
@@ -6,13 +6,15 @@ Changelog

- Resolve issue where the optional ``--noauth_local_webserver`` command line argument would not be propagated during the authentication process. (:issue:`35`)
- Drop support for Python 3.4 (:issue:`40`)
- When using ```to_gbq``` if ```if_exists``` is set to ```append```, dataframe needs to contain only a subset of the fields in the BigQuery schema. (:issue: `24`)
no space after :issue:`24`, and say more like:
.to_gbq(...., if_exists='append') ...the rest of what you were saying
docs/source/changelog.rst
Outdated
0.1.6 / 2017-05-03
------------------

- All gbq errors will simply be subclasses of ``ValueError`` and no longer inherit from the deprecated ``PandasError``.

0.1.4 / 2017-03-17
did you change this? there isn't a 1.5 release
@@ -557,7 +557,25 @@ def load_data(self, dataframe, dataset_id, table_id, chunksize):

        self._print("\n")

    def verify_schema(self, dataset_id, table_id, schema):
    def schema(self, dataset_id, table_id):
can you update the to_gbq doc-string as well (with an explanation of the new behavior of if_exists)?
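For illustration, a sketch of how the if_exists entry of the to_gbq docstring could describe the new behaviour (the exact wording is an assumption):

```python
def to_gbq(dataframe, destination_table, project_id, if_exists='fail', **kwargs):
    """Hypothetical docstring excerpt showing an updated ``if_exists`` entry.

    Parameters
    ----------
    if_exists : {'fail', 'replace', 'append'}, default 'fail'
        'fail': If table exists, do nothing.
        'replace': If table exists, drop it, recreate it, and insert data.
        'append': If table exists, insert data. The dataframe only needs to
        contain a subset of the fields in the existing table's schema.
    """
```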
            self.destination_table + test_id, _get_project_id(),
            if_exists='append', private_key=_get_private_key_path())

        sleep(30)  # <- Curses Google!!!
@MaximilianR hmm, wait_for_job must be old...
This is …
@MaximilianR thanks! @mr-mcox if you want to add that as a utility function, ok with that. pls run tests according to instructions and show the branch where they are passing. (also ok to do this in another PR)
@jreback Good ideas on the updates to the changelog. You are correct that I removed 0.1.5 - this PR was the only change under that release and since there isn't 0.1.5 in the changelog on the master branch, I figured that it never happened and removed it entirely. I added …
Wait a sec - another build shows that I messed up calling …
This reverts commit 8726a01.
@jreback I'm reverting that …
Right, that's a good point: …
@parthea if you'd have a look
LGTM. Thanks @mr-mcox!
I just have minor editorial changes before merging.
Purpose
Current behavior of to_gbq is to fail if the schema of the new data is not equivalent to the current schema. However, this means that the load fails if the new data is missing columns that are present in the current schema. For instance, this may occur when the data source I am using to construct the dataframe only provides non-empty values. Rather than determining the current schema of the GBQ table and adding empty columns to my dataframe, I would like to_gbq to load my data if the columns in the dataframe are a subset of the current schema.

Primary changes made

- Factored a schema function out of verify_schema so that it supports both verify_schema and schema_is_subset
- schema_is_subset determines whether local_schema is a subset of remote_schema
- When appending, schema_is_subset is used rather than verify_schema to determine if the data can be loaded

Auxiliary changes made

- PROJECT_ID etc. are retrieved from an environment variable to facilitate local testing
- Ran test_gbq through autopep8, which added a row after two class names
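For illustration, a minimal usage sketch of the behaviour this PR enables (table, dataset and project names are hypothetical):

```python
import pandas as pd
import pandas_gbq

# Suppose my_dataset.my_table already exists with columns A (FLOAT) and
# B (STRING). With this change, a dataframe that only provides column A can
# still be appended; B is left NULL for the new rows.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0]})

pandas_gbq.to_gbq(df, 'my_dataset.my_table', project_id='my-project',
                  if_exists='append')
```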