Add option for limiting rows of retrieved results #102
Adding the […]
Oh, I didn't know about datalab. Would work on an IPython magic be done in this repository, or in a separate standalone one? It would introduce a dependency on IPython.
You can do it in this repo; it's a separate piece, so the dep is fine (and only impacts the magic piece).
Yeah, could be an optional dependency. @bburky: My coworker @alixhami has started some work on making a `%%bigquery` magic.
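A minimal sketch of that optional-dependency pattern, where a guarded import means only the magic piece needs IPython; the `_register_magics` helper name is hypothetical:

```python
# The import is guarded so that installing the package without IPython
# never fails at import time; only the magic itself becomes unavailable.
try:
    from IPython.core.magic import register_cell_magic
except ImportError:
    register_cell_magic = None


def _register_magics():
    """Hook up the cell magic only when IPython is actually present."""
    if register_cell_magic is None:
        raise ImportError("The %%bigquery magic requires IPython to be installed.")
    # ... define and register the magic here ...
```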
Follow-up for magics: the […]
I think we can close this - it's open because of the magics follow-up. Re limiting rows - that's very easy to do with a `LIMIT` clause. Reopen if anyone disagrees.
The issue was opened with the thought that you could do a query with a lot of results and write to a destination table, but only want to sample the results.
Limiting the maximum results via […]. Or @bburky, did you not want a representative sample, more just a preview?
Ah, preview makes sense. Sorry for being overzealous.
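For reference, a minimal sketch of the `LIMIT` approach discussed above, assuming `pandas_gbq.read_gbq()` with placeholder project and table names; note that a `LIMIT` also caps what gets written to any destination table, which is why a preview was wanted instead:

```python
import pandas_gbq

# Capping results in the query itself: simple, but it limits the query's
# output everywhere, including the destination table, not just the download.
df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.my_dataset.my_table` LIMIT 100",  # placeholders
    project_id="my-project",  # placeholder
    dialect="standard",
)
```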
Yes. I was running a query that returned many, many results and saving the results with `query.destinationTable`. To sanity-check the query I ran, I wanted to see a 100-row preview or so.

A possible implementation is just `itertools.islice()` of the query results. I don't mind if this isn't completely built into pandas-gbq, but there wasn't any way for me to hook or modify anything except copy-pasting and modifying `run_query()`.
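A minimal sketch of that `itertools.islice()` idea, where `rows` stands in for the iterator that `run_query()` currently converts to a list; the `take_preview` and `max_results` names are hypothetical:

```python
import itertools


def take_preview(rows, max_results=None):
    """Materialize at most max_results rows from an iterator.

    With max_results=None this behaves like list(rows), i.e. the
    current download-everything behavior.
    """
    if max_results is None:
        return list(rows)
    return list(itertools.islice(rows, max_results))


# Example with a stand-in iterator in place of real query results:
preview = take_preview(iter(range(1000)), max_results=100)
assert len(preview) == 100
```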
I found the pandas-gbq interface easy to use and wanted to also use it for creating tables in BigQuery, not just downloading all the results at once. The existing capabilities of `read_gbq()` are actually already sufficient to do this, because you can just set `query.destinationTable` in the job configuration. However, I would like to limit the number of retrieved rows to a small sample of the whole table that was created, instead of downloading the many thousands of rows that were created.

I've already played with making the changes myself in a project I'm working on:
http://nbviewer.jupyter.org/github/bburky/subredditgenderratios/blob/master/Subreddit%20Gender%20Ratios.ipynb
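For context, a minimal sketch of the destination-table setup described above, assuming a pandas-gbq version whose `read_gbq()` accepts a `configuration` dict; the project, dataset, and table names are placeholders:

```python
import pandas_gbq

# Job configuration mirroring the BigQuery REST API: write the full query
# results to a destination table rather than only streaming them back.
config = {
    "query": {
        "destinationTable": {
            "projectId": "my-project",   # placeholder
            "datasetId": "my_dataset",   # placeholder
            "tableId": "query_results",  # placeholder
        },
        "writeDisposition": "WRITE_TRUNCATE",
    }
}

df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.my_dataset.source_table`",  # placeholder query
    project_id="my-project",  # placeholder
    dialect="standard",
    configuration=config,
)
```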
In the current code for `run_query()`, you read all rows from the table by converting the iterator into a list. Instead, you could pass the iterator to `itertools.islice()` first to cap it at a configurable limit. You can look at my code to see how it could be done.

Also, if you're interested, I could contribute the IPython `%%bigquery` cell magic I am using in that project. It should be a very simple wrapper around `read_gbq()`.
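A rough sketch of what such a wrapper might look like, to be run inside an IPython or Jupyter session; this is a guess at the shape, not the notebook's actual code:

```python
from IPython.core.magic import register_cell_magic

import pandas_gbq


@register_cell_magic
def bigquery(line, cell):
    """Run the cell body as a BigQuery query and return a DataFrame.

    An optional project id can be given on the magic line, e.g.:

        %%bigquery my-project
        SELECT 1
    """
    project_id = line.strip() or None
    return pandas_gbq.read_gbq(cell, project_id=project_id)
```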