BigQuery usage docs do not explain references #4930

Closed
inglesp opened this issue Feb 25, 2018 · 10 comments
Assignees
Labels
api: bigquery Issues related to the BigQuery API. type: cleanup An internal cleanup or hygiene concern.

Comments

@inglesp

inglesp commented Feb 25, 2018

The BigQuery usage docs do not explain what TableReferences are, when/why you'd need to use one instead of a Table, and how to get a TableReference from a Table or vice versa.

(And similarly for DatasetReferences and Datasets.)

@chemelnucfin chemelnucfin added documentation api: bigquery Issues related to the BigQuery API. type: cleanup An internal cleanup or hygiene concern. labels Feb 26, 2018
@chemelnucfin chemelnucfin self-assigned this Feb 26, 2018
@tswast
Contributor

tswast commented Feb 26, 2018

Yes, this could use some additional explanation in the docs.

The main purpose of TableReference and DatasetReference is to indicate that the object is only a pointer to a Table or Dataset, not the full resource. Several properties in the REST API only accept or return a pointer to a table, such as QueryJob.destination.

It is possible to go from a Table to a TableReference with Table.reference. Likewise, to go from a Dataset to a DatasetReference with Dataset.reference.

To go from a reference to a full object, use the client to fetch the full object from the API with get_table() or get_dataset(). If a Table or Dataset does not exist (for example, you want to create one with create_table or create_dataset), the Table and Dataset constructors accept a reference as their argument.

@tswast
Contributor

tswast commented Feb 26, 2018

Note: the usage docs do have examples for create_table(), get_table(), create_dataset() and get_dataset().

I agree that examples using the table.reference and dataset.reference properties would be helpful.

@inglesp
Author

inglesp commented Feb 27, 2018

Thanks for your comments here. It'd be really helpful to have this in the documentation!

@max-sixty

I did find this very confusing. For example, client.dataset(dataset_name) returns a DatasetReference, in spite of its name.

Is this something we're coupled to because of the REST API? Could we at least add options to supply strings, so bigquery.Dataset(dataset_name) returned a dataset?

Overall, the API is extremely class-heavy for a Python library. A recent frustrating example: client.list_datasets() doesn't return a list, or even a generator; it returns a google.api_core.page_iterator.HTTPIterator (though if you try to use it as an iterator you get TypeError: HTTPIterator object is not an iterator!)

@tswast
Contributor

tswast commented Apr 21, 2018

> client.dataset(dataset_name) returns a DatasetReference, in spite of its name

It's funny you mention that method. It's probably the only thing that didn't change in the 0.27 to 0.28 rewrite. In 0.27 and earlier, the dataset() method returned a Dataset class but it was really just a reference. Confusingly, even though it was a Dataset none of the properties were populated besides the ID!

> Could we at least add options to supply strings, so bigquery.Dataset(dataset_name) returned a dataset?

In my first version of the rewrite, I proposed exactly this (allowing either a string or a reference), but the number of combinations exploded pretty fast. Some folks on the Datalab / Colab teams gave me feedback that only allowing references would greatly simplify the implementation (which I agree it did).

For example, one trouble with bigquery.Dataset(dataset_name) is that you don't have a project associated with the dataset, because only the client has that info. This would require the API to have hooks to handle partial references that get filled with defaults, anywhere the current API can just use the full path from the reference.

Also, yes we are slightly tied to having reference objects because of the REST API. For example, there are 16 instances of TableReference in the Jobs resource alone. https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs In 0.27, the Python API again pretended these were full Table objects but they only had the ID fields populated, which I thought was even more confusing.

> Overall, the API is extremely class-heavy for a python library. A recent frustrating example was client.list_datasets() doesn't return a list, or even a generator, it returns a google.api_core.page_iterator.HTTPIterator

True, it's not an iterator. But it is an iterable. For example list(client.list_datasets()) does build a list of values from all pages of the API response.
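To illustrate the iterable-versus-iterator distinction in plain Python, here is a small sketch. PagedResults is a hypothetical stand-in written for this example, not the real HTTPIterator class; it only mimics the property that matters here, namely defining `__iter__` without `__next__`:

```python
class PagedResults:
    """Hypothetical paged result: an iterable, but not an iterator."""

    def __init__(self, pages):
        self._pages = pages

    def __iter__(self):
        # Each call returns a fresh generator that walks every page in order.
        for page in self._pages:
            yield from page


results = PagedResults([["dataset_a", "dataset_b"], ["dataset_c"]])

# next() needs an iterator, so calling it on the iterable itself fails.
try:
    next(results)
    raised = False
except TypeError:
    raised = True

# Wrapping with iter() first yields an iterator that next() accepts.
first = next(iter(results))

# list() calls iter() internally, so it consumes all pages.
all_items = list(results)
```

In the same way, wrapping an iterable in iter() before calling next() should avoid the TypeError mentioned earlier.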

@max-sixty

max-sixty commented Apr 21, 2018

Thanks for the reply @tswast

I still don't understand what a DatasetReference has over a 'project.dataset' string - looking through the implementation the only advantage I can see is creating a table from the object. Is it that it could be project:dataset or dataset-with-implicit-project, and we don't want to support all the permutations?

It sounds like you've thought about it a lot, so I pause in humility. But I remain confused about why (I think) the following is currently required to create a dataset, when I would have expected client.create_dataset(name):

dataset = client.create_dataset(
    bigquery.Dataset(
        client.dataset(dataset_name)
    )
)

> it's not an iterator. But it is an iterable

Yes that's fair, and my original comment probably wasn't balanced. Still, calling next(dataset_list) and getting an error isn't ideal, even though it's minor.

@tswast
Contributor

tswast commented Apr 21, 2018

Honestly since the REST API separates everything out like

{"projectId": "my-project", "datasetId": "my_dataset"}

it hadn't crossed my mind to accept a fully-qualified dataset ID. I would be open to a PR that modifies Dataset to accept strings like "project.dataset" as an alternative to a DatasetReference.
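A rough sketch of what splitting such a string into the REST shape could look like. The parse_dataset_id helper and its default_project parameter are hypothetical, written only for illustration; it does not handle domain-scoped project IDs that themselves contain dots or colons:

```python
def parse_dataset_id(dataset_id, default_project=None):
    """Hypothetical helper: turn 'project.dataset' into the REST representation."""
    parts = dataset_id.split(".")
    if len(parts) == 2:
        project, dataset = parts
    elif len(parts) == 1 and default_project is not None:
        # Partial reference: fill in the project from a default,
        # e.g. the project configured on the client.
        project, dataset = default_project, parts[0]
    else:
        raise ValueError(
            "expected 'project.dataset', or 'dataset' with a default project; "
            "got {!r}".format(dataset_id)
        )
    return {"projectId": project, "datasetId": dataset}


full = parse_dataset_id("my-project.my_dataset")
partial = parse_dataset_id("my_dataset", default_project="my-project")
```

The second call shows the "partial reference filled with defaults" case discussed earlier, which is where the extra combinations come from.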

@tswast
Contributor

tswast commented Apr 27, 2018

Re: my previous comment.

I've sent #5255 to add Dataset/Table.from_string(fully_qualified_id), which I think will address the concern that it requires too many objects to create a table/dataset.

@tseaver
Contributor

tseaver commented May 29, 2018

@tswast With #5255 merged, should this issue be closed?
