
Autodetect feature is not available in BigQuery client #2926

Closed
teiuaz opened this issue Jan 9, 2017 · 22 comments
Assignees
Labels
api: bigquery Issues related to the BigQuery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification. Not an issue.

Comments

@teiuaz

teiuaz commented Jan 9, 2017

The autodetect feature works only if the body of the request contains a schema field.
For example, this raw job request works fine:

job = {
    'projectId': PROJECT_ID,
    'configuration': {
        'load': {
            'autodetect': True,  # works without it, by default autodetect = True 
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': self.dataset,
                'tableId': self.table,
            },
            'sourceUris': [self.path],
            'writeDisposition': self.write_disposition
        }
    }
}

The task fails if a schema path is not specified in the task params.

Maybe this code in the client (job.py) behaves unexpectedly:

resource = {
    'jobReference': {
        'projectId': self.project,
        'jobId': self.name,
    },
    'configuration': {
        self._JOB_TYPE: {
            'sourceUris': self.source_uris,
            'destinationTable': {
                'projectId': self.destination.project,
                'datasetId': self.destination.dataset_name,
                'tableId': self.destination.name,
            },
        },
    },
}
configuration = resource['configuration'][self._JOB_TYPE]
self._populate_config_resource(configuration)

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}
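
In other words, the client only emits a schema key when fields were supplied, and never emits autodetect at all. The shape of the fix might look roughly like this (a hypothetical sketch; the helper name and signature are my own, not the library's):

```python
# Hypothetical sketch: conditionally include 'autodetect' in the load
# configuration, the same way the client conditionally includes 'schema'.
# The helper name and signature are illustrative, not part of the library.

def build_load_config(source_uris, autodetect=False, schema_fields=None):
    config = {'sourceUris': source_uris}
    if autodetect:
        config['autodetect'] = True
    if schema_fields:  # only emit 'schema' when fields were actually given
        config['schema'] = {'fields': schema_fields}
    return config
```

With `autodetect=True` and no schema fields, the body then matches the raw request shown above.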
@teiuaz teiuaz changed the title autodetect feature is not available in BigQuery client Autodetect feature is not available in BigQuery client Jan 9, 2017
@daspecster daspecster added the api: bigquery Issues related to the BigQuery API. label Jan 9, 2017
@danoscarmike danoscarmike added type: question Request for information or clarification. Not an issue. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 28, 2017
@lukesneeringer
Contributor

@ivvory Thank you for your submission.
Unfortunately, I am unsure what it is you want us to change. Would you be willing to rephrase?

@teiuaz
Author

teiuaz commented Mar 1, 2017

@lukesneeringer Sorry for the poor explanation. Let me try again. The problem is that when I tried to use the autodetect feature to upload data without a schema, it did not work correctly. I tried to trigger autodetect by passing an empty schema, because I could not find a way to use it through the client directly. I then decided to submit jobs myself, specifying the params manually. When I passed autodetect=True along with an empty fields param (like fields=None or fields=''), I got an error. But when I omitted the empty fields param entirely, the job completed successfully.

Maybe the code below from the client adds the fields param to the job in any case.

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}

P.S. Thanks for autodetect feature in gc web interface :)

@lukesneeringer
Contributor

@ivvory Thanks, that makes sense. Thank you.

@lukesneeringer lukesneeringer self-assigned this Mar 1, 2017
@max-sixty

I'm trying to parse through the posts above.

Is there a way to use autodetect with the google-cloud-bigquery library at the moment?

@dhermes
Contributor

dhermes commented May 1, 2017

@MaximilianR We don't have any explicit support for it (yet).

@tswast AFAICT autodetect only shows up in

  • Job.configuration.load.autodetect
  • Job.configuration.query.tableDefinitions.autodetect
  • Table.externalDataConfiguration.autodetect

The relevant methods for autodetect at creation time seem to be

  • bigquery.jobs.insert
  • bigquery.tables.insert

Is this correct? (I based this off the discovery doc.)

@tswast
Contributor

tswast commented May 1, 2017

I believe that is correct.

There are two times when you would use the autodetect feature: creating a new table from loaded data and making an external table definition. https://cloud.google.com/bigquery/external-table-definition
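
For reference, the two request bodies that accept autodetect would look roughly like this (a sketch based on the discovery doc; all project, dataset, table IDs, and URIs below are placeholders):

```python
# Sketch of the two places 'autodetect' appears, per the discovery doc.
# All IDs and URIs below are placeholders.

# 1. bigquery.jobs.insert -- load job with schema auto-detection
load_job_body = {
    'configuration': {
        'load': {
            'autodetect': True,
            'sourceUris': ['gs://bucket/data.csv'],
            'destinationTable': {
                'projectId': 'PROJECT_ID',
                'datasetId': 'DATASET_ID',
                'tableId': 'TABLE_ID',
            },
        },
    },
}

# 2. bigquery.tables.insert -- external (federated) table definition
external_table_body = {
    'tableReference': {
        'projectId': 'PROJECT_ID',
        'datasetId': 'DATASET_ID',
        'tableId': 'TABLE_ID',
    },
    'externalDataConfiguration': {
        'autodetect': True,
        'sourceFormat': 'CSV',
        'sourceUris': ['gs://bucket/data.csv'],
    },
}
```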

@teiuaz
Author

teiuaz commented May 4, 2017

@MaximilianR maybe you are interested in some notes about using autodetect.

For now, you can use autodetect simply by submitting and polling the job yourself:

from time import sleep


def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)


jobs = GbqConnector(project_id=project_id).service.jobs()
params = {'configuration': {'load': {}}}  # fill in destinationTable, sourceUris, etc.

# use autodetect if needed
params['configuration']['load']['autodetect'] = True

response = jobs.insert(projectId=project_id, body=params).execute()
job_id = response['jobReference']['jobId']

result = poll_job(jobs, job_id)

Sometimes you want to append data with new columns that are absent from the existing table. That is a more common case when using autodetect than a manual schema. So you may want the schemaUpdateOptions param with the new 'ALLOW_FIELD_ADDITION' option, which allows fields missing from the target table to be added:

    if write_disposition == 'WRITE_APPEND':
        params['configuration']['load']['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']

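
Putting those two settings together, building the load configuration might be sketched like this (the helper name is my own, not part of any library):

```python
# Sketch: combine autodetect with ALLOW_FIELD_ADDITION when appending.
# The helper name is illustrative, not part of any library.

def build_append_load_config(write_disposition):
    load = {'autodetect': True, 'writeDisposition': write_disposition}
    if write_disposition == 'WRITE_APPEND':
        # let autodetected columns missing from the target table be added
        load['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
    return {'configuration': {'load': load}}
```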
One more problem arises from combining these features. I guess the workflow is the following: if you have an existing table and load data with autodetect, first the schema is detected from the data and then compared with the schema of the existing table. That causes some errors, since sometimes you don't know what type a column has in the current data. For example, if all values of some field in all processed rows are null, autodetect recognizes it as a string type (not sure exactly), while in the existing table the field actually has another type. The simplest workaround that can help:

def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        # check for this situation
        if 'errorResult' in status:
            if str(status['errorResult']['message']).startswith('Invalid schema update'):
                return 'Autodetect failed'
            else:
                raise Exception(str(status))

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)
 
# param declaring...
result = poll_job(jobs, job_id)

if result == 'Autodetect failed':
    del params['configuration']['load']['autodetect']
    del params['configuration']['load']['schemaUpdateOptions']

    response = jobs.insert(projectId=project_id, body=params).execute()
    job_id = response['jobReference']['jobId']
    poll_job(jobs, job_id)

One flexible option related to autodetect is unavailable: you cannot create a table without loading data. So I had to write my own schema generator, copying the workflow of the autodetect option :)
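
Such a client-side schema generator can be sketched like this (the type mapping is a simplification and the function names are my own; real autodetect samples the data server-side):

```python
# A minimal sketch of a client-side schema generator that mimics what
# autodetect does server-side. Function names and the type mapping are
# illustrative simplifications, not part of the BigQuery client.

def infer_field_type(value):
    """Map a Python value to a BigQuery type name (simplified)."""
    if isinstance(value, bool):  # check bool before int (bool is an int subclass)
        return 'BOOLEAN'
    if isinstance(value, int):
        return 'INTEGER'
    if isinstance(value, float):
        return 'FLOAT'
    return 'STRING'


def infer_schema(rows):
    """Build a schema resource ({'fields': [...]}) from sample row dicts."""
    types = {}  # field name -> detected type, or None while only nulls seen
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, None)
            if types[name] is None and value is not None:
                types[name] = infer_field_type(value)
    # like autodetect, fall back to STRING for all-null columns
    return {'fields': [{'name': n, 'type': t or 'STRING', 'mode': 'NULLABLE'}
                       for n, t in sorted(types.items())]}
```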

Hope you found that helpful.

@gyangscrf

Where is the GbqConnector module from?

@tswast
Contributor

tswast commented Jun 26, 2017

I believe GbqConnector is from the Pandas library, not this one. https://github.com/pydata/pandas-gbq/blob/4b04174ddc61e6490dd07f8e6b758463d82206df/pandas_gbq/gbq.py#L193

@snarfed

snarfed commented Jun 29, 2017

UPDATE: the code below doesn't work. try this code instead!

here's workaround code that works with the current released google-cloud-python (0.24.0). the key part is passing {'configuration': {'load': {'autodetect': True}}} to LoadTableFromStorageJob.from_api_repr().

from google.cloud.bigquery import Client
from google.cloud.bigquery.job import LoadTableFromStorageJob
from google.cloud.bigquery.table import Table, Dataset

client = Client(...)
job = LoadTableFromStorageJob.from_api_repr({
    'jobReference': {
        'jobId': 'UNIQUE_ID',
    },
    'configuration': {
        'load': {
            'destinationTable': {
                'datasetId': 'analytics',
                'tableId': table.name,
            },
            'autodetect': True,
        },
    },
  }, client=client)
job.destination = Table('TABLE_NAME', Dataset('DATASET_NAME', client))
job.source_uris = ['file.csv', ...]

# set other options, e.g.:
job.source_format = 'CSV'
...

job.begin()

@dhermes
Contributor

dhermes commented Jun 30, 2017

FWIW the latest is 0.25.0

@snarfed

snarfed commented Jul 1, 2017

oops, the code i gave in #2926 (comment) doesn't actually work after all. i ended up having to monkey patch to get autodetect into the API call. here's that code:

def _add_autodetect():
    resource = LoadTableFromStorageJob._build_resource(job)
    resource['configuration']['load']['autodetect'] = True
    return resource

job._build_resource = _add_autodetect

can't wait for official support! :P

@tswast
Contributor

tswast commented Jul 30, 2017

At least some of this issue was fixed in #3648

@lukesneeringer
Contributor

@tswast Is what is left of this issue covered in the redesign? If so, I would like to close this out.

@tswast tswast closed this as completed Aug 14, 2017
@hemanthk92

@snarfed Which file (and in what directory) did you change to add this new _add_autodetect function?

@snarfed

snarfed commented Dec 12, 2017

@hemanthk92 it goes in your own code, not bigquery/google cloud's...but ignore it! they've fixed this bug. just set job.autodetect = True on your LoadTableFromStorageJob now.

@hemanthk92

@snarfed thanks for your response.
I'm on google-cloud-bigquery==0.28.0
I don't see autodetect parameter as an option.
[screenshot attached]

@tswast
Contributor

tswast commented Dec 13, 2017

@hemanthk92 It is a property on LoadJobConfig in 0.28.0. The migration guide has a sample for providing a configuration.

from google.cloud import bigquery

client = bigquery.Client()

dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True

with open('data_sample.txt', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config)  # Start the job.

job.result()  # Wait for the job to complete.

@ravi45722

I am using streaming insertion. Is there any way to use this schema auto-detection with streaming?

@tswast
Contributor

tswast commented Jan 2, 2018

Sorry, @ravi45722 the BigQuery API does not have an auto-detect feature for the streaming API. I recommend filing an issue requesting this feature at https://issuetracker.google.com/issues/new?component=187149&template=0

@tseaver
Contributor

tseaver commented Jan 10, 2018

@yiga2 FWIW, that feature is marked "experimental" in the docs.

The workaround you propose won't work: instead, I would use:

    job_config._properties['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']

@chemelnucfin
Contributor

@tseaver did you mean to reply to #4728?
