
Autodetect feature is not available in BigQuery client #2926

Closed
teiuaz opened this issue Jan 9, 2017 · 22 comments
Assignees
Labels
api: bigquery Issues related to the BigQuery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification. Not an issue.

Comments

@teiuaz

teiuaz commented Jan 9, 2017

The autodetect feature works only if the body of the request contains a schema field.
For example, this raw job request works fine:

job = {
    'projectId': PROJECT_ID,
    'configuration': {
        'load': {
            'autodetect': True,  # works without it, by default autodetect = True 
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': self.dataset,
                'tableId': self.table,
            },
            'sourceUris': [self.path],
            'writeDisposition': self.write_disposition
        }
    }
}

The task fails if a schema path is not specified in the task params.

Maybe this code in the client (job.py) behaves unexpectedly:

resource = {
    'jobReference': {
        'projectId': self.project,
        'jobId': self.name,
    },
    'configuration': {
        self._JOB_TYPE: {
            'sourceUris': self.source_uris,
            'destinationTable': {
                'projectId': self.destination.project,
                'datasetId': self.destination.dataset_name,
                'tableId': self.destination.name,
            },
        },
    },
}
configuration = resource['configuration'][self._JOB_TYPE]
self._populate_config_resource(configuration)

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}
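
In other words, the client only emits a schema key when fields were supplied, and never emits autodetect at all. The shape of the fix might look roughly like this (a hypothetical sketch; the helper name and signature are my own, not the library's):

```python
# Hypothetical sketch: conditionally include 'autodetect' in the load
# configuration, the same way the client conditionally includes 'schema'.
# The helper name and signature are illustrative, not part of the library.

def build_load_config(source_uris, autodetect=False, schema_fields=None):
    config = {'sourceUris': source_uris}
    if autodetect:
        config['autodetect'] = True
    if schema_fields:  # only emit 'schema' when fields were actually given
        config['schema'] = {'fields': schema_fields}
    return config
```

With `autodetect=True` and no schema fields, the body then matches the raw request shown above.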
@teiuaz teiuaz changed the title autodetect feature is not available in BigQuery client Autodetect feature is not available in BigQuery client Jan 9, 2017
@daspecster daspecster added the api: bigquery Issues related to the BigQuery API. label Jan 9, 2017
@danoscarmike danoscarmike added type: question Request for information or clarification. Not an issue. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 28, 2017
@lukesneeringer
Contributor

@ivvory Thank you for your submission.
Unfortunately, I am unsure what it is you want us to change. Would you be willing to rephrase?

@teiuaz
Author

teiuaz commented Mar 1, 2017

@lukesneeringer Sorry for the poor explanation. Let me try again. The problem is that when I tried to use the autodetect feature to upload data without a schema, it did not work correctly. I tried to trigger autodetect by passing an empty schema, because I could not find a way to use it through the client directly. I then decided to submit jobs myself, specifying the params manually. When I passed autodetect=True along with an empty fields param (like fields=None or fields=''), I got an error. But when I omitted the empty fields param entirely, the job completed successfully.

Maybe the code below from the client adds the fields param to the job in any case.

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}

P.S. Thanks for autodetect feature in gc web interface :)

@lukesneeringer
Contributor

@ivvory Thanks, that makes sense. Thank you.

@lukesneeringer lukesneeringer self-assigned this Mar 1, 2017
@max-sixty

I'm trying to parse through the posts above.

Is there a way to use autodetect with the google-cloud-bigquery library at the moment?

@dhermes
Contributor

dhermes commented May 1, 2017

@MaximilianR We don't have any explicit support for it (yet).

@tswast AFAICT autodetect only shows up in

  • Job.configuration.load.autodetect
  • Job.configuration.query.tableDefinitions.autodetect
  • Table.externalDataConfiguration.autodetect

The relevant methods for autodetect at creation time seem to be

  • bigquery.jobs.insert
  • bigquery.tables.insert

Is this correct? (I based this off the discovery doc.)

@tswast
Contributor

tswast commented May 1, 2017

I believe that is correct.

There are two times when you would use the autodetect feature: creating a new table from loaded data and making an external table definition. https://cloud.google.com/bigquery/external-table-definition
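
For reference, the two request bodies that accept autodetect would look roughly like this (a sketch based on the discovery doc; all project, dataset, table IDs, and URIs below are placeholders):

```python
# Sketch of the two places 'autodetect' appears, per the discovery doc.
# All IDs and URIs below are placeholders.

# 1. bigquery.jobs.insert -- load job with schema auto-detection
load_job_body = {
    'configuration': {
        'load': {
            'autodetect': True,
            'sourceUris': ['gs://bucket/data.csv'],
            'destinationTable': {
                'projectId': 'PROJECT_ID',
                'datasetId': 'DATASET_ID',
                'tableId': 'TABLE_ID',
            },
        },
    },
}

# 2. bigquery.tables.insert -- external (federated) table definition
external_table_body = {
    'tableReference': {
        'projectId': 'PROJECT_ID',
        'datasetId': 'DATASET_ID',
        'tableId': 'TABLE_ID',
    },
    'externalDataConfiguration': {
        'autodetect': True,
        'sourceFormat': 'CSV',
        'sourceUris': ['gs://bucket/data.csv'],
    },
}
```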

@teiuaz
Author

teiuaz commented May 4, 2017

@MaximilianR maybe you are interested in some notes about using autodetect.

For now, you can use autodetect simply by submitting and polling the job yourself:

from time import sleep


def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)


jobs = GbqConnector(project_id=project_id).service.jobs()
params = {'configuration': {'load': {}}}  # fill in destinationTable, sourceUris, etc.

# use autodetect if needed
params['configuration']['load']['autodetect'] = True

response = jobs.insert(projectId=project_id, body=params).execute()
job_id = response['jobReference']['jobId']

result = poll_job(jobs, job_id)

Sometimes you want to append data with new columns that are absent from the existing table. That is a more common case when using autodetect than a manual schema. So you may want the schemaUpdateOptions param with the new 'ALLOW_FIELD_ADDITION' option, which allows fields missing from the target table to be added:

    if write_disposition == 'WRITE_APPEND':
        params['configuration']['load']['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']

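
Putting those two settings together, building the load configuration might be sketched like this (the helper name is my own, not part of any library):

```python
# Sketch: combine autodetect with ALLOW_FIELD_ADDITION when appending.
# The helper name is illustrative, not part of any library.

def build_append_load_config(write_disposition):
    load = {'autodetect': True, 'writeDisposition': write_disposition}
    if write_disposition == 'WRITE_APPEND':
        # let autodetected columns missing from the target table be added
        load['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
    return {'configuration': {'load': load}}
```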
One more problem arises from combining these features. I guess the workflow is the following: if you have an existing table and load data with autodetect, first the schema is detected from the data and then compared with the schema of the existing table. That causes some errors, since sometimes you don't know what type a column has in the current data. For example, if all values of some field in all processed rows are null, autodetect recognizes it as a string type (not sure exactly), while in the existing table the field actually has another type. The simplest workaround that can help:

def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        # check for this situation
        if 'errorResult' in status:
            if str(status['errorResult']['message']).startswith('Invalid schema update'):
                return 'Autodetect failed'
            else:
                raise Exception(str(status))

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)
 
# param declaring...
result = poll_job(jobs, job_id)

if result == 'Autodetect failed':
    del params['configuration']['load']['autodetect']
    del params['configuration']['load']['schemaUpdateOptions']

    response = jobs.insert(projectId=project_id, body=params).execute()
    job_id = response['jobReference']['jobId']
    poll_job(jobs, job_id)

One flexible option related to autodetect is unavailable: you cannot create a table without loading data. So I had to write my own schema generator, copying the workflow of the autodetect option :)
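
Such a client-side schema generator can be sketched like this (the type mapping is a simplification and the function names are my own; real autodetect samples the data server-side):

```python
# A minimal sketch of a client-side schema generator that mimics what
# autodetect does server-side. Function names and the type mapping are
# illustrative simplifications, not part of the BigQuery client.

def infer_field_type(value):
    """Map a Python value to a BigQuery type name (simplified)."""
    if isinstance(value, bool):  # check bool before int (bool is an int subclass)
        return 'BOOLEAN'
    if isinstance(value, int):
        return 'INTEGER'
    if isinstance(value, float):
        return 'FLOAT'
    return 'STRING'


def infer_schema(rows):
    """Build a schema resource ({'fields': [...]}) from sample row dicts."""
    types = {}  # field name -> detected type, or None while only nulls seen
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, None)
            if types[name] is None and value is not None:
                types[name] = infer_field_type(value)
    # like autodetect, fall back to STRING for all-null columns
    return {'fields': [{'name': n, 'type': t or 'STRING', 'mode': 'NULLABLE'}
                       for n, t in sorted(types.items())]}
```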

Hope you found that helpful.

@gyangscrf

Where is the GbqConnector module from?

@tswast
Contributor

tswast commented Jun 26, 2017

I believe GbqConnector is from the Pandas library, not this one. https://github.com/pydata/pandas-gbq/blob/4b04174ddc61e6490dd07f8e6b758463d82206df/pandas_gbq/gbq.py#L193

@snarfed

snarfed commented Jun 29, 2017

UPDATE: the code below doesn't work. try this code instead!

here's workaround code that works with the current released google-cloud-python (0.24.0). the key part is passing {'configuration': {'load': {'autodetect': True}}} to LoadTableFromStorageJob.from_api_repr().

from google.cloud.bigquery import Client
from google.cloud.bigquery.job import LoadTableFromStorageJob
from google.cloud.bigquery.table import Table, Dataset

client = Client(...)
job = LoadTableFromStorageJob.from_api_repr({
    'jobReference': {
        'jobId': 'UNIQUE_ID',
    },
    'configuration': {
        'load': {
            'destinationTable': {
                'datasetId': 'analytics',
                'tableId': table.name,
            },
            'autodetect': True,
        },
    },
  }, client=client)
job.destination = Table('TABLE_NAME', Dataset('DATASET_NAME', client))
job.source_uris = ['file.csv', ...]

# set other options, e.g.:
job.source_format = 'CSV'
...

job.begin()

@dhermes
Contributor

dhermes commented Jun 30, 2017

FWIW the latest is 0.25.0

@snarfed

snarfed commented Jul 1, 2017

oops, the code i gave in #2926 (comment) doesn't actually work after all. i ended up having to monkey patch to get autodetect into the API call. here's that code:

def _add_autodetect():
    resource = LoadTableFromStorageJob._build_resource(job)
    resource['configuration']['load']['autodetect'] = True
    return resource

job._build_resource = _add_autodetect

can't wait for official support! :P

@tswast
Contributor

tswast commented Jul 30, 2017

At least some of this issue was fixed in #3648

@lukesneeringer
Contributor

@tswast Is what is left of this issue covered in the redesign? If so, I would like to close this out.

@tswast tswast closed this as completed Aug 14, 2017
@hemanthk92

@snarfed Which file (and in what directory) did you change to add this new _add_autodetect function?

@snarfed

snarfed commented Dec 12, 2017

@hemanthk92 it goes in your own code, not bigquery/google cloud's...but ignore it! they've fixed this bug. just set job.autodetect = True on your LoadTableFromStorageJob now.

@hemanthk92

@snarfed thanks for your response.
I'm on google-cloud-bigquery==0.28.0
I don't see autodetect parameter as an option.
[screenshot attached]

@tswast
Contributor

tswast commented Dec 13, 2017

@hemanthk92 It is a property on LoadJobConfig in 0.28.0. The migration guide has a sample for providing a configuration.

from google.cloud import bigquery

client = bigquery.Client()

dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True

with open('data_sample.txt', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config)  # Start the job.

job.result()  # Wait for the job to complete.

@ravi45722

I am using streaming insertion. Is there any way to use this schema auto-detection with streaming?

@tswast
Contributor

tswast commented Jan 2, 2018

Sorry, @ravi45722 the BigQuery API does not have an auto-detect feature for the streaming API. I recommend filing an issue requesting this feature at https://issuetracker.google.com/issues/new?component=187149&template=0

@tseaver
Contributor

tseaver commented Jan 10, 2018

@yiga2 FWIW, that feature is marked "experimental" in the docs.

The workaround you propose won't work: instead, I would use:

    job_config._properties['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']

@chemelnucfin
Contributor

@tseaver did you mean to reply to #4728?
