Autodetect feature is not available in BigQuery client #2926
@ivvory Thank you for your submission.
@lukesneeringer Sorry for the bad explanation, let me try again. The problem is that when I tried to use the autodetect feature to upload data without a schema, the feature didn't work correctly. I tried to use autodetect by passing an empty schema, because I didn't find a way to use it through the client directly. In the end I decided to submit the jobs myself, specifying the params manually. Maybe this code from the client is what adds the schema field:

```python
if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}
```

P.S. Thanks for the autodetect feature in the GC web interface :)
@ivvory Thanks, that makes sense.
I'm trying to parse through the posts above. Is there a way to use autodetect through the client?
@MaximilianR We don't have any explicit support for it (yet). @tswast AFAICT the relevant places for autodetect at creation time seem to be the load job configuration and the external table definition. Is this correct? (I based this off the discovery doc.)
I believe that is correct. There are two times when you would use the autodetect feature: creating a new table from loaded data, and making an external table definition. https://cloud.google.com/bigquery/external-table-definition
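As a concrete illustration of the second case, here is a minimal sketch of an external table with autodetect, assuming the newer `google-cloud-bigquery` client's `ExternalConfig` API; the dataset, table, and bucket names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Describe the external data source and let BigQuery infer its schema.
external_config = bigquery.ExternalConfig('CSV')
external_config.source_uris = ['gs://my-bucket/data-*.csv']  # placeholder bucket
external_config.autodetect = True

table = bigquery.Table(client.dataset('my_dataset').table('my_external_table'))
table.external_data_configuration = external_config
table = client.create_table(table)  # API request: creates the external table
```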
@MaximilianR Maybe you are interested in some notes about using autodetect. For now you can use autodetect simply by submitting the job yourself and polling it:

```python
from time import sleep

def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']
        if status['state'] == 'DONE':
            return 'DONE'
        sleep(1)

jobs = GbqConnector(project_id=project_id).service.jobs()

params = {'configuration': {'load': {}}}  # declare the rest of the load config here

# use autodetect if needed
params['configuration']['load']['autodetect'] = True

response = jobs.insert(projectId=project_id, body=params).execute()
job_id = response['jobReference']['jobId']
result = poll_job(jobs, job_id)
```

Sometimes you want to append data with new columns that are absent from the existing table. That is a more common case for autodetect than for a manual schema, so you may also want the `schemaUpdateOptions` param:

```python
if write_disposition == 'WRITE_APPEND':
    params['configuration']['load']['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
```

One more problem comes from combining these features. I guess the workflow is the following: if you have an existing table and load data with autodetect, the schema is first detected from the data and then compared with the schema of the existing table. That causes errors, because sometimes you don't know what type a column has in the current batch of data. For example, if all values of some field are null in all of the rows being processed, autodetect recognizes it as a STRING type (not sure exactly), while in the existing table the field has another type. The simplest workaround that can help is to detect the failure and retry without autodetect:

```python
def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']
        # check for the schema-mismatch situation
        if 'errorResult' in status:
            if str(status['errorResult']['message']).startswith('Invalid schema update'):
                return 'Autodetect failed'
            raise Exception(str(status))
        if status['state'] == 'DONE':
            return 'DONE'
        sleep(1)

# param declaring...
result = poll_job(jobs, job_id)
if result == 'Autodetect failed':
    # retry the load without autodetect / schema update options
    del params['configuration']['load']['autodetect']
    del params['configuration']['load']['schemaUpdateOptions']
    response = jobs.insert(projectId=project_id, body=params).execute()
    job_id = response['jobReference']['jobId']
    poll_job(jobs, job_id)
```

One flexible option related to autodetect is unavailable: you cannot create a table without loading data. So I had to write my own schema generator, copying the workflow of the autodetect option :) Hope you found that helpful.
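Since `params` in the snippets above is only sketched, here is a hedged guess at what a complete load-job body for the REST API could look like; the project, dataset, table, and bucket names are hypothetical:

```python
params = {
    'configuration': {
        'load': {
            'sourceUris': ['gs://my-bucket/data.csv'],  # hypothetical source file
            'destinationTable': {
                'projectId': 'my-project',
                'datasetId': 'my_dataset',
                'tableId': 'my_table',
            },
            'sourceFormat': 'CSV',
            'writeDisposition': 'WRITE_APPEND',
            # the two fields discussed above
            'autodetect': True,
            'schemaUpdateOptions': ['ALLOW_FIELD_ADDITION'],
        }
    }
}
```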
Where is the `GbqConnector` module from?
I believe `GbqConnector` comes from pandas (`pandas.io.gbq`).
UPDATE: the code below doesn't work. Try this code instead!
FWIW the latest is `pandas-gbq`.
oops, the code I gave in #2926 (comment) doesn't actually work after all. I ended up having to monkey patch to get autodetect into the API call. Here's that code:

```python
def _add_autodetect():
    # build the job resource as usual, then force autodetect on
    resource = LoadTableFromStorageJob._build_resource(job)
    resource['configuration']['load']['autodetect'] = True
    return resource

job._build_resource = _add_autodetect
```

can't wait for official support! :P
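For context, the patch above would presumably be wired up along these lines, assuming the pre-0.28 client's `load_table_from_storage`/`begin` API; the job name, table, and URI are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.dataset('my_dataset').table('my_table')  # placeholder names

# Create the load job, then swap in the patched resource builder
# before starting it, so 'autodetect' ends up in the API request.
job = client.load_table_from_storage(
    'my-load-job', table, 'gs://my-bucket/data.csv')
job._build_resource = _add_autodetect
job.begin()
```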
At least some of this issue was fixed in #3648
@tswast Is what is left of this issue covered in the redesign? If so, I would like to close this out.
@snarfed Which file (and what directory is that file in) did you change to add this new `_add_autodetect` function?
@hemanthk92 it goes in your own code, not bigquery/google cloud's... but ignore it! they've fixed this bug. just set `autodetect = True` on the job config.
@snarfed thanks for your response.
@hemanthk92 It is a property on `LoadJobConfig`:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True

with open('data_sample.txt', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config)  # Start the job.

job.result()  # Wait for the job to complete.
```
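The same `LoadJobConfig` also works for loading from Cloud Storage; a small sketch, with a hypothetical bucket and a CSV input:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('test_dataset').table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.CSV  # input format
job_config.skip_leading_rows = 1                      # skip the header row

job = client.load_table_from_uri(
    'gs://my-bucket/data.csv', table_ref, job_config=job_config)
job.result()  # wait for the load to finish
```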
I am using streaming insertion. Is there any way to use schema autodetect with streaming??
Sorry @ravi45722, the BigQuery API does not have an autodetect feature for the streaming API. I recommend filing an issue requesting this feature at https://issuetracker.google.com/issues/new?component=187149&template=0
@yiga2 FWIW, that feature is marked "experimental" in the docs. The workaround you propose won't work; instead, I would use:

```python
job_config._properties['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
```
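Newer client releases expose this as a public property, so reaching into `_properties` should no longer be necessary; a minimal sketch, assuming `SchemaUpdateOption` is available in your installed version:

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
# Public equivalent of the _properties workaround above.
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]
```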
The autodetect feature works only if the body of the request contains a schema field. For example, a request that includes a schema works fine, but the job fails if no schema is specified in the job params. Maybe this code in the client works unexpectedly (job.py):

```python
if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}
```