Client.insert_rows_json(): add option to disable best-effort deduplication #720

Closed
pietrodn opened this issue Jun 25, 2021 · 6 comments · Fixed by #734
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@pietrodn

pietrodn commented Jun 25, 2021

Currently, the Client.insert_rows_json() method for streaming inserts always attaches an insertId unique identifier to each row provided.
This row identifier can be user-provided; if the user doesn't provide any identifiers, the library automatically fills in the row IDs using UUID4.

Here's the code:

        for index, row in enumerate(json_rows):
            info = {"json": row}
            if row_ids is not None:
                info["insertId"] = row_ids[index]
            else:
                info["insertId"] = str(uuid.uuid4())
            rows_info.append(info)

However, insert IDs are entirely optional, and there are actually valid use cases not to use them. From the BigQuery documentation:

You can disable best effort de-duplication by not populating the insertId field for each row inserted. When you do not populate insertId, you get higher streaming ingest quotas in certain regions. This is the recommended way to get higher streaming ingest quota limits.

The BigQuery Python client library provides no way to omit the insertIds. It would be nice to have a parameter for that.
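To make the behavior concrete, the loop quoted above can be replayed in isolation (an illustrative sketch of the quoted snippet, not the actual client): every row ends up with an insertId key, because no code path omits it.

```python
import uuid

def build_rows_info(json_rows, row_ids=None):
    """Replays the row-building loop quoted above (illustrative only)."""
    rows_info = []
    for index, row in enumerate(json_rows):
        info = {"json": row}
        if row_ids is not None:
            info["insertId"] = row_ids[index]
        else:
            # No user-provided IDs: a UUID4 is generated per row, so
            # best-effort de-duplication is always switched on.
            info["insertId"] = str(uuid.uuid4())
        rows_info.append(info)
    return rows_info

payload = build_rows_info([{"a": 1}, {"a": 2}])
# Every entry carries an insertId, whether or not row_ids was supplied.
assert all("insertId" in info for info in payload)
```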

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Jun 25, 2021
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jun 26, 2021
@plamut plamut added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. labels Jun 28, 2021
@plamut
Contributor

plamut commented Jun 28, 2021

I'll double check why this is the case, but at a glance this request sounds reasonable. 👍

@tswast
Contributor

tswast commented Jun 28, 2021

I recall that you can explicitly pass in a list of None values to omit these.

It might be useful to have a more discoverable way to do this, though.
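Concretely, the suggestion is to pass row_ids=[None] * len(rows). Replaying the loop from the issue description shows why this works: since the list is not None, the indexing branch is taken and each insertId is set to None rather than a generated UUID (a sketch of the quoted loop; whether the backend then treats a null insertId as absent is per the recollection above).

```python
import uuid

rows = [{"a": 1}, {"a": 2}]
row_ids = [None] * len(rows)  # the suggested workaround

# Replaying the library's row-building loop with row_ids supplied:
rows_info = []
for index, row in enumerate(rows):
    info = {"json": row}
    if row_ids is not None:                  # True: indexing branch taken
        info["insertId"] = row_ids[index]    # None, not a UUID
    else:
        info["insertId"] = str(uuid.uuid4())
    rows_info.append(info)

# No UUIDs were generated; every insertId is explicitly None.
assert all(info["insertId"] is None for info in rows_info)
```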

@pietrodn
Author

> I recall that you can explicitly pass in a list of None values to omit these.
>
> It might be useful to have a more discoverable way to do this, though.

It seems to work. However, I don't see the point in allocating memory just to build a list of Nones; maybe it can be improved? :-)

@plamut
Contributor

plamut commented Jun 29, 2021

A memory-efficient alternative would be to provide a fake sequence-ish object that returns None for every index:

class NoneItems:
    def __getitem__(self, index):
        return None

>>> insert_ids = NoneItems()
>>> assert insert_ids[0] is None
>>> assert insert_ids[1] is None
>>> assert insert_ids[42] is None

A bit hacky, but should be a good enough workaround until more user friendly support is added. :)
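Plugged into the loop from the issue description, this works because NoneItems() is not None, so the indexing branch runs and yields None for every row, with O(1) memory regardless of row count (an illustrative sketch of the quoted loop, not the actual client):

```python
import uuid

class NoneItems:
    """Fake sequence: returns None for any index, using O(1) memory."""
    def __getitem__(self, index):
        return None

rows = [{"a": i} for i in range(1000)]
row_ids = NoneItems()  # no 1000-element list of Nones allocated

rows_info = []
for index, row in enumerate(rows):
    info = {"json": row}
    if row_ids is not None:                  # NoneItems() is not None
        info["insertId"] = row_ids[index]    # always None
    else:
        info["insertId"] = str(uuid.uuid4())
    rows_info.append(info)

assert all(info["insertId"] is None for info in rows_info)
```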

@plamut
Contributor

plamut commented Jun 29, 2021

Update: This was confirmed, we'll add support for this in a more user-friendly way.

@pietrodn
Author

pietrodn commented Jul 1, 2021

Amazing, thanks!

4 participants