Client.insert_rows_json(): add option to disable best-effort deduplication #720

Closed
pietrodn opened this issue Jun 25, 2021 · 6 comments · Fixed by #734
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@pietrodn

pietrodn commented Jun 25, 2021

Currently, the Client.insert_rows_json() method for streaming inserts always attaches an insertId unique identifier to each row provided.
This row identifier can be user-provided; if the user doesn't provide any identifiers, the library automatically fills in the row IDs using UUID4.

Here's the code:

        for index, row in enumerate(json_rows):
            info = {"json": row}
            if row_ids is not None:
                info["insertId"] = row_ids[index]
            else:
                info["insertId"] = str(uuid.uuid4())
            rows_info.append(info)

However, insert IDs are entirely optional, and there are actually valid use cases not to use them. From the BigQuery documentation:

You can disable best effort de-duplication by not populating the insertId field for each row inserted. When you do not populate insertId, you get higher streaming ingest quotas in certain regions. This is the recommended way to get higher streaming ingest quota limits.

The BigQuery Python client library provides no way to omit the insertIds. It would be nice to have a parameter for that.
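To make the behavior concrete, the loop quoted above can be replayed in isolation (an illustrative sketch of the quoted snippet, not the actual client): every row ends up with an insertId key, because no code path omits it.

```python
import uuid

def build_rows_info(json_rows, row_ids=None):
    """Replays the row-building loop quoted above (illustrative only)."""
    rows_info = []
    for index, row in enumerate(json_rows):
        info = {"json": row}
        if row_ids is not None:
            info["insertId"] = row_ids[index]
        else:
            # No user-provided IDs: a UUID4 is generated per row, so
            # best-effort de-duplication is always switched on.
            info["insertId"] = str(uuid.uuid4())
        rows_info.append(info)
    return rows_info

payload = build_rows_info([{"a": 1}, {"a": 2}])
# Every entry carries an insertId, whether or not row_ids was supplied.
assert all("insertId" in info for info in payload)
```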

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Jun 25, 2021
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jun 26, 2021
@plamut plamut added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. labels Jun 28, 2021
@plamut
Contributor

plamut commented Jun 28, 2021

I'll double check why this is the case, but at a glance this request sounds reasonable. 👍

@tswast
Contributor

tswast commented Jun 28, 2021

I recall that you can explicitly pass in a list of None values to omit these.

It might be useful to have a more discoverable way to do this, though.
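Concretely, the suggestion is to pass row_ids=[None] * len(rows). Replaying the loop from the issue description shows why this works: since the list is not None, the indexing branch is taken and each insertId is set to None rather than a generated UUID (a sketch of the quoted loop; whether the backend then treats a null insertId as absent is per the recollection above).

```python
import uuid

rows = [{"a": 1}, {"a": 2}]
row_ids = [None] * len(rows)  # the suggested workaround

# Replaying the library's row-building loop with row_ids supplied:
rows_info = []
for index, row in enumerate(rows):
    info = {"json": row}
    if row_ids is not None:                  # True: indexing branch taken
        info["insertId"] = row_ids[index]    # None, not a UUID
    else:
        info["insertId"] = str(uuid.uuid4())
    rows_info.append(info)

# No UUIDs were generated; every insertId is explicitly None.
assert all(info["insertId"] is None for info in rows_info)
```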

@pietrodn
Author

> I recall that you can explicitly pass in a list of None values to omit these.
>
> It might be useful to have a more discoverable way to do this, though.

It seems to work. However, I don't see the point in allocating memory just to build a list of Nones; maybe it can be improved? :-)

@plamut
Contributor

plamut commented Jun 29, 2021

A memory-efficient alternative would be to provide a fake sequence-ish object that returns None for every index:

class NoneItems:
    def __getitem__(self, index):
        return None

>>> insert_ids = NoneItems()
>>> assert insert_ids[0] is None
>>> assert insert_ids[1] is None
>>> assert insert_ids[42] is None

A bit hacky, but should be a good enough workaround until more user friendly support is added. :)
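Plugged into the loop from the issue description, this works because NoneItems() is not None, so the indexing branch runs and yields None for every row, with O(1) memory regardless of row count (an illustrative sketch of the quoted loop, not the actual client):

```python
import uuid

class NoneItems:
    """Fake sequence: returns None for any index, using O(1) memory."""
    def __getitem__(self, index):
        return None

rows = [{"a": i} for i in range(1000)]
row_ids = NoneItems()  # no 1000-element list of Nones allocated

rows_info = []
for index, row in enumerate(rows):
    info = {"json": row}
    if row_ids is not None:                  # NoneItems() is not None
        info["insertId"] = row_ids[index]    # always None
    else:
        info["insertId"] = str(uuid.uuid4())
    rows_info.append(info)

assert all(info["insertId"] is None for info in rows_info)
```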

@plamut
Contributor

plamut commented Jun 29, 2021

Update: This was confirmed, we'll add support for this in a more user-friendly way.

@pietrodn
Author

pietrodn commented Jul 1, 2021

Amazing, thanks!

4 participants