Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in MutationsBatcher.mutate() #949

Closed
patjakdev opened this issue Apr 1, 2024 · 10 comments
Closed

Hang in MutationsBatcher.mutate() #949

patjakdev opened this issue Apr 1, 2024 · 10 comments
Assignees
Labels
api: bigtable Issues related to the googleapis/python-bigtable API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@patjakdev
Copy link

Environment details

  • OS type and version: Mac OS X 14.1.2, Ubuntu 22.04
  • Python version: 3.10.13
  • pip version: N/A
  • google-cloud-bigtable version: 2.23.0

Steps to reproduce

Recently, we started seeing this usage of the MutationsBatcher hang indefinitely in mutate(). This issue seems to be reproducible both in my local development environment (Mac) as well as in our deployed dev environment (Linux).

Based on the stack traces I included below, I'm guessing there's some problem with the creation of the gRPC channel to perform the flush but it's unclear to me how to get any insight into what might be going on there.

Code example

    def _on_batch_completed(self, statuses: Sequence[Status]) -> None:
            logger.debug(f"Completed write batch of size {len(statuses)}: {statuses}")

    def store_features(self, entity_type: str, entities: pd.DataFrame) -> None:
        table = "ml_features_cases"
        packer = msgpack.Packer()
        with MutationsBatcher(table, batch_completed_callback=self._on_batch_completed) as batch:
            for index, *values in entities.itertuples(name=None):
                row = table.direct_row(index.encode("utf-8"))

                for column_name, value in zip(entities.columns.values, values):
                    # ... encode data into variable named encoded_value
                    row.set_cell("features", column_name.encode(), encoded_value)

                batch.mutate(row)

Stack trace

Thread 0x000000029d037000 (most recent call first):

 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 1972 in __init__
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/grpc/__init__.py", line 2107 in secure_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 386 in create_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable_v2/services/bigtable/transports/grpc.py", line 217 in create_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/client.py", line 285 in _create_gapic_client_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/client.py", line 332 in table_data_client
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/table.py", line 160 in name
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/table.py", line 724 in mutate_rows
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/batcher.py", line 385 in _flush_rows
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58 in run
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953 in run
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x000000029c02b000 (most recent call first):

 File "", line 1074 in get_data
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/pkgutil.py", line 639 in get_data
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 1972 in __init__
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/grpc/__init__.py", line 2107 in secure_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 386 in create_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable_v2/services/bigtable/transports/grpc.py", line 217 in create_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/client.py", line 285 in _create_gapic_client_channel
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/client.py", line 332 in table_data_client
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/table.py", line 160 in name
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/table.py", line 724 in mutate_rows
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/batcher.py", line 385 in _flush_rows
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58 in run
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953 in run
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00000001d6c01ec0 (most recent call first):

 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 320 in wait
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 607 in wait
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 940 in start
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 199 in _adjust_thread_count
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 176 in submit
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/batcher.py", line 340 in _flush_async
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/google/cloud/bigtable/batcher.py", line 266 in mutate
 File "/Users/patrick/src/Apella-Technology/realtime-dags/case_forecasting_baseline/feature_store/feature_store.py", line 138 in store_features
 File "/Users/patrick/src/Apella-Technology/realtime-dags/case_forecasting_baseline/dags/ops/feature_calculation.py", line 137 in store_features
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 125 in invoke_compute_fn
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 131 in _coerce_op_compute_fn_to_iterator
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 465 in iterate_with_context
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 181 in _yield_compute_results
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 212 in execute_core_compute
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 95 in _process_asset_results_to_events
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 160 in _step_output_error_checked_user_event_sequence
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 478 in core_dagster_event_sequence_for_step
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 286 in dagster_event_sequence_for_step
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 121 in inner_plan_execution_iterator
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/executor/in_process.py", line 26 in inprocess_execution_iterator
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/api.py", line 875 in __iter__
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/executor/in_process.py", line 55 in execute
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/api.py", line 764 in job_execution_iterator
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_core/execution/api.py", line 875 in __iter__
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_grpc/impl.py", line 152 in core_execute_run
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_grpc/impl.py", line 242 in _run_in_subprocess
 File "/Users/patrick/Library/Caches/pypoetry/virtualenvs/realtime-dagster-wbtued84-py3.10/lib/python3.10/site-packages/dagster/_grpc/impl.py", line 266 in start_run_in_subprocess
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/multiprocessing/process.py", line 108 in run
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
 File "/Users/patrick/.pyenv/versions/3.10.13/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
 File "", line 1 in 
@product-auto-label product-auto-label bot added the api: bigtable Issues related to the googleapis/python-bigtable API. label Apr 1, 2024
@patjakdev
Copy link
Author

I should add that this hang is not deterministic. It happens about 50% of the time that the code runs.

@patjakdev
Copy link
Author

A bit more information. This issue only occurs with google-api-core >= 2.17.0.

Given that google/api_core/grpc_helpers.py appears near the problematic area in the callstack, I think this may be in an issue in that library.

daniel-sanche referenced this issue in googleapis/python-api-core Apr 4, 2024
* feat: Add attempt_direct_path argument to create_channel

* add more test cases

* fix docstring

* fix docstring

* update docstring of attempt_direct_path arg

* update docstring of target arg

* Add comment for dns_prefix local variable

* Set the default value of attempt_direct_path to False

* simplify conditional statement

* use warnings.warn instead of _LOGGER.debug

* update docstring of target arg in _modify_target_for_direct_path

* s/direct_path_prefix/direct_path_separator

* default->google_auth_default

* parametrize target in def test_create_channel_implicit

* Add github issue for TODO

* filter deprecation warning related to grpcio-gcp

* format docstring
@daniel-sanche
Copy link
Contributor

I was able to reproduce the issue. It looks like it comes up when table.name is called in a background thread. .name ends up calling _client.table_data_client, which builds a new transport if it's being called the first time. It looks like when it's constructed for the first time in a background thread, it sometimes (always?) locks up. Calling table.name before starting the MutationsBatcher seems to prevent this issue

Your observation on it being related to google-api-core==2.17.0 was a good one. It looks like this is hanging on this ssl_credentials change added in 2.17.0

@parthea let me know how you want to proceed. I can probably make some patches to fix this issue on the bigtable side, but it's probably better to fix this in api_core

@parthea
Copy link
Contributor

parthea commented Apr 5, 2024

@daniel-sanche

Can you share the code that you used to create the issue, and the version of gRPC that was used? I can check to see if this is an issue in other clients as well.

@daniel-sanche
Copy link
Contributor

daniel-sanche commented Apr 5, 2024

I reproduced it yesterday using similar code to Patrick above. But this simpler, more generic sample also hangs:

import google.cloud.bigtable
import concurrent.futures

def bg_worker():
    print("creating client...")
    client = google.cloud.bigtable_v2.BigtableClient()
    print("worker done")

if __name__ == "__main__":
    executor = concurrent.futures.ThreadPoolExecutor()
    futures_list = []
    for i in range(10):
        futures_list.append(executor.submit(bg_worker))
    for future in futures_list:
        future.result()

In both cases, I used would run python script in a bash loop. The Python code doesn't hang every time, but when looped, it will consistently hang within a couple minutes:

while true
do 
  python test_script.py
done

I have grpcio == 1.60.1 installed if that matters

@parthea
Copy link
Contributor

parthea commented Apr 7, 2024

Thanks for the sample code! I can produce the issue in versions of grpcio newer than 1.58.0

The issue does not appear with pip install grpcio==1.58.0 . I'll try to create a minimal example using gRPC and file an issue in that repository.

@parthea
Copy link
Contributor

parthea commented Apr 7, 2024

This could be related to grpc/grpc#34672 which was fixed for grpc_ssl_credentials_create. There may be another code path in the gRPC code that still needs to be fixed.

@parthea parthea added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Apr 7, 2024
@parthea
Copy link
Contributor

parthea commented Apr 7, 2024

Marking as P2 since there is a workaround which is to use grpcio==1.58.0. I'll file an issue in the gRPC repository which we can use to track the long term fix.

copybara-service bot pushed a commit to grpc/grpc that referenced this issue Apr 9, 2024
…th nogil (#36266)

This fix is similar to #34712 but for `grpc_google_default_credentials_create` rather than `grpc_ssl_credentials_create`

Fixes #36265
Fixes googleapis/python-bigtable#949

Closes #36266

COPYBARA_INTEGRATE_REVIEW=#36266 from parthea:repro-issue-34672 d736f6f
PiperOrigin-RevId: 623291826
XuanWang-Amos pushed a commit to XuanWang-Amos/grpc that referenced this issue Apr 16, 2024
…th nogil (grpc#36266)

This fix is similar to grpc#34712 but for `grpc_google_default_credentials_create` rather than `grpc_ssl_credentials_create`

Fixes grpc#36265
Fixes googleapis/python-bigtable#949

Closes grpc#36266

COPYBARA_INTEGRATE_REVIEW=grpc#36266 from parthea:repro-issue-34672 d736f6f
PiperOrigin-RevId: 623291826
XuanWang-Amos added a commit to grpc/grpc that referenced this issue Apr 16, 2024
…th nogil (v1.62.x backport) (#36376)

Backport of #36266 to v1.62.x.
---
This fix is similar to #34712 but for
`grpc_google_default_credentials_create` rather than
`grpc_ssl_credentials_create`

Fixes #36265
Fixes googleapis/python-bigtable#949

Co-authored-by: Anthonios Partheniou <[email protected]>
@patjakdev
Copy link
Author

@parthea and @daniel-sanche - thanks to you both! Sorry I've gone a bit silent; I was on vacation the past two weeks.

If there's anything else I can do to help let me know, but it looks like y'all got to the bottom of this.

@daniel-sanche
Copy link
Contributor

Yeah I think it should be safe to close this, thanks for the detailed bug report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigtable Issues related to the googleapis/python-bigtable API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants