Free Python GIL on blocking operations to allow multi-threading runtime usage without deadlocks #387
Previously we forced the use of a single-threaded tokio runtime whenever a Python Datasource was in use. This was done to work around deadlocks, but it has the unfortunate side effect of disabling multi-threaded parallelization of queries in these situations. This PR applies the technique discussed in PyO3/pyo3#2182: it uses the PyO3 `allow_threads` construct to release the Python GIL before performing blocking operations that may themselves need to acquire the GIL in separate threads.

It turns out that the DuckDbDatasource still requires running on the main thread in order to access the kernel's top-level DataFrames, so I made the main thread behavior configurable on a per-datasource level, where the default is to maintain the prior behavior of running on the main thread.
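To make the deadlock and its fix concrete, here is a minimal std-only sketch, not the code in this PR. It models the GIL as a plain `Mutex`, which is an assumption for illustration; the real change goes through PyO3's `allow_threads` rather than dropping a guard by hand. The names `Gil`, `scan_python_datasource`, and `run_blocking_query` are hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for the Python GIL: one process-wide lock (an assumption for
// illustration; the real GIL is managed by the interpreter, not a Rust Mutex).
type Gil = Arc<Mutex<()>>;

// Hypothetical worker that must take the "GIL" to do its job, the way a
// Python-backed datasource would when called from a runtime worker thread.
fn scan_python_datasource(gil: &Gil) -> usize {
    let _held = gil.lock().unwrap();
    3 // pretend we read three rows from a Python object
}

// Hypothetical caller that starts out holding the "GIL", releases it,
// blocks on the worker, then reacquires it before returning.
fn run_blocking_query(gil: &Gil) -> usize {
    let guard = gil.lock().unwrap(); // caller enters holding the lock
    drop(guard); // release before blocking (the role `allow_threads` plays)

    let worker_gil = Arc::clone(gil);
    let worker = thread::spawn(move || scan_python_datasource(&worker_gil));
    let rows = worker.join().unwrap(); // safe to block: the worker can lock

    let _reacquired = gil.lock().unwrap(); // take the lock back before returning
    rows
}
```

If the `drop(guard)` line were removed, the worker would wait forever for the lock while the caller waits forever in `join()`. That is the shape of the deadlock the single-threaded runtime was working around.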
I started thinking about this again as a result of the discussion in #386. With these changes, it should be possible to write a `__dataframe__` protocol-based VegaFusion Datasource that implements a custom DataFusion datasource without requiring everything to run on the main thread.