[IMP] snippets.convert_html_columns: a batch processing story
TLDR: RTFM

Once upon a time, in a countryside farm in Belgium...

At first, the upgrade of databases was straightforward. But, as time
passed, the size of the databases grew, and some CPU-intensive
computations took so much time that a solution needed to be found.
Fortunately, the Python standard library has the perfect module for this
task: `concurrent.futures`.
Then Python 3.10 appeared, and `ProcessPoolExecutor` started to hang
occasionally for no apparent reason. Soon, our hero found out he wasn't
the only one to suffer from this issue[^1].
Unfortunately, the proposed solution looked like overkill. Still, it
revealed that the issue had already been known[^2] for a few years.
Although an official patch wasn't ready to be committed, the
discussion about its legitimacy[^3] led our hero to a nicer solution.

By default, `ProcessPoolExecutor.map` submits elements to the pool one
by one (*chunksize* defaults to 1). This is inefficient when there are
many elements to process, because each element is sent to a worker as a
separate task. Passing a larger value for the *chunksize* argument makes
the pool submit elements in batches instead.
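
As a rough illustration of the batching described above, here is a minimal
sketch. The `convert` function and the sample rows are hypothetical stand-ins
for the real converter callback and query results, and `convert_rows` is a
helper invented for this example, not part of the patched module.

```python
from concurrent.futures import ProcessPoolExecutor


def convert(row):
    # Hypothetical stand-in for a CPU-bound converter callback.
    row_id, body = row
    return {"id": row_id, "body": body.upper()}


def convert_rows(executor, rows, chunksize=1000):
    # With chunksize=1 (the default) every row is sent to a worker as its
    # own task; a larger chunksize groups rows into batches per task.
    # Results are still returned in input order.
    return list(executor.map(convert, rows, chunksize=chunksize))


if __name__ == "__main__":
    # Made-up sample data standing in for cr.fetchall().
    rows = [(i, f"<p>row {i}</p>") for i in range(5000)]
    with ProcessPoolExecutor(max_workers=2) as executor:
        results = convert_rows(executor, rows)
    print(len(results), results[0]["id"])
```

The right *chunksize* depends on the workload. The value 1000 here simply
mirrors the one chosen in this commit.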

Who would have thought that a bigger chunk size would solve a
performance issue?
As always, the answer was in the documentation[^4].

[^1]: https://stackoverflow.com/questions/74633896/processpoolexecutor-using-map-hang-on-large-load
[^2]: python/cpython#74028
[^3]: python/cpython#114975 (review)
[^4]: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
KangOl committed Jun 6, 2024
1 parent 6a7f050 commit 077e33d
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/util/snippets.py
@@ -279,7 +279,7 @@ def convert_html_columns(cr, table, columns, converter_callback, where_column="I
         convert = Convertor(converters, converter_callback)
         for query in util.log_progress(split_queries, logger=_logger, qualifier=f"{table} updates"):
             cr.execute(query)
-            for data in executor.map(convert, cr.fetchall()):
+            for data in executor.map(convert, cr.fetchall(), chunksize=1000):
                 if "id" in data:
                     cr.execute(update_query, data)
