Ability to set the multiprocessing context (to "spawn") in core.StatsForecast. #948

Closed
christian-hnz opened this issue Nov 26, 2024 · 3 comments

Comments

@christian-hnz

christian-hnz commented Nov 26, 2024

Description

Currently, the ProcessPoolExecutor in core._StatsForecast._forecast_parallel does not set the mp_context parameter and offers no direct way of doing so. On Linux, the default start method, fork, is therefore always used. This can lead to surprising lockups in programs that use threads, with no clean way to resolve them when they occur.

The polars library uses threads extensively and explicitly warns against fork in its docs. Its recent 1.14 release added an explicit warning which (for the above reason) is always shown when using polars alongside StatsForecast with n_jobs>=2.

Python will switch away from fork as the default start method in Python 3.14.

Here is a minimal example.

from datetime import date

import polars as pl
import statsforecast.core
import statsforecast.models

df = pl.DataFrame(
    {
        "unique_id": ["a", "a", "a", "b", "b", "b"],
        "ds": (date(2000, i, 1) for i in (1, 2, 3, 1, 2, 3)),
        "y": range(6),
    },
    schema_overrides={"y": pl.Float64},
)

sf = statsforecast.core.StatsForecast(
    models=[statsforecast.models.Naive()],
    freq="1mo",
    n_jobs=2,
)
sf.forecast(h=12, df=df)

generates

RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.

The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.

See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.

on Linux.

Of course, one can hack around this, e.g.,

import concurrent.futures
import multiprocessing
from datetime import date
from unittest.mock import patch

import polars as pl
import statsforecast.core
import statsforecast.models

df = pl.DataFrame(
    {
        "unique_id": ["a", "a", "a", "b", "b", "b"],
        "ds": (date(2000, i, 1) for i in (1, 2, 3, 1, 2, 3)),
        "y": range(6),
    },
    schema_overrides={"y": pl.Float64},
)

def generate_pool(n_jobs: int) -> concurrent.futures.ProcessPoolExecutor:
    return concurrent.futures.ProcessPoolExecutor(
        max_workers=n_jobs, mp_context=multiprocessing.get_context("spawn")
    )

sf = statsforecast.core.StatsForecast(
    models=[statsforecast.models.Naive()],
    freq="1mo",
    n_jobs=2,
)

with patch("statsforecast.core.ProcessPoolExecutor", generate_pool):
    sf.forecast(h=12, df=df)

but that is clearly not optimal.

Setting the context to "spawn" adds some overhead, and some users may prefer to stick with "fork"; hence, having a parameter would be nice.
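To illustrate what such a parameter might look like, here is a minimal sketch using only the standard library. The names `resolve_context`, `make_executor`, and `start_method` are hypothetical and are not part of the statsforecast API; the sketch only shows how an optional start method could be forwarded to `ProcessPoolExecutor` through `mp_context`.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from typing import Optional


def resolve_context(start_method: Optional[str] = None):
    """Return a multiprocessing context; None selects the platform default."""
    return multiprocessing.get_context(start_method)


def make_executor(
    n_jobs: int, start_method: Optional[str] = None
) -> ProcessPoolExecutor:
    """Hypothetical helper: build the pool with an explicit start method."""
    return ProcessPoolExecutor(
        max_workers=n_jobs,
        mp_context=resolve_context(start_method),
    )
```

With a default of `None`, existing behavior would stay unchanged, while callers who use threaded libraries could pass `"spawn"` explicitly.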

Use case

Using statsforecast alongside current polars versions.

@jmoralez
Member

Hey @christian-hnz, thanks for the detailed report. We only use dicts of numpy arrays in multiprocessing, so I don't think this would cause any problems with polars. Did you experience any issues (apart from polars' warning)?

@jmoralez
Member

Seems like you can set the start method in your notebook/script with multiprocessing.set_start_method. If you do that then our ProcessPoolExecutor should be able to pick that up.
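A minimal sketch of that approach, assuming it runs at the top of the entry-point script before any pool is created. The helper name `use_spawn` is hypothetical. `multiprocessing.set_start_method` raises a RuntimeError if the start method has already been fixed, so the sketch checks first and returns whichever method is actually active.

```python
import multiprocessing


def use_spawn() -> str:
    """Make "spawn" the process-wide default if no method is fixed yet.

    Returns the start method that is active afterwards.
    """
    # allow_none=True reports the current state without fixing a default.
    if multiprocessing.get_start_method(allow_none=True) is None:
        multiprocessing.set_start_method("spawn")
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    # Call this first, then build StatsForecast(..., n_jobs=2) and forecast.
    print(use_spawn())
```

A `ProcessPoolExecutor` created without `mp_context` uses the default context, so it follows this process-wide setting.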

@christian-hnz
Author

Thanks for the reply and the insight!

No, I did not run into any real issues using polars with statsforecast (and I've run quite heavy jobs, so I'm rather confident there is no real issue here). Also, a warnings.filterwarnings call is enough to deal with the newly added polars fork warning.

True, multiprocessing.set_start_method is an option, but not an optimal choice outside scripting.
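For reference, a minimal sketch of such a filter. The helper name `silence_polars_fork_warning` is hypothetical, and the message pattern is taken from the start of the warning text quoted above; it would need adjusting if polars changes the wording.

```python
import warnings


def silence_polars_fork_warning() -> None:
    """Ignore the RuntimeWarning whose text starts with the polars fork notice."""
    warnings.filterwarnings(
        "ignore",
        # The pattern is a regex matched against the start of the message.
        message=r"Using fork\(\) can cause Polars to deadlock",
        category=RuntimeWarning,
    )
```

This only hides the message; the fork start method is still in use.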
