
Memory leaks when extracting features using LocalDaskDistributor with large Dataframes #534

Closed
isaacWpark opened this issue May 6, 2019 · 6 comments


isaacWpark commented May 6, 2019

Hi, I'm attempting to use extract_features on a large dataframe with a LocalDaskDistributor, and I am encountering the following warning:

distributed.worker - WARNING - gc.collect() took 1.732s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.

This is followed by the warning below, which repeats until the kernel ultimately freezes:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.89 GB -- Worker memory limit: 11.38 GB

This warning persists regardless of whether very large (>100,000) or very small (10) chunksizes are specified in extract_features (see the chunksize sketch after the example below), and it appears to occur whenever features are extracted from a dataframe with ~100,000 rows using the LocalDaskDistributor.

I'm using Anaconda Python 3.6.8 on Windows 7 Professional, with tsfresh v0.11.2.
Example code that reproduces the problem follows:

from tsfresh.examples.robot_execution_failures import \
    download_robot_execution_failures, \
    load_robot_execution_failures
from tsfresh.feature_extraction import extract_features
from tsfresh.utilities.distribution import LocalDaskDistributor
import copy
import pandas as pd


download_robot_execution_failures()
df, y = load_robot_execution_failures()

# Build a large dataframe (~133,000 rows) by stacking shifted copies of the example data
df_comb = copy.deepcopy(df)
for x in range(100):
    df_iter = df.copy()
    df_iter['id'] = df_iter['id'] + df_comb['id'].max()  # keep ids unique across copies
    df_comb = pd.concat([df_comb, df_iter], axis=0)
    print(len(df_comb))

df_comb.reset_index(inplace=True, drop=True)


Distributor = LocalDaskDistributor(n_workers=3)

X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     distributor=Distributor)
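
For reference, the chunksizes mentioned above were passed via the chunksize argument of extract_features, e.g. (a sketch of that variant of the call):

X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     chunksize=10,  # values above 100,000 were also tried
                     distributor=Distributor)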

Thanks!

@mmann1123

I have been having the same issue; just wondering if anyone has found a way around this. Thanks

@dbarbier
Contributor

dbarbier commented Nov 22, 2019

There does not seem to be any memory leak; IMO the problem is that some features (at least approximate_entropy, maybe others?) require nrows*nrows memory storage, which is quite large if nrows > 100,000. Try with the default_fc_parameters=EfficientFCParameters() option, as sketched below.
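
Something along these lines (a sketch, reusing df_comb from the original report):

from tsfresh.feature_extraction import extract_features, EfficientFCParameters

# Restrict the extraction to the cheaper feature calculators, which skips the
# high-cost ones such as approximate_entropy
X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     default_fc_parameters=EfficientFCParameters())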

BTW I am surprised that LocalDaskDistributor sets processes=False and thus uses multithreading. You could try with

from distributed import LocalCluster, Client
from tsfresh.utilities.distribution import ClusterDaskDistributor

cluster = LocalCluster(n_workers=3)  # defaults to separate worker processes
client = Client(cluster)
address = client.scheduler_info()['address']
Distributor = ClusterDaskDistributor(address)

and check whether it runs faster or not. Or, more simply, just stick with the default distributor.
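
If you do try the cluster route, the Distributor is passed to extract_features exactly as in your original snippet, and closing the client and cluster afterwards frees the workers (a sketch only):

X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     default_fc_parameters=EfficientFCParameters(),
                     distributor=Distributor)

# release the local cluster once extraction is done
client.close()
cluster.close()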

@dbarbier
Contributor

I ran a test on a smaller dataset, 14,520 rows instead of 133,200; here are the timings on my laptop (2 CPUs) for:

  X = extract_features(timeseries_container=df_comb,
                       column_id='id', column_sort='time',
                       default_fc_parameters=EfficientFCParameters())
  • no extra option: 2'12"
  • n_jobs=0: 3'47"
  • distributor=LocalDaskDistributor(n_workers=2): 6'15"
  • distributor=ClusterDaskDistributor(...): 3'11"

To be honest, I think that LocalDaskDistributor should be removed. ClusterDaskDistributor is useful for running on a cluster.

@nils-braun
Collaborator

You are probably correct, @dbarbier; LocalDaskDistributor is merely for testing out the Dask processing, not really for production use.

@nils-braun
Collaborator

Can we close this issue? Is the proposed solution ok?

@nils-braun
Collaborator

Feel free to reopen if it is not :-) I just want to clean up issues!
