
Memory leaks when extracting features using LocalDaskDistributor with large Dataframes #534

Closed
isaacWpark opened this issue May 6, 2019 · 6 comments


isaacWpark commented May 6, 2019

Hi, I'm attempting to use extract_features on a large dataframe with a LocalDaskDistributor, and I am encountering the following warning:

distributed.worker - WARNING - gc.collect() took 1.732s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.

This is followed by the warning below, which repeats until the kernel ultimately freezes:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.89 GB -- Worker memory limit: 11.38 GB

This warning persists regardless of whether very large (>100,000) or very small (10) chunksizes are specified in extract_features (see the chunksize sketch after the example below), and it appears to occur whenever features are extracted from a dataframe with ~100,000 rows using the LocalDaskDistributor.

I'm using Anaconda Python 3.6.8 on Windows 7 Professional, with tsfresh v0.11.2.
Example code that reproduces the problem follows:

from tsfresh.examples.robot_execution_failures import \
    download_robot_execution_failures, \
    load_robot_execution_failures
from tsfresh.feature_extraction import extract_features
from tsfresh.utilities.distribution import LocalDaskDistributor
import copy
import pandas as pd


download_robot_execution_failures()
df, y = load_robot_execution_failures()

# Build a large dataframe (~133,000 rows) by stacking shifted copies of the example data
df_comb = copy.deepcopy(df)
for x in range(100):
    df_iter = df.copy()
    df_iter['id'] = df_iter['id'] + df_comb['id'].max()  # keep ids unique across copies
    df_comb = pd.concat([df_comb, df_iter], axis=0)
    print(len(df_comb))

df_comb.reset_index(inplace=True, drop=True)


Distributor = LocalDaskDistributor(n_workers=3)

X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     distributor=Distributor)
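
For reference, the chunksizes mentioned above were passed via the chunksize argument of extract_features, e.g. (a sketch of that variant of the call):

X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     chunksize=10,  # values above 100,000 were also tried
                     distributor=Distributor)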

Thanks!

@mmann1123

I have been having the same issue; just wondering if anyone has found a way around this. Thanks

@dbarbier
Contributor

dbarbier commented Nov 22, 2019

There does not seem to be any memory leak; IMO the problem is that some features (at least approximate_entropy, maybe others?) require nrows*nrows memory storage, which is quite large if nrows > 100,000. Try with the default_fc_parameters=EfficientFCParameters() option, as sketched below.
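
Something along these lines (a sketch, reusing df_comb from the original report):

from tsfresh.feature_extraction import extract_features, EfficientFCParameters

# Restrict the extraction to the cheaper feature calculators, which skips the
# high-cost ones such as approximate_entropy
X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     default_fc_parameters=EfficientFCParameters())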

BTW I am surprised that LocalDaskDistributor sets processes=False and thus uses multithreading. You could try with

from distributed import LocalCluster, Client
from tsfresh.utilities.distribution import ClusterDaskDistributor

cluster = LocalCluster(n_workers=3)  # defaults to separate worker processes
client = Client(cluster)
address = client.scheduler_info()['address']
Distributor = ClusterDaskDistributor(address)

and check whether it runs faster or not. Or, more simply, just stick with the default distributor.
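
If you do try the cluster route, the Distributor is passed to extract_features exactly as in your original snippet, and closing the client and cluster afterwards frees the workers (a sketch only):

X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     default_fc_parameters=EfficientFCParameters(),
                     distributor=Distributor)

# release the local cluster once extraction is done
client.close()
cluster.close()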

@dbarbier
Contributor

I ran a test on a smaller dataset, 14,520 rows instead of 133,200; here are the timings on my laptop (2 CPUs) for:

  X = extract_features(timeseries_container=df_comb,
                       column_id='id', column_sort='time',
                       default_fc_parameters=EfficientFCParameters())
  • no extra option: 2'12"
  • n_jobs=0: 3'47"
  • distributor=LocalDaskDistributor(n_workers=2): 6'15"
  • distributor=ClusterDaskDistributor(...): 3'11"

To be honest, I think that LocalDaskDistributor should be removed. ClusterDaskDistributor is useful for running on a cluster.

@nils-braun
Collaborator

You are probably correct, @dbarbier; LocalDaskDistributor is merely for testing out the Dask processing, not really for production use.

@nils-braun
Collaborator

Can we close this issue? Is the proposed solution ok?

@nils-braun
Collaborator

Feel free to reopen if it is not :-) I just want to clean up issues!
