Memory leaks when extracting features using LocalDaskDistributor with large Dataframes #534
Comments
I have been having the same issue; just wondering if anyone has found a way around this. Thanks!
There does not seem to be any memory leak; IMO the problem is that some features (at least …) … BTW I am surprised that … You could try something like:

```python
from distributed import LocalCluster, Client
from tsfresh.utilities.distribution import ClusterDaskDistributor

cluster = LocalCluster(n_workers=3)
client = Client(cluster)
address = client.scheduler_info()['address']
Distributor = ClusterDaskDistributor(address)
```

and check whether it runs faster or not. Or more simply, just stick with the default distributor.
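For context, a minimal sketch of how such a distributor would be handed to tsfresh; the DataFrame `df_comb`, its column names, and the parameter set are taken from the snippet further down this thread and are otherwise assumptions:

```python
from distributed import Client, LocalCluster
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.utilities.distribution import ClusterDaskDistributor

# Start a local Dask scheduler with a few worker processes.
cluster = LocalCluster(n_workers=3)
client = Client(cluster)

# Point tsfresh at the running scheduler.
Distributor = ClusterDaskDistributor(client.scheduler_info()['address'])

# df_comb is assumed to be a long-format DataFrame with 'id' and 'time' columns.
X = extract_features(
    timeseries_container=df_comb,
    column_id='id',
    column_sort='time',
    default_fc_parameters=EfficientFCParameters(),
    distributor=Distributor,
)
```

The `distributor` argument of `extract_features` is the hook for swapping out the default multiprocessing backend.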
I made a test on a smaller dataset, 14,520 rows instead of 133,200. Here are timings on my laptop (2 CPUs) for:

```python
X = extract_features(timeseries_container=df_comb,
                     column_id='id', column_sort='time',
                     default_fc_parameters=EfficientFCParameters())
```
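For illustration only (no timings are claimed here), a comparison of the default multiprocessing backend against `LocalDaskDistributor` could be set up roughly like this; `timed_extract` is a hypothetical helper, and `df_comb`, the column names, and the parameter set follow the snippet above:

```python
import time

from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.utilities.distribution import LocalDaskDistributor

def timed_extract(distributor=None):
    # distributor=None lets tsfresh fall back to its default
    # multiprocessing-based distributor.
    # df_comb: the same long-format DataFrame as in the snippet above.
    start = time.time()
    X = extract_features(
        timeseries_container=df_comb,
        column_id='id',
        column_sort='time',
        default_fc_parameters=EfficientFCParameters(),
        distributor=distributor,
    )
    label = type(distributor).__name__ if distributor else "default"
    print(f"{label}: {time.time() - start:.1f} s")
    return X

timed_extract()                                   # default multiprocessing
timed_extract(LocalDaskDistributor(n_workers=2))  # local Dask workers
```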
To be honest, I think that LocalDaskDistributor should be removed. ClusterDaskDistributor is useful to run on a cluster.
Probably you are correct @dbarbier,
Can we close this issue? Is the proposed solution OK?
Feel free to reopen if it is not :-) I just want to clean up issues!
Hi, I’m attempting to use extract_features on a large DataFrame using a LocalDaskDistributor, and am encountering the following warning:
distributed.worker - WARNING - gc.collect() took 1.732s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
This is followed by the warning below, which repeats until the kernel ultimately freezes:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.89 GB -- Worker memory limit: 11.38 GB
These warnings persist regardless of whether very large (>100,000) or very small (10) chunksize values are specified in extract_features, and appear whenever I attempt to extract features from a DataFrame with ~100,000 rows using the LocalDaskDistributor.
I'm using Anaconda Python 3.6.8 on Windows 7 Professional, with tsfresh v0.11.2.
Example code which produces these warnings follows:
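The snippet itself does not appear above. A minimal sketch of the kind of call described in this report, where the DataFrame `df`, its source, the column names, the parameter set, and the worker count are all assumptions rather than the reporter's actual code:

```python
import pandas as pd

from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.utilities.distribution import LocalDaskDistributor

# df: a long-format DataFrame with ~100,000+ rows, an 'id' column that
# groups the individual time series and a 'time' column for ordering.
df = pd.read_csv('timeseries.csv')  # placeholder for the real data source

distributor = LocalDaskDistributor(n_workers=3)

X = extract_features(
    timeseries_container=df,
    column_id='id',
    column_sort='time',
    default_fc_parameters=EfficientFCParameters(),
    distributor=distributor,
    # chunksize=10,  # varying this (10 up to >100,000) reportedly made no difference
)
```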
Thanks!