flatten_df is too slow #22

Closed
manycoding opened this issue Oct 23, 2018 · 7 comments

@manycoding
Contributor

  1. Can it be rewritten to not use recursion?

  2. If not, profile and see how to improve. To test, use jobs with nested fields and a considerable amount of items.

  3. Add tests to check the speed
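A speed check along the lines of item 3 could look roughly like the sketch below. It uses only the standard library; `make_nested_items`, `flatten_records`, and the 5-second budget are all hypothetical placeholders, and `flatten_records` merely stands in for the real function under test.

```python
import time


def make_nested_items(n_items, n_fields=5):
    """Generate synthetic items with one level of nested dict fields."""
    return [
        {"id": i, **{f"field_{j}": {"a": i, "b": j} for j in range(n_fields)}}
        for i in range(n_items)
    ]


def flatten_records(items):
    """Hypothetical stand-in for the function under test."""
    flat = []
    for item in items:
        row = {}
        for key, value in item.items():
            if isinstance(value, dict):
                # Expand one level of nesting into prefixed keys.
                for sub_key, sub_value in value.items():
                    row[f"{key}_{sub_key}"] = sub_value
            else:
                row[key] = value
        flat.append(row)
    return flat


def elapsed_seconds(func, *args):
    """Return the wall-clock time of a single call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def test_flatten_speed():
    items = make_nested_items(5000)
    # The budget is arbitrary; tune it to the hardware the tests run on.
    assert elapsed_seconds(flatten_records, items) < 5.0
```

A wall-clock assertion like this can be flaky on shared CI machines, so a generous budget that only catches order-of-magnitude regressions is probably the safer choice.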

@manycoding
Contributor Author

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

Note
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

@manycoding manycoding transferred this issue from another repository Mar 18, 2019
@manycoding
Contributor Author

What if, instead of concatenating each time, we keep the new columns somewhere (in memory or on disk) and concatenate once at the end?
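A minimal sketch of that idea, assuming the nested fields are dict-valued columns: each expanded block is collected in a list and `pd.concat` is called a single time. `expand_dict_columns` is a hypothetical name, not the project's `flatten_df`, and it handles only one level of nesting.

```python
import pandas as pd


def expand_dict_columns(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Expand one level of dict-valued columns, concatenating only once."""
    pieces = []  # untouched columns and expanded blocks, in column order
    for name in df.columns:
        column = df[name]
        if column.map(lambda v: isinstance(v, dict)).any():
            # Non-dict cells (e.g. missing values) become empty rows.
            expanded = pd.DataFrame(
                [v if isinstance(v, dict) else {} for v in column],
                index=df.index,
            )
            expanded.columns = [f"{name}{sep}{sub}" for sub in expanded.columns]
            pieces.append(expanded)
        else:
            pieces.append(column)
    if not pieces:
        return df.copy()
    # Single concat at the end instead of one copy of the data per column.
    return pd.concat(pieces, axis=1)
```

Applying the function repeatedly until no dict-valued columns remain would cover deeper nesting, at the cost of one concat per level rather than one per column.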

@ivankivanov
Member

I ran several experiments with different versions. Results for the latest one:

  • 5k items:
  1. current code - 55 sec
  2. new code - first level - 631 ms (5000 - 713), second level - 4 secs - (5000 - 763)
  • 150k items
  1. current code - out of memory for 16 GB
  2. new code - first level - 8.18 s - (141762, 777), second level - >1 min

The notebook is placed in: /shared/qa/Experiments/flat.ipynb
and the new branch for tests is flatten_df.

@manycoding
Contributor Author

manycoding commented Apr 23, 2019

@ivankivanov It might be a good idea to generate a df for testing so we can share notebooks here.

The idea looks doable: basically, we replace recursion with a loop.
I don't see the flatten_df branch.
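For illustration, replacing recursion with a loop could look like the sketch below, which flattens a single nested item with an explicit stack. `flatten_item` is a hypothetical helper on plain dicts and lists, not the code from the pull request.

```python
def flatten_item(item, sep="_"):
    """Flatten nested dicts and lists without recursion, using an explicit stack."""
    flat = {}
    stack = [("", item)]  # (key prefix, value still to be processed)
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            children = value.items()
        elif isinstance(value, list):
            children = enumerate(value)
        else:
            # Leaf value: store it under the accumulated key.
            flat[prefix] = value
            continue
        # Empty dicts and lists have no children, so they are dropped.
        for key, child in children:
            name = f"{prefix}{sep}{key}" if prefix else str(key)
            stack.append((name, child))
    return flat
```

Because the pending work lives in a list rather than on the call stack, nesting depth is bounded by memory rather than by Python's recursion limit.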

@ivankivanov
Member

This is the pull request:

#74

Notebook:

shared/qa/Experiments/flattenFinal.ipynb

@ivankivanov
Member

Synced with master.
Latest notebook with performance results: shared/qa/Experiments/flatten_df.ipynb

current code - 58.9 sec - full
new code - 6.68 sec - full
150k items
current code - out of memory for 32 GB
new code - first level - 7.96 s - (141762, 777), second level ~ 11 min

@manycoding manycoding added this to the 0.3.6 milestone Jun 6, 2019