flatten_df is too slow #22

manycoding · 2018-10-23T17:24:38Z

Can it be rewritten to avoid recursion?
If not, profile it and see how to improve it. To test, use jobs with nested fields and a considerable number of items.
Add tests to check the speed.

manycoding · 2019-01-16T12:43:16Z

Profiling:
https://notebook-qa-generic.scrapinghub.com/notebooks/GATF/Experiments/flatten_and_whitespaces.ipynb

manycoding · 2019-03-09T14:17:38Z

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

Note
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

manycoding · 2019-04-09T20:56:59Z

What if, instead of concatenating each time, we keep the new columns somewhere (in memory or on disk) and concatenate once at the end?
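A minimal sketch of that idea, assuming nested fields arrive as dict-valued columns. The function name `expand_dict_columns` and the `parent_key` column naming are hypothetical and not taken from the project's code. Each expanded piece is collected in a list and `pd.concat` is called once, so the data is copied a single time instead of once per nested column.

```python
import pandas as pd


def expand_dict_columns(df):
    """Expand dict-valued columns one level, concatenating only once at the end."""
    pieces = []   # expanded frames, kept in memory until the final concat
    nested = []   # names of the columns that were expanded
    for name in df.columns:
        col = df[name]
        if col.map(lambda v: isinstance(v, dict)).any():
            # Cells that are not dicts are treated as empty dicts (assumption).
            records = [v if isinstance(v, dict) else {} for v in col]
            expanded = pd.DataFrame(records, index=df.index)
            expanded.columns = [f"{name}_{key}" for key in expanded.columns]
            pieces.append(expanded)
            nested.append(name)
    flat = df.drop(columns=nested)
    # Single concat over all collected pieces instead of one per column.
    return pd.concat([flat, *pieces], axis=1)
```

This handles one nesting level per call. Deeper levels would need repeated calls or a different approach.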

ivankivanov · 2019-04-22T12:34:43Z

I ran several experiments with different versions. Results for the latest one:

5k items:

current code - 55 sec
new code - first level - 631 ms (5000 - 713), second level - 4 secs - (5000 - 763)

150k items:

current code - out of memory for 16 GB
new code - first level - 8.18 s - (141762, 777), second level - >1 min

The notebook is placed in: /shared/qa/Experiments/flat.ipynb
and the new branch for tests is flatten_df.

manycoding · 2019-04-23T00:06:34Z

@ivankivanov It might be a good idea to generate a df for testing so we can share notebooks here.

The idea looks doable: basically, we replace recursion with a loop.
I don't see the flatten_df branch.
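To illustrate replacing recursion with a loop, here is a rough sketch that flattens a single nested item using an explicit stack. The name `flatten_item`, the `_` separator and the index-based keys for list elements are assumptions made for the example. This is neither the code from the pull request nor the flatten_json library.

```python
def flatten_item(item, sep="_"):
    """Flatten a nested dict into a single-level dict without recursion."""
    flat = {}
    # Each stack entry is (flattened key so far, value still to inspect).
    stack = [(str(key), value) for key, value in item.items()]
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict) and value:
            for key, child in value.items():
                stack.append((f"{prefix}{sep}{key}", child))
        elif isinstance(value, list) and value:
            for i, child in enumerate(value):
                stack.append((f"{prefix}{sep}{i}", child))
        else:
            # Scalars (and empty containers) are stored as-is.
            flat[prefix] = value
    return flat
```

Flattening every item to a plain dict first means the DataFrame can be built once from the list of flat dicts, rather than grown column by column.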

ivankivanov · 2019-04-25T16:07:15Z

This is the pull request:

#74

Notebook:

shared/qa/Experiments/flattenFinal.ipynb

ivankivanov · 2019-05-17T22:02:22Z

sync with master
last notebook with performance: shared/qa/Experiments/flatten_df.ipynb

current code - 58.9 sec - full
new code - 6.68 sec - full
150k items
current code - out of memory for 32 GB
new code - first level - 7.96 s - (141762, 777), second level ~ 11 min

manycoding assigned ivankivanov Nov 13, 2018

manycoding unassigned ivankivanov Mar 9, 2019

manycoding transferred this issue from another repository Mar 18, 2019

manycoding added the Type: Performance label Mar 18, 2019

manycoding added a commit that referenced this issue May 23, 2019

Flatten raw data with flatten_json, closes #74, #22

3963d65

manycoding closed this as completed May 28, 2019

manycoding added this to the 0.3.6 milestone Jun 6, 2019
