Flatten json #94
Conversation
```diff
 if self._flat_df is None:
     if self.expand:
-        self._flat_df, self._columns_map = pandas.flatten_df(self.df)
+        self._flat_df = pd.DataFrame(flatten(i) for i in self.raw)
+        self._flat_df["_key"] = self.df.get(
```
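For context, here is a minimal sketch of what a `flatten(...)` call like the one above does. This is not the real amirziai/flatten implementation, just an illustration of joining nested keys with an underscore separator before feeding the records to `pd.DataFrame`:

```python
def flatten(obj, sep="_", prefix=""):
    """Recursively flatten nested dicts/lists into a single-level dict.

    Toy sketch for illustration; the real flatten library has more options.
    """
    out = {}
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    for key, value in items:
        new_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        out.update(flatten(value, sep=sep, prefix=new_prefix))
    return out

flatten({"a": {"b": [1, 2]}, "c": 3})
# → {"a_b_0": 1, "a_b_1": 2, "c": 3}
```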
We need the same formatted keys as in the original df.
```diff
@@ -40,8 +43,12 @@ def process_df(df: pd.DataFrame) -> pd.DataFrame:
     df["_type"] = df["_type"].astype("category")
     return df

-    def get_origin_column_name(self, column_name: str) -> str:
-        return self._columns_map.get(column_name, column_name)
+    def origin_column_name(self, new: str) -> str:
```
KISS approach. I believe it will work in most cases.
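The lookup the diff above replaces can be sketched like this (hypothetical class and mapping, shown only to illustrate the `dict.get` fallback behavior):

```python
class FlatDF:
    """Hypothetical holder illustrating the flat-to-original name lookup."""

    def __init__(self, columns_map):
        # maps flattened column name -> original (nested) column name
        self._columns_map = columns_map

    def origin_column_name(self, new: str) -> str:
        # Fall back to the name itself when it was never renamed (the KISS part)
        return self._columns_map.get(new, new)

df = FlatDF({"a_b": "a.b"})
df.origin_column_name("a_b")  # → "a.b"
df.origin_column_name("c")    # → "c" (unchanged)
```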
```python
        {"type": [0, [2, 3]], "str": "s", "type_0": 6},
    ],
    {
        "str": ["k", "s"],
```
Here we lose type 0,0
Codecov Report
```diff
@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   71.04%   70.47%   -0.57%
==========================================
  Files          24       23       -1
  Lines        1592     1565      -27
  Branches      275      273       -2
==========================================
- Hits         1131     1103      -28
  Misses        436      436
- Partials       25       26       +1
```
Continue to review full report at Codecov.
so far the fastest solution
This one significantly increases performance and decreases memory usage when using `flat_df`, which corresponds to the `expand=True` default parameter. Some things to consider:

Data is lost if a flattened column name equals another original column name, see amirziai/flatten#48. I think it won't happen in 99% of cases; I've never encountered one. Alternatively, we can set a different separator, which potentially bumps it to 99.9%. Nevertheless, that issue does exist, and there is also a comment about it in the tests.
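The collision can be shown without the library at all: once nested values are flattened into joined key names, a clash with a literal top-level key is an ordinary dict-key collision, so whichever write happens last silently wins. The flattened keys below are written out by hand to mirror the test case from this PR:

```python
# Flattening "type": [0, [2, 3]] with a "_" separator yields these keys:
nested = {"type_0": 0, "type_1_0": 2, "type_1_1": 3}
# ...but the record also has its own literal top-level "type_0" key:
literal = {"type_0": 6}

# Merging them is a plain dict update: the later value overwrites the earlier,
# and the nested type[0] == 0 is silently lost.
merged = {**nested, **literal}
merged["type_0"]  # → 6
```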
Time and peak memory, 1k samples from a heavily nested df (see `shared/qa/Experiments/data/items_epicsports_150k_nested.pickle`):

This PR - 1.63 s, 2400 MB
Old - 4 min 57 s, 4411.51 MiB
#74 - 6 min 2 s, 2800 MB

And, interestingly, all 150k items with this PR:
9 min 40 s
peak memory: 3682.77 MiB, increment: 1391.34 MiB