Flatten json #94

Merged
merged 3 commits into from
May 27, 2019

@manycoding manycoding commented May 23, 2019

This PR significantly improves performance and reduces memory usage when flat_df is used, which corresponds to the default expand=True parameter.
Some things to consider:

  1. If the original column name and an expanded column name are the same, the original column will be dropped.
    See "Data is lost if unflatten column name equals to another original column", amirziai/flatten#48.
    I think this won't happen in 99% of cases; I've never encountered one. Alternatively, we can set a different separator, which potentially bumps it to 99.9%.
    Nevertheless, the issue exists, and there is also a comment about it in the tests.
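
To make the collision concrete, here is a minimal sketch of separator-based flattening. It is not the flatten_json implementation; the `flatten` helper below and its `separator` and `prefix` parameters are written only for illustration, and the sample item is the one used in the tests. In this sketch the later key overwrites the earlier one, so which value survives depends on key order.

```python
from typing import Any, Dict


def flatten(item: Any, separator: str = "_", prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into a single-level dict (illustrative only)."""
    flat: Dict[str, Any] = {}
    if isinstance(item, dict):
        pairs = item.items()
    elif isinstance(item, list):
        pairs = enumerate(item)
    else:
        # Scalar value: store it under the name built so far.
        flat[prefix] = item
        return flat
    for key, value in pairs:
        name = f"{prefix}{separator}{key}" if prefix else str(key)
        # A later key with the same flattened name silently overwrites an earlier one.
        flat.update(flatten(value, separator, name))
    return flat


item = {"type": [0, [2, 3]], "str": "s", "type_0": 6}

# "type"[0] flattens to "type_0", which collides with the original "type_0" column.
collided = flatten(item)

# A less common separator keeps both values apart for this item.
safer = flatten(item, separator="__")
```

With the default separator the item yields four keys and the value `0` from `type[0]` is gone; with `"__"` it yields five keys and nothing is lost. A different separator only makes the collision less likely, it does not rule it out.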

Time and peak memory for 1k samples from a heavily nested df (see shared/qa/Experiments/data/items_epicsports_150k_nested.pickle):
This PR - 1.63s, 2400 MB
Old - 4min 57s, 4411.51 MiB
#74 - 6min 2s, 2800 MB

And, interestingly, all 150k items with this PR:
9min 40s
peak memory: 3682.77 MiB, increment: 1391.34 MiB
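
The PR does not say how these figures were collected. As a rough, stdlib-only sketch of taking a comparable measurement, the hypothetical `measure` helper below times a callable and records peak Python-level allocations with `tracemalloc`. The sample rows and the workload passed to it are placeholders, not the dataset above.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(func: Callable[[], Any]) -> Tuple[Any, float, float]:
    """Run func once; return (result, seconds, peak MiB of Python allocations)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = func()
        elapsed = time.perf_counter() - started
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 2**20


# Placeholder workload: stringify 1000 small nested rows.
rows = [{"id": i, "tags": [i, [i + 1, i + 2]]} for i in range(1000)]
result, seconds, peak_mib = measure(lambda: [str(row) for row in rows])
print(f"{seconds:.4f}s, peak {peak_mib:.2f} MiB")
```

`tracemalloc` slows the measured code down and only sees allocations made through Python's allocator, so its numbers are not directly comparable with process-level peak memory like the figures quoted above.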

if self._flat_df is None:
    if self.expand:
        self._flat_df, self._columns_map = pandas.flatten_df(self.df)
        self._flat_df = pd.DataFrame(flatten(i) for i in self.raw)
        self._flat_df["_key"] = self.df.get(

We need keys formatted the same way as in the original df.

@@ -40,8 +43,12 @@ def process_df(df: pd.DataFrame) -> pd.DataFrame:
        df["_type"] = df["_type"].astype("category")
        return df

    def get_origin_column_name(self, column_name: str) -> str:
        return self._columns_map.get(column_name, column_name)
    def origin_column_name(self, new: str) -> str:

KISS approach. I believe it will work in most cases.
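
The body of `origin_column_name` is not shown in this hunk. One simple way such a lookup could work, purely as an assumption and not the code merged here, is to return the name unchanged if it is an original column and otherwise pick the longest original column that prefixes it. The standalone function, its `original_columns` argument, and the `"_"` separator are all hypothetical.

```python
from typing import Iterable


def origin_column_name(
    new: str, original_columns: Iterable[str], separator: str = "_"
) -> str:
    """Guess which original column a flattened column name came from."""
    columns = list(original_columns)
    if new in columns:
        # Also covers the ambiguous case where a flattened name equals an original one.
        return new
    # Original columns that prefix the flattened name, e.g. "type" for "type_1_0".
    candidates = [c for c in columns if new.startswith(c + separator)]
    # Prefer the longest match; fall back to the name itself if nothing matches.
    return max(candidates, key=len) if candidates else new


columns = ["type", "str", "type_0"]
print(origin_column_name("type_1_0", columns))  # type
print(origin_column_name("str", columns))       # str
```

A prefix lookup like this stays simple because it needs no stored column map, at the cost of guessing wrong when one original column name is a prefix of another.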

{"type": [0, [2, 3]], "str": "s", "type_0": 6},
],
{
"str": ["k", "s"],

@manycoding manycoding May 23, 2019


Here we lose type 0,0


codecov bot commented May 23, 2019

Codecov Report

Merging #94 into master will decrease coverage by 0.56%.
The diff coverage is 92.85%.


@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   71.04%   70.47%   -0.57%     
==========================================
  Files          24       23       -1     
  Lines        1592     1565      -27     
  Branches      275      273       -2     
==========================================
- Hits         1131     1103      -28     
  Misses        436      436              
- Partials       25       26       +1
Impacted Files                Coverage Δ
src/arche/rules/others.py     100% <100%> (ø) ⬆️
src/arche/readers/items.py    84.07% <92.3%> (-0.05%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 94ea065...5bf6383.


@ivankivanov ivankivanov left a comment


so far the fastest solution

@manycoding manycoding merged commit 7fee5f2 into master May 27, 2019
@manycoding manycoding deleted the flatten_json branch May 27, 2019 17:23
@manycoding manycoding modified the milestone: 0.3.6 Jun 6, 2019