Flatten json #94
Conversation
```diff
 if self._flat_df is None:
     if self.expand:
-        self._flat_df, self._columns_map = pandas.flatten_df(self.df)
+        self._flat_df = pd.DataFrame(flatten(i) for i in self.raw)
+        self._flat_df["_key"] = self.df.get(
```
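For context, here is a minimal sketch of what a `flatten(...)` call like the one above does. This is not the real amirziai/flatten implementation, just an illustration of joining nested keys with an underscore separator before feeding the records to `pd.DataFrame`:

```python
def flatten(obj, sep="_", prefix=""):
    """Recursively flatten nested dicts/lists into a single-level dict.

    Toy sketch for illustration; the real flatten library has more options.
    """
    out = {}
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    for key, value in items:
        new_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        out.update(flatten(value, sep=sep, prefix=new_prefix))
    return out

flatten({"a": {"b": [1, 2]}, "c": 3})
# → {"a_b_0": 1, "a_b_1": 2, "c": 3}
```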
We need the same formatted keys as in the original df.
```diff
@@ -40,8 +43,12 @@ def process_df(df: pd.DataFrame) -> pd.DataFrame:
     df["_type"] = df["_type"].astype("category")
     return df

-    def get_origin_column_name(self, column_name: str) -> str:
-        return self._columns_map.get(column_name, column_name)
+    def origin_column_name(self, new: str) -> str:
```
KISS approach. I believe it will work in most cases.
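The lookup the diff above replaces can be sketched like this (hypothetical class and mapping, shown only to illustrate the `dict.get` fallback behavior):

```python
class FlatDF:
    """Hypothetical holder illustrating the flat-to-original name lookup."""

    def __init__(self, columns_map):
        # maps flattened column name -> original (nested) column name
        self._columns_map = columns_map

    def origin_column_name(self, new: str) -> str:
        # Fall back to the name itself when it was never renamed (the KISS part)
        return self._columns_map.get(new, new)

df = FlatDF({"a_b": "a.b"})
df.origin_column_name("a_b")  # → "a.b"
df.origin_column_name("c")    # → "c" (unchanged)
```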
```python
        {"type": [0, [2, 3]], "str": "s", "type_0": 6},
    ],
    {
        "str": ["k", "s"],
```
Here we lose type 0,0
Codecov Report
```diff
@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   71.04%   70.47%   -0.57%
==========================================
  Files          24       23       -1
  Lines        1592     1565      -27
  Branches      275      273       -2
==========================================
- Hits         1131     1103      -28
  Misses        436      436
- Partials       25       26       +1
```
Continue to review full report at Codecov.
so far the fastest solution
This one significantly increases performance and decreases memory usage when using `flat_df`, which corresponds to the `expand=True` default parameter. Some things to consider:

Data is lost if a flattened column name equals another original column name, see amirziai/flatten#48. I think it won't happen in 99% of cases; I've never encountered one. Alternatively, we can set a different separator, which potentially bumps it to 99.9%. Nevertheless, that issue does exist, and there is also a comment about it in the tests.
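The collision can be shown without the library at all: once nested values are flattened into joined key names, a clash with a literal top-level key is an ordinary dict-key collision, so whichever write happens last silently wins. The flattened keys below are written out by hand to mirror the test case from this PR:

```python
# Flattening "type": [0, [2, 3]] with a "_" separator yields these keys:
nested = {"type_0": 0, "type_1_0": 2, "type_1_1": 3}
# ...but the record also has its own literal top-level "type_0" key:
literal = {"type_0": 6}

# Merging them is a plain dict update: the later value overwrites the earlier,
# and the nested type[0] == 0 is silently lost.
merged = {**nested, **literal}
merged["type_0"]  # → 6
```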
Time and peak memory, 1k samples from a heavily nested df (see `shared/qa/Experiments/data/items_epicsports_150k_nested.pickle`):

This PR - 1.63 s, 2400 MB
Old - 4 min 57 s, 4411.51 MiB
#74 - 6 min 2 s, 2800 MB

And, interestingly, all 150k items with this PR:
9 min 40 s
peak memory: 3682.77 MiB, increment: 1391.34 MiB