-
-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement writing pandas metadata and auto-setting cats/index #151
Conversation
Only copes with single index, not compound (as was the case in fastparquet so far).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me. Dask will provide us integration tests (we can test pyarrow write -> fastparquet read, and vice versa) so that will help suss out any inconsistencies on either side
pf = ParquetFile(fn) | ||
assert set(pf.columns) == {'x', 'y', 'z'} | ||
meta = json.loads(pf.key_value_metadata['pandas']) | ||
assert meta['index_columns'] == ['z'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So one deficiency for named indexes is that there may be a conflict between the index name and a column name in the DataFrame. I opened pandas-dev/pandas#16391, which is something we can add in a forward-compatible way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently check against duplicate column names after reset_index, so you would get an error for this case
fastparquet/writer.py
Outdated
@@ -671,6 +678,7 @@ def make_metadata(data, has_nulls=True, ignore_columns=[], fixed_text=None, | |||
se.repetition_type = parquet_thrift.FieldRepetitionType.OPTIONAL | |||
fmd.schema.append(se) | |||
root.num_children += 1 | |||
meta.value = json.dumps(pandas_metadata) | |||
cats.value = json.dumps(catstruct) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to drop this one in favor of the pandas metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this should be removed. On reading I give a warning when this old version is present.
Only copes with single index, not compound (as was the case in fastparquet
so far).
See the spec