Implement writing pandas metadata and auto-setting cats/index #151

martindurant · 2017-05-17T20:22:05Z

Only copes with a single index, not a compound (multi-level) index, as has
been the case in fastparquet so far.

See the spec


wesm

This looks reasonable to me. Dask will provide us with integration tests (we can test pyarrow write -> fastparquet read, and vice versa), so that will help suss out any inconsistencies on either side.

wesm · 2017-05-19T15:23:42Z

fastparquet/test/test_output.py

+    pf = ParquetFile(fn)
+    assert set(pf.columns) == {'x', 'y', 'z'}
+    meta = json.loads(pf.key_value_metadata['pandas'])
+    assert meta['index_columns'] == ['z']
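
For readers unfamiliar with the layout the test above relies on, here is a minimal sketch of how a reader might use the `pandas` key to separate index columns from data columns. The metadata blob is a trimmed, hypothetical example (the real spec carries more fields), and `split_columns` is an illustrative helper, not part of fastparquet.

```python
import json

# Hypothetical, trimmed-down metadata; only the fields used below are shown.
pandas_metadata = {
    "index_columns": ["z"],
    "columns": [
        {"name": "x", "pandas_type": "int64"},
        {"name": "y", "pandas_type": "float64"},
        {"name": "z", "pandas_type": "int64"},
    ],
}

# Parquet key/value metadata stores strings, so the blob is JSON-encoded.
key_value_metadata = {"pandas": json.dumps(pandas_metadata)}


def split_columns(all_columns, key_value_metadata):
    """Return (index_columns, data_columns) based on the 'pandas' key.

    Illustrative helper: falls back to "no index" when the key is absent.
    """
    meta = json.loads(key_value_metadata.get("pandas", "{}"))
    index_cols = meta.get("index_columns", [])
    data_cols = [c for c in all_columns if c not in index_cols]
    return index_cols, data_cols


index_cols, data_cols = split_columns(["x", "y", "z"], key_value_metadata)
```

With this blob, `z` would be restored as the index and `x`, `y` as ordinary columns.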


So one deficiency for named indexes is that there may be a conflict between the index name and a column name in the DataFrame. I opened pandas-dev/pandas#16391, which is something we can add in a forward-compatible way.

We currently check for duplicate column names after reset_index, so you would get an error in this case.
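
To make the conflict concrete, here is a small sketch using plain pandas only. It does not reproduce the writer's own duplicate-name check; it just shows that an index named like an existing column cannot be turned into a column, and that in current pandas `reset_index` itself raises.

```python
import pandas as pd

# The index and a data column share the name "x".
df = pd.DataFrame({"x": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="x"))

# A simple pre-check a writer could make before calling reset_index.
name_clash = df.index.name in df.columns

try:
    df.reset_index()
    reset_failed = False
except ValueError:
    # pandas refuses to insert a column whose name already exists.
    reset_failed = True
```

Renaming the index before writing (for example with `df.rename_axis("idx")`) avoids the clash.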

wesm · 2017-05-19T15:27:24Z

fastparquet/writer.py

@@ -671,6 +678,7 @@ def make_metadata(data, has_nulls=True, ignore_columns=[], fixed_text=None,
            se.repetition_type = parquet_thrift.FieldRepetitionType.OPTIONAL
        fmd.schema.append(se)
        root.num_children += 1
+    meta.value = json.dumps(pandas_metadata)
    cats.value = json.dumps(catstruct)


Do you want to drop this one in favor of the pandas metadata?

You are right, this should be removed. On reading I give a warning when this old version is present.
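
The read-side behaviour described here could look roughly like the sketch below. This is not the code in the PR: `LEGACY_KEY` and `read_categories` are hypothetical names chosen for illustration, and the sketch assumes categorical columns are marked with `pandas_type` set to `"categorical"` in the pandas metadata.

```python
import json
import warnings

# Hypothetical name for the old category key; the real key is not shown here.
LEGACY_KEY = "legacy_cats"


def read_categories(key_value_metadata):
    """Return categorical column names from the 'pandas' metadata.

    Warns if the old-style category key is still present in the file.
    """
    if LEGACY_KEY in key_value_metadata:
        warnings.warn(
            "File contains old-style category metadata; "
            "using the 'pandas' metadata instead.",
            UserWarning,
        )
    meta = json.loads(key_value_metadata.get("pandas", "{}"))
    return [
        col["name"]
        for col in meta.get("columns", [])
        if col.get("pandas_type") == "categorical"
    ]


kv = {
    LEGACY_KEY: json.dumps({"y": 3}),
    "pandas": json.dumps(
        {"columns": [{"name": "x", "pandas_type": "int64"},
                     {"name": "y", "pandas_type": "categorical"}]}
    ),
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cats = read_categories(kv)
```

Files written after this change would carry only the `pandas` key, so the warning should fire only for older files.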

Implement writing pandas metadata and auto-setting cats/index

f56a2d3

Only copes with single index, not compound (as was the case in fastparquet so far).

martindurant mentioned this pull request May 17, 2017

Index column inference issues in read_parquet dask/dask#2222

Closed

wesm reviewed May 19, 2017


Remove old category key/values on writing

b288f57

martindurant merged commit bbb03e2 into dask:master May 19, 2017

martindurant deleted the pandas_metadata branch May 19, 2017 16:38
