-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Update parquet metadata format description around index levels #18201
Conversation
lgtm. merge when ready @cpcloud |
@cpcloud ? |
Codecov Report
@@ Coverage Diff @@
## master #18201 +/- ##
==========================================
- Coverage 91.59% 91.55% -0.04%
==========================================
Files 153 153
Lines 51257 51257
==========================================
- Hits 46949 46929 -20
- Misses 4308 4328 +20
Continue to review full report at Codecov.
|
@cpcloud good to go? |
I think the example at then end also should be updated: https://github.com/cpcloud/pandas/blame/6183078033adc62c1ccf043dac15d5b4f121fdac/doc/source/developer.rst#L157 (it still uses |
Having dived a little bit in the pyarrow side of the code in apache/arrow#1386, I have two design-level comments or questions related to the index names:
I think those changes are not yet in a released version of pyarrow? (and the last one is also not yet implemented in fastparquet I think to align with pyarrow) |
Well, we are running out of time to make any more such changes for Arrow 0.8.0. What about using the disambiguated names only when there is a naming conflict? |
cc @cpcloud |
We introduced this to accommodate that use case specifically. See ARROW-1754
I don't understand this comment. Can you give an example of what isn't possible or is cumbersome here?
That's also true for some cases independent of how we deal with index names. For example @jorisvandenbossche Can you give some examples to illustrate your points? |
Yes, I know, that's why I mentioned it. But I would say to that issue: "pitty, won't fix, this is not supported" (or of course: raise an informative error message. The solution for the user can simply be to specify
In apache/arrow#1386, I needed to do the following to get the field name corresponding to a specific pandas metadata column entry:
It might be that I am missing something and the above is overly complicated. But if you would have a
I would think such an operation (getting the field name corresponding to a pandas column name) is something rather common (it is also something that they will need in fastparquet), and it is of course not "impossible" (that was a bit strong stated), but just rather hard. This is indeed not fully related to the index names discussion, because even if we would revert the "always using |
@cpcloud Another illustration of the possible difficulties with the mapping between the pandas metadata and the actual field names in the schema: currently this mapping is possible in
Of course, this could (and maybe should) be fixed in |
I'm looking into this on the fastparquet / dask side.
Joris' suggestion seems sensible to me. |
Whatever the decision for this discussion, this PR should close #16391 (where Wes proposes an additional name entry for the index to disambiguate with column names) |
@jorisvandenbossche @TomAugspurger Thanks for commenting. I agree adding |
Thanks.
Thanks. Once ARROW-1895 is ready I'll update fastparquet's metadata writer
to be compliant.
…On Wed, Dec 6, 2017 at 1:50 PM, Phillip Cloud ***@***.***> wrote:
@jorisvandenbossche <https://github.com/jorisvandenbossche> @TomAugspurger
<https://github.com/tomaugspurger> Thanks for commenting. I agree adding
'field_name' will be helpful and will allow writing index metadata that
fastparquet can read as well. PR is in the works for arrow. I'll also
update this PR with the changes.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#18201 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIloCpiOVkaff3vA-Q9JLE_2Cs5_wks5s9u_vgaJpZM4QYi4Q>
.
|
Thanks! Happy to review when there is a PR. Regarding my other point (the second bullet point here: #18201 (comment)) about always using The reason that I argue for this is the following: as long as one uses I understand that the 0.8.0 release is close, so that you might not want to reconsider this at this point. But I can certainly do a PR with the change to help out (I think it should be a relatively straightforward patch, just as apache/arrow#1271 was also not that a big patch) |
Here's the PR apache/arrow#1397 |
This is unrelated to the changes here, but I noticed pyarrow uses "string" for the "pandas_type": >>> df = pd.DataFrame({"A": [1, 2]})
>>> sink = io.BytesIO()
>>> pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>>> json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['column_indexes']
[{'metadata': None,
'name': None,
'numpy_type': 'object',
'pandas_type': 'string'}] Though the spec says just |
@TomAugspurger That only appears in the |
@jorisvandenbossche @TomAugspurger Thanks for ironing this stuff out, it's very helpful to have a third and fourth pair of eyes on this code. |
We could always use guids for the index column names in the Parquet schema instead of the generated dunder-names (and use the index column metadata to list the guids corresponding to each index level). I think this is the last blocking issue for Arrow 0.8.0 -- there are a couple other small items I'm going to address today/tomorrow |
I took the guid approach back when I first did this and it wasn't a viable solution because there are new guids generated for every dataframe. It becomes really difficult to do something like a concat operation under this scheme. You need reliable and reproducible names for the index columns. |
Ah, good point, since you need to be deterministic across files |
If unique column names are not required in parquet and/or arrow, I think the |
Regarding the preserving of the index name if possible, I quickly tried this and opened apache/arrow#1408, just as a proof of concept that this is IMO not difficult to do while also keeping the bug fix of ARROW-1754 |
What's the recommended way for determining whether |
The actual name is stored in the metadata (name vs field_name) and can be null I think? |
Ah, it wasn't null as of as of 0.7.1: In [12]: import pandas as pd
In [13]: import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq
In [14]: df = pd.DataFrame({"x": [1, 2]})
In [15]: pq.write_table(pa.Table.from_pandas(df), '/tmp/a.parq')
In [16]: json.loads(pq.read_metadata('/tmp/a.parq').metadata[b'pandas'])
Out[16]:
{'columns': [{'metadata': None,
'name': 'x',
'numpy_type': 'int64',
'pandas_type': 'int64'},
{'metadata': None,
'name': '__index_level_0__',
'numpy_type': 'int64',
'pandas_type': 'int64'}],
'index_columns': ['__index_level_0__'],
'pandas_version': '0.21.0'} But it is null in 0.8.0: In [8]: import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq
In [9]: df = pd.DataFrame({"x": [1, 2]})
In [10]: pq.write_table(pa.Table.from_pandas(df), '/tmp/b.parq')
In [11]: json.loads(pq.read_metadata('/tmp/b.parq').metadata[b'pandas'])
Out[11]:
{'column_indexes': [{'field_name': None,
'metadata': {'encoding': 'UTF-8'},
'name': None,
'numpy_type': 'object',
'pandas_type': 'unicode'}],
'columns': [{'field_name': 'x',
'metadata': None,
'name': 'x',
'numpy_type': 'int64',
'pandas_type': 'int64'},
{'field_name': '__index_level_0__',
'metadata': None,
'name': None,
'numpy_type': 'int64',
'pandas_type': 'int64'}],
'index_columns': ['__index_level_0__'],
'pandas_version': '0.21.0'}
In [12]: pa.__version__
Out[12]: '0.8.0' |
https://issues.apache.org/jira/browse/ARROW-1941 was just opened and I noticed that the pyarrow metadata for nested types were like
I don't believe that nested types are mentioned anywhere for |
…that preserves index name if available Related to the discussion about the pandas metadata specification in pandas-dev/pandas#18201, and an alternative to #1271. I don't open this PR because it should necessarily be merged, I just want to show that it is not that difficult to both fix [ARROW-1754](https://issues.apache.org/jira/browse/ARROW-1754) and preserve index names as field names when possible (as this was mentioned in pandas-dev/pandas#18201 as the reason to make this change to not preserve index names). The diff is partly a revert of #1271, but then adapted to the current codebase. Main reasons I prefer to preserve index names: 1) usability in pyarrow itself (if you would want to work with pyarrow Tables created from pandas) and 2) when interchanging parquet files with other people / other non-pandas systems, then it would be much nicer to not have `__index_level_n__` column names if possible. Author: Joris Van den Bossche <[email protected]> Closes #1408 from jorisvandenbossche/index-names and squashes the following commits: eef1d33 [Joris Van den Bossche] alternative fix for duplicate index/column name that preserves index name if available
@cpcloud can you update this? (a.o. recent changes in apache/arrow#1408 affect this) |
@cpcloud update on this? |
@jorisvandenbossche Any chance you can submit a PR to my fork for the changes you have in mind? |
ping @jorisvandenbossche @cpcloud can you rebase when you have a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can merge this, and make any update in a separate PR. Just there is a required change, as we are not using the right directive for the example.
|
||
Here's an example of how the index metadata is structured in pyarrow: | ||
|
||
.. code-block:: python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. code-block:: python | |
.. ipython:: python |
@datapythonista that is fine. or if @cpcloud can update would be great. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked, and the code block is correct, as it doesn't show any output, and it's actually not expected to run (some variables are not initialized).
lgtm then
@jreback can we merge this then?
sure |
thanks @cpcloud |
cc @wesm @jcrist