ENH: add to/from_parquet with pyarrow & fastparquet #15838
@@ -0,0 +1 @@
brotlipy
@@ -17,6 +17,8 @@ pymysql
feather-format
pyarrow
psycopg2
python-snappy
fastparquet
beautifulsoup4
s3fs
xarray
@@ -13,3 +13,5 @@ numexpr
pytables
matplotlib
blosc
fastparquet
pyarrow
@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
    binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
    binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
    binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
    binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
    binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
    binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
    binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
@@ -209,7 +210,7 @@ buffer_lines : int, default None
    .. deprecated:: 0.19.0

       Argument removed because its value is not respected by the parser

compact_ints : boolean, default False
    .. deprecated:: 0.19.0

@@ -4087,7 +4088,7 @@ control compression: ``complevel`` and ``complib``.
``complevel`` specifies if and how hard data is to be compressed.
``complevel=0`` and ``complevel=None`` disables
compression and ``0<complevel<10`` enables compression.

``complib`` specifies which compression library to use. If nothing is
specified the default library ``zlib`` is used. A
compression library usually optimizes for either good
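
As a small aside from this editor (not part of the diff), the ``complevel``/``complib`` behaviour described above could be exercised as follows; this sketch assumes PyTables is installed and the file names are arbitrary::

   import numpy as np
   import pandas as pd

   df = pd.DataFrame(np.random.randn(1000, 4), columns=list('abcd'))

   # complevel=0 (or None) disables compression entirely
   df.to_hdf('uncompressed.h5', 'df', complevel=0)

   # 0 < complevel < 10 enables compression; zlib is the default library
   df.to_hdf('zlib.h5', 'df', complevel=5)

   # pick a specific compression library with complib
   df.to_hdf('blosc.h5', 'df', complevel=9, complib='blosc')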
@@ -4102,9 +4103,9 @@ control compression: ``complevel`` and ``complib``.
- `blosc <http://www.blosc.org/>`_: Fast compression and decompression.

  .. versionadded:: 0.20.2

     Support for alternative blosc compressors:

     - `blosc:blosclz <http://www.blosc.org/>`_ This is the
       default compressor for ``blosc``
     - `blosc:lz4
@@ -4545,6 +4546,79 @@ Read from a feather file.
   import os
   os.remove('example.feather')

.. _io.parquet:

Parquet
-------

.. versionadded:: 0.21.0

`Parquet <https://parquet.apache.org/>`__ provides a partitioned binary columnar serialization for data frames. It is designed to
make reading and writing data frames efficient, and to make sharing data across data analysis
languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
while still maintaining good read performance.

Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
dtypes, including extension dtypes such as datetime with tz.

Several caveats:

- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame``, and will raise an
  error if a non-default one is provided. You can ``.reset_index()`` to store the index as a column, or
  ``.reset_index(drop=True)`` to drop it before writing (see the sketch after this list).
- Duplicate column names and non-string column names are not supported.
- Categorical dtypes are currently not supported (for ``pyarrow``).
- Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
  on an attempt at serialization.
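
For example, a minimal sketch of working around the index caveat above (editor's illustration, not part of this diff; the file names are arbitrary and ``pyarrow`` is assumed to be installed)::

   import pandas as pd

   df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

   # keep the index by first turning it into an ordinary, string-named column ...
   df.reset_index().to_parquet('with_index.parquet', engine='pyarrow')

   # ... or discard the non-default index entirely
   df.reset_index(drop=True).to_parquet('no_index.parquet', engine='pyarrow')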

You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, ``fastparquet``, or ``auto``.
If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
then ``pyarrow`` is tried first, falling back to ``fastparquet``.

See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.

.. note::

   These engines are very similar and should read/write nearly identical parquet format files.
   They differ in their underlying dependencies (``fastparquet`` uses ``numba``, while ``pyarrow`` uses a C library).

.. ipython:: python

   df = pd.DataFrame({'a': list('abc'),
                      'b': list(range(1, 4)),
                      'c': np.arange(3, 6).astype('u1'),
                      'd': np.arange(4.0, 7.0, dtype='float64'),
                      'e': [True, False, True],
                      'f': pd.date_range('20130101', periods=3),
                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
                      'h': pd.date_range('20130101', periods=3, freq='ns')})

   df
   df.dtypes

Write to a parquet file.

.. ipython:: python

   df.to_parquet('example_pa.parquet', engine='pyarrow')
   df.to_parquet('example_fp.parquet', engine='fastparquet')

Review comment (on the ``pyarrow`` line above): This raises an error, since categorical is not supported by pyarrow?
https://issues.apache.org/jira/browse/ARROW-1285

Review comment (on the ``fastparquet`` line above): This also fails because fastparquet is not installed in the doc build.

Read from a parquet file.

.. ipython:: python

   result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
   result = pd.read_parquet('example_fp.parquet', engine='fastparquet')

   result.dtypes

.. ipython:: python
   :suppress:

   import os
   os.remove('example_pa.parquet')
   os.remove('example_fp.parquet')

.. _io.sql:

SQL Queries
@@ -1598,6 +1598,30 @@ def to_feather(self, fname):
        from pandas.io.feather_format import to_feather
        to_feather(self, fname)

    def to_parquet(self, fname, engine='auto', compression='snappy',
                   **kwargs):
        """
        Write a DataFrame to the binary parquet format.

        .. versionadded:: 0.21.0

        Parameters
        ----------
        fname : str
            string file path

Review comments on the ``fname`` parameter:

- I haven't played with either of the engines, but do they both share similar semantics on the path
  argument? Does it have to be a string, or can it be an open file object, or
- In the case of fastparquet, this is anything that can be passed to open; and you can specify what
  function to open files with ( Only in dask can you supply something like "s3://user:pass@bucket/path",
  and get it parsed to pass the correct open_with automatically.
- It might be useful for pandas to handle the conversion of a file path into a file-like object for
  semantic conformity. An exception would be unless a particular engine can do better with a local file
  path -- as an example, in pyarrow, we memory map local files which has generally better performance
  than Python
- Please note that parquet data-sets are not necessarily single-file, so I don't think it's a great idea
  to pass open files, local or otherwise.
- On the other hand, from a Dask perspective it might be nice to one day rely entirely on a
- We handle path_or_buffers in the following way:
  So I don't see any reason to handle this differently. The IO engine gets to handle a fully qualified
  string path. (e.g. HDF5, excel, pickle, json) look all the same to pandas. The IO engine is in charge
  of opening and closing the actual files.
- For this, there's a good argument that pandas should define a file system abstract interface that 3rd
  parties can implement. In practice in dask and pandas, this is already the case, but it may be worth
  defining with more formal rigor (as far as pandas is concerned at least) to help with API conformity.
  pandas doesn't really have a "plugin" API, but this is something to consider more and more as we try
  to be less monolithic.
        engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
            Parquet library to use. If 'auto', then the option
            'io.parquet.engine' is used; if that is also 'auto', then
            'pyarrow' is tried first, falling back to 'fastparquet'.
        compression : str, optional, default 'snappy'
            compression method, includes {'gzip', 'snappy', 'brotli'}
        kwargs
            Additional keyword arguments passed to the engine
        """
        from pandas.io.parquet import to_parquet
        to_parquet(self, fname, engine,
                   compression=compression, **kwargs)

    @Substitution(header='Write out column names. If a list of string is given, \
it is assumed to be aliases for the column names')
    @Appender(fmt.docstring_to_string, indents=1)
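
A short usage sketch of the ``to_parquet`` method defined above (editor's illustration, not part of the patch; the file names are arbitrary and the compression values follow the docstring)::

   import pandas as pd

   df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

   # default snappy compression, engine resolved via the 'auto' rules
   df.to_parquet('data_snappy.parquet')

   # explicit engine and an alternative compression codec
   df.to_parquet('data_gzip.parquet', engine='fastparquet', compression='gzip')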
Review comment: Looking at the tests, there seem to be some differences in what data types they support? If that is correct, we should maybe mention it here?