ENH: add to/from_parquet with pyarrow & fastparquet #15838

Merged 1 commit on Aug 2, 2017
1 change: 1 addition & 0 deletions ci/install_travis.sh
@@ -153,6 +153,7 @@ fi
echo
echo "[removing installed pandas]"
conda remove pandas -y --force
pip uninstall -y pandas

if [ "$BUILD_TEST" ]; then

2 changes: 1 addition & 1 deletion ci/requirements-2.7.sh
@@ -4,4 +4,4 @@ source activate pandas

echo "install 27"

conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet
4 changes: 2 additions & 2 deletions ci/requirements-3.5.sh
@@ -4,8 +4,8 @@ source activate pandas

echo "install 35"

conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1

# pip install python-dateutil to get latest
conda remove -n pandas python-dateutil --force
pip install python-dateutil

conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
2 changes: 1 addition & 1 deletion ci/requirements-3.5_OSX.sh
@@ -4,4 +4,4 @@ source activate pandas

echo "install 35_OSX"

conda install -n pandas -c conda-forge feather-format==0.3.1
conda install -n pandas -c conda-forge feather-format==0.3.1 fastparquet
1 change: 1 addition & 0 deletions ci/requirements-3.6.pip
@@ -0,0 +1 @@
brotlipy
2 changes: 2 additions & 0 deletions ci/requirements-3.6.run
@@ -17,6 +17,8 @@ pymysql
feather-format
pyarrow
psycopg2
python-snappy
fastparquet
beautifulsoup4
s3fs
xarray
2 changes: 1 addition & 1 deletion ci/requirements-3.6_DOC.sh
@@ -6,6 +6,6 @@ echo "[install DOC_BUILD deps]"

pip install pandas-gbq

conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc
conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc fastparquet

conda install -n pandas -c r r rpy2 --yes
2 changes: 2 additions & 0 deletions ci/requirements-3.6_WIN.run
@@ -13,3 +13,5 @@ numexpr
pytables
matplotlib
blosc
fastparquet
pyarrow
1 change: 1 addition & 0 deletions doc/source/install.rst
@@ -237,6 +237,7 @@ Optional Dependencies
* `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
* `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
* `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
* ``Apache Parquet Format``, either `pyarrow <http://arrow.apache.org/docs/python/>`__ (>= 0.4.1) or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ (>= 0.0.6), necessary for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ libraries are available for compression support.
* `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

* `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
82 changes: 78 additions & 4 deletions doc/source/io.rst
@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
@@ -209,7 +210,7 @@ buffer_lines : int, default None
.. deprecated:: 0.19.0

Argument removed because its value is not respected by the parser

compact_ints : boolean, default False
.. deprecated:: 0.19.0

@@ -4087,7 +4088,7 @@ control compression: ``complevel`` and ``complib``.
``complevel`` specifies if and how hard data is to be compressed.
``complevel=0`` and ``complevel=None`` disables
compression and ``0<complevel<10`` enables compression.

``complib`` specifies which compression library to use. If nothing is
specified the default library ``zlib`` is used. A
compression library usually optimizes for either good
@@ -4102,9 +4103,9 @@ control compression: ``complevel`` and ``complib``.
- `blosc <http://www.blosc.org/>`_: Fast compression and decompression.

.. versionadded:: 0.20.2

Support for alternative blosc compressors:

- `blosc:blosclz <http://www.blosc.org/>`_ This is the
default compressor for ``blosc``
- `blosc:lz4
@@ -4545,6 +4546,79 @@ Read from a feather file.
import os
os.remove('example.feather')


.. _io.parquet:

Parquet
-------

.. versionadded:: 0.21.0

`Parquet <https://parquet.apache.org/>`__ provides a partitioned binary columnar serialization for data frames. It is designed to
make reading and writing data frames efficient, and to make sharing data across data analysis
languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
while still maintaining good read performance.

Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
dtypes, including extension dtypes such as datetime with tz.

Several caveats:

- The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an
  error if a non-default one is provided. You can ``.reset_index()`` to store the index as a column,
  or ``.reset_index(drop=True)`` to ignore it (see the sketch after this list).
- Duplicate column names and non-string column names are not supported.
- Categorical dtypes are currently not supported (for ``pyarrow``).
- Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
  on an attempt at serialization.
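
A brief sketch of the index caveat above (``df`` and the file name are placeholders, and it assumes one of the engines is installed):

.. code-block:: python

   df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

   # keep the index by turning it into an ordinary column ...
   df.reset_index().to_parquet('example.parquet')

   # ... or discard it entirely
   df.reset_index(drop=True).to_parquet('example.parquet')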

You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, ``fastparquet``, or ``auto``.
If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
then ``pyarrow`` is tried first, falling back to ``fastparquet``.

See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
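
For example, the default engine can be set globally through the ``io.parquet.engine`` option, or chosen per call with the
``engine`` keyword; a minimal sketch (``df`` and the file name are illustrative):

.. code-block:: python

   # set the default engine used by read_parquet / to_parquet
   pd.set_option('io.parquet.engine', 'fastparquet')

   # or select an engine explicitly for a single call
   df.to_parquet('example.parquet', engine='pyarrow')
   result = pd.read_parquet('example.parquet', engine='pyarrow')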

.. note::

These engines are very similar and should read/write nearly identical parquet format files.

Member:

Looking at the tests, there seem to be some differences in what data types they support? If that is correct, we should maybe mention it here?

These libraries differ in their underlying dependencies (``fastparquet`` uses ``numba``, while ``pyarrow`` uses a C library).

.. ipython:: python

df = pd.DataFrame({'a': list('abc'),
'b': list(range(1, 4)),
'c': np.arange(3, 6).astype('u1'),
'd': np.arange(4.0, 7.0, dtype='float64'),
'e': [True, False, True],
'f': pd.date_range('20130101', periods=3),
'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
'h': pd.date_range('20130101', periods=3, freq='ns')})

df
df.dtypes

Write to a parquet file.

.. ipython:: python

df.to_parquet('example_pa.parquet', engine='pyarrow')

Member:

This raises an error, since categorical is not supported by pyarrow?

Contributor Author:

https://issues.apache.org/jira/browse/ARROW-1285
(and going to remove the cat; we test this but docs didn't get updated)

df.to_parquet('example_fp.parquet', engine='fastparquet')

Member:

This also fails because fastparquet is not installed in the doc build.
But rather than adding it to the doc build, maybe we can make this a code block (just showing how to specify the engine)? That way we don't further burden the doc build with more dependencies (pyarrow is already included for feather).


Read from a parquet file.

.. ipython:: python

result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
result = pd.read_parquet('example_fp.parquet', engine='fastparquet')

result.dtypes
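
The ``compression`` keyword of ``to_parquet`` selects the codec; a brief sketch (not executed in the doc build, and the file name is illustrative):

.. code-block:: python

   # 'snappy' is the default; 'gzip' and 'brotli' are also accepted,
   # subject to the codecs available in your environment
   df.to_parquet('example_gzip.parquet', engine='pyarrow', compression='gzip')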

.. ipython:: python
:suppress:

import os
os.remove('example_pa.parquet')
os.remove('example_fp.parquet')

.. _io.sql:

SQL Queries
3 changes: 3 additions & 0 deletions doc/source/options.rst
@@ -414,6 +414,9 @@ io.hdf.default_format None default format writing format,
'table'
io.hdf.dropna_table True drop ALL nan rows when appending
to a table
io.parquet.engine None The engine to use as a default for
parquet reading and writing. If None
then try 'pyarrow' and 'fastparquet'
mode.chained_assignment warn Raise an exception, warn, or no
action if trying to use chained
assignment, The default is warn
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -78,6 +78,7 @@ Other Enhancements
- :func:`DataFrame.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)
- :func:`date_range` now accepts 'YS' in addition to 'AS' as an alias for start of year (:issue:`9313`)
- :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
- Integration with Apache Parquet, including a new top-level ``pd.read_parquet()`` function and a ``DataFrame.to_parquet()`` method, see :ref:`here <io.parquet>`.

.. _whatsnew_0210.api_breaking:

12 changes: 12 additions & 0 deletions pandas/core/config_init.py
@@ -465,3 +465,15 @@ def _register_xlsx(engine, other):
except ImportError:
# fallback
_register_xlsx('openpyxl', 'xlsxwriter')

# Set up the io.parquet specific configuration.
parquet_engine_doc = """
: string
The default parquet reader/writer engine. Available options:
'auto', 'pyarrow', 'fastparquet', the default is 'auto'
"""

with cf.config_prefix('io.parquet'):
cf.register_option(
'engine', 'auto', parquet_engine_doc,
validator=is_one_of_factory(['auto', 'pyarrow', 'fastparquet']))
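
The registered option then behaves like any other pandas option; a minimal sketch of an interactive session (values assume the defaults registered above):

    import pandas as pd

    pd.get_option('io.parquet.engine')             # 'auto' (the registered default)
    pd.set_option('io.parquet.engine', 'pyarrow')  # switch the default engine

    # values outside {'auto', 'pyarrow', 'fastparquet'} are rejected by the
    # is_one_of_factory validator and raise an error
    pd.set_option('io.parquet.engine', 'csv')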
24 changes: 24 additions & 0 deletions pandas/core/frame.py
@@ -1598,6 +1598,30 @@ def to_feather(self, fname):
from pandas.io.feather_format import to_feather
to_feather(self, fname)

def to_parquet(self, fname, engine='auto', compression='snappy',
**kwargs):
"""
Write a DataFrame to the binary parquet format.

.. versionadded:: 0.21.0

Parameters
----------
fname : str
string file path

Contributor:

I haven't played with either of the engines, but do they both share similar semantics on the path argument? Does it have to be a string, or can it be an open file object, or pathlib.Path? Can it be an s3 path?

Contributor:

In the case of fastparquet, this is anything that can be passed to open; and you can specify what function to open files with (open_with=), which must return a file-like; this is how you open with s3 etc., by passing S3FileSystem.open.

Only in dask can you supply something like "s3://user:pass@bucket/path", and get it parsed to pass the correct open_with automatically.

Member (@wesm, Apr 1, 2017):

It might be useful for pandas to handle the conversion of a file path into a file-like object for semantic conformity. An exception would be if a particular engine can do better with a local file path -- as an example, in pyarrow, we memory-map local files, which generally has better performance than Python file objects.

Contributor:

Please note that parquet data-sets are not necessarily single-file, so I don't think it's a great idea to pass open files, local or otherwise.

Contributor:

On the other hand, from a Dask perspective it might be nice to one day rely entirely on a pandas.read_parquet function for chunk-wise logic. In this case we would want to hand pandas a file-like object and ask it to get us a few particular row groups from that object. If inconvenient I don't think we should worry about this use case near-term. I just thought I'd bring it up.

Contributor Author (@jreback, Apr 1, 2017):

we handle path_or_buffers in the following way:

  • path-like objects (pathlib and py.local), we stringify
  • if a string
    • if it's a URL we turn this into a bytes object (also handling gzip content encoding)
    • if it's an s3 URL we defer to s3fs for opening
    • else we would do things like expand_user
    • we can infer a compression from the filepath itself (we just pass this through if it's found),
      mainly useful for text files where we decompress.
  • file-like objects we pass through
  • for csv reading we will handle the file io & encoding
  • all others we pass the string-path through

So I don't see any reason to handle this differently. The IO engine gets to handle a fully qualified string path; the others (e.g. HDF5, Excel, pickle, JSON) all look the same to pandas. The IO engine is in charge of opening and closing the actual files.

Member:

Please note that parquet data-sets are not necessarily single-file, so I don't think it's a great idea to pass open files, local or otherwise.

For this, there's a good argument that pandas should define a file system abstract interface that 3rd parties can implement. In practice in dask and pandas, this is already the case, but it may be worth defining with more formal rigor (as far as pandas is concerned at least) to help with API conformity. pandas doesn't really have a "plugin" API, but this is something to consider more and more as we try to be less monolithic

engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
    Parquet library to use. If 'auto', then the option
    'io.parquet.engine' is used; if that option is also 'auto',
    then the first installed library is used.
compression : str, optional, default 'snappy'
compression method, includes {'gzip', 'snappy', 'brotli'}
kwargs
Additional keyword arguments passed to the engine
"""
from pandas.io.parquet import to_parquet
to_parquet(self, fname, engine,
compression=compression, **kwargs)

@Substitution(header='Write out column names. If a list of string is given, \
it is assumed to be aliases for the column names')
@Appender(fmt.docstring_to_string, indents=1)
1 change: 1 addition & 0 deletions pandas/io/api.py
@@ -13,6 +13,7 @@
from pandas.io.sql import read_sql, read_sql_table, read_sql_query
from pandas.io.sas import read_sas
from pandas.io.feather_format import read_feather
from pandas.io.parquet import read_parquet
from pandas.io.stata import read_stata
from pandas.io.pickle import read_pickle, to_pickle
from pandas.io.packers import read_msgpack, to_msgpack
4 changes: 2 additions & 2 deletions pandas/io/feather_format.py
@@ -19,7 +19,7 @@ def _try_import():
"you can install via conda\n"
"conda install feather-format -c conda-forge\n"
"or via pip\n"
"pip install feather-format\n")
"pip install -U feather-format\n")

try:
feather.__version__ >= LooseVersion('0.3.1')
@@ -29,7 +29,7 @@ def _try_import():
"you can install via conda\n"
"conda install feather-format -c conda-forge"
"or via pip\n"
"pip install feather-format\n")
"pip install -U feather-format\n")

return feather
