BUG: Problem using to_csv with BytesIO #37292

Closed
2 of 3 tasks
amotl opened this issue Oct 20, 2020 · 7 comments
Labels
IO CSV read_csv, to_csv Usage Question
Milestone

Comments

@amotl

amotl commented Oct 20, 2020

Dear people of Pandas,

first things first: Thanks for all of your excellent work conceiving and maintaining Pandas. You know who you are.

With kind regards,
Andreas.


  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
from io import BytesIO

df = pd.DataFrame()
buffer = BytesIO()

df.to_csv(buffer)

Problem description

The snippet above croaks with

TypeError: a bytes-like object is required, not 'str'

We expected this to work. However, we will be happy to learn otherwise. We also had a look at #22555 and #35129 which seem to be related but not exactly on the spot.
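The error message itself can be reproduced without pandas. The following is a minimal sketch using only the standard library, on the assumption (not confirmed in this thread) that `to_csv` in pandas 1.1.x ends up writing `str` data to the handle it was given:

```python
from io import BytesIO, StringIO

text_row = "a,b\n"

# A binary buffer rejects str input; this mirrors the error reported above.
binary_buffer = BytesIO()
try:
    binary_buffer.write(text_row)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Encoding the text first, or using a text buffer, both work.
binary_buffer.write(text_row.encode("utf-8"))

text_buffer = StringIO()
text_buffer.write(text_row)
```

So the buffer type has to match the kind of data the writer emits: `str` for `StringIO`, bytes for `BytesIO`.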

The background on this is that we are currently in the process of upgrading Kotori to Python 3 (yeah, we are late to the game). However, coming from this, we can confirm it worked when using Pandas 0.18.1 on Python 2 the other day.

The relevant code is

Thanks already for looking into this!

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.6.9.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Thu Jun 18 21:21:34 PDT 2020; root:xnu-4570.71.82.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.3
numpy : 1.19.2
pytz : 2018.9
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : 4.6.9
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.8
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@amotl amotl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 20, 2020
@twoertwein
Member

#35129 adds support for binary file handles but it is part of the to-be-released 1.2 version.

@twoertwein
Member

In the meantime, you can wrap your io.BytesIO object using io.TextIOWrapper (pandas 1.2 will do the same internally).
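A minimal sketch of that workaround might look as follows. The sample data and the variable names are made up for illustration; the `detach()` call is an extra precaution not mentioned in the thread, added because a `TextIOWrapper` closes its underlying buffer when it is garbage collected:

```python
from io import BytesIO, TextIOWrapper

import pandas as pd

# Illustrative data, not taken from the issue.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

buffer = BytesIO()

# The wrapper encodes the text written by to_csv into the binary buffer.
# newline="" turns off newline translation inside the wrapper.
wrapper = TextIOWrapper(buffer, encoding="utf-8", newline="")
df.to_csv(wrapper, index=False)

# Push pending text into the buffer, then detach the wrapper so that the
# buffer stays open after the wrapper goes away.
wrapper.flush()
wrapper.detach()

payload = buffer.getvalue()  # bytes
```

After `detach()`, the wrapper can no longer be used, but `buffer` remains a regular open `BytesIO`.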

@jreback jreback added IO CSV read_csv, to_csv Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 20, 2020
@jreback jreback added this to the 1.2 milestone Oct 20, 2020
@jreback jreback closed this as completed Oct 20, 2020
@amotl
Author

amotl commented Oct 21, 2020

Dear Torsten,

thanks for your quick answer.

The problem with using an io.TextIOWrapper across the board is that it would probably work for all text-like output formats, but we would run into problems when writing binary formats such as Excel through it.

For now, we will be using StringIO objects explicitly for rendering into text-like output formats using to_csv, to_json and to_html. For all others, we will keep the BytesIO.

The idea is to be able to get hold of the in-memory payload content in a generic fashion by using .getvalue() after dispatching to the specific rendering method.

With kind regards,
Andreas.

P.S.: We will check back after upgrading to Pandas 1.2 in December and see how that goes. In the meanwhile, we will use the workaround as outlined above, explicitly using either StringIO or BytesIO, depending on the character of the output format.
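The dispatch described in this comment could be sketched roughly like this. The `render` function, the `TEXT_WRITERS` table and the sample data are hypothetical names made up for illustration; they are not taken from the Kotori code base:

```python
from io import BytesIO, StringIO

import pandas as pd

# Hypothetical mapping from file suffix to a text-producing pandas writer.
TEXT_WRITERS = {
    "csv": lambda df, buf: df.to_csv(buf, index=False),
    "json": lambda df, buf: df.to_json(buf, orient="records"),
    "html": lambda df, buf: df.to_html(buf, index=False),
}


def render(df, suffix):
    """Render df in memory; returns str for text formats, bytes for Excel."""
    if suffix in TEXT_WRITERS:
        buffer = StringIO()
        TEXT_WRITERS[suffix](df, buffer)
    elif suffix == "xlsx":
        # Requires an Excel engine such as openpyxl or xlsxwriter.
        buffer = BytesIO()
        df.to_excel(buffer, index=False)
    else:
        raise ValueError(f"Unsupported suffix: {suffix!r}")
    # Both buffer types expose getvalue(), so the caller stays generic.
    return buffer.getvalue()


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # illustrative data
csv_payload = render(df, "csv")
html_payload = render(df, "html")
```

The trade-off of this variant is that the return type depends on the format: callers get `str` for CSV, JSON and HTML, but `bytes` for Excel.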

@twoertwein
Member

you could probably do something like:

        if suffix in ['csv', 'txt']:
            # http://pandas.pydata.org/pandas-docs/stable/io.html#io-store-in-csv
            charset = 'utf-8'
            wrapper = TextIOWrapper(buffer, encoding=charset)
            df.to_csv(wrapper, header=True, index=False, encoding=charset, date_format='%Y-%m-%dT%H:%M:%S.%fZ')
            wrapper.flush()  # make sure that TextIOWrapper writes the content to buffer

@amotl
Author

amotl commented Oct 21, 2020

Dear Torsten,

thanks for taking the time. I am now doing it like that using StringIO:

        if suffix in ['csv', 'txt']:
            # http://pandas.pydata.org/pandas-docs/stable/io.html#io-store-in-csv
            buffer = StringIO()
            df.to_csv(buffer, header=True, index=False, encoding='utf-8', date_format='%Y-%m-%dT%H:%M:%S.%fZ')
            charset = 'utf-8'

Is there any particular advantage using TextIOWrapper instead?

With kind regards,
Andreas.

@twoertwein
Member

The only advantage of BytesIO + TextIOWrapper in your setup is that getvalue() always returns bytes, rather than sometimes bytes and sometimes strings.
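The difference in return types can be illustrated without pandas. In this sketch, the standard library's `csv.writer` stands in for `to_csv` as an arbitrary text-mode writer; that substitution is an assumption made to keep the example self-contained:

```python
import csv
from io import BytesIO, StringIO, TextIOWrapper

rows = [["a", "b"], [1, 3]]  # illustrative data

# Variant 1: StringIO, so getvalue() returns str.
text_buffer = StringIO()
csv.writer(text_buffer).writerows(rows)
text_payload = text_buffer.getvalue()

# Variant 2: BytesIO behind a TextIOWrapper, so getvalue() returns bytes.
binary_buffer = BytesIO()
wrapper = TextIOWrapper(binary_buffer, encoding="utf-8", newline="")
csv.writer(wrapper).writerows(rows)
wrapper.flush()   # write pending text through to the BytesIO
wrapper.detach()  # keep the BytesIO open once the wrapper is discarded
binary_payload = binary_buffer.getvalue()

print(type(text_payload).__name__, type(binary_payload).__name__)
```

With the second variant, code that also handles binary formats such as Excel can treat every payload as bytes.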

@amotl
Author

amotl commented Oct 21, 2020

When you call getvalue you always get bytes and not sometimes bytes and sometimes strings.

So true.

I've fiddled a bit with that variant in the code and the test suite on the other side, and I am now using BytesIO + TextIOWrapper as you suggested. Thank you so much!


3 participants