BUG: ValueError converting dense categorical series to sparse when fill_value not in series #49987

Closed
3 tasks done
Tracked by #3
PGijsbers opened this issue Dec 1, 2022 · 8 comments · Fixed by #50743
Labels: Bug, Categorical, Needs Tests, Sparse

Comments

@PGijsbers

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import SparseDtype

df = pd.DataFrame([["a", 0],["b", 1], ["b", 2]], columns=["A","B"])
df["A"].astype(SparseDtype("category"))
# or: df["A"].astype(SparseDtype("category", fill_value="not_in_series"))

Issue Description

I am unable to convert a dense categorical series to a sparse one when I leave fill_value at its default or set it to a value that does not exist in the series.

Stacktrace:


ValueError Traceback (most recent call last)
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/core/formatters.py:706, in PlainTextFormatter.__call__(self, obj)
699 stream = StringIO()
700 printer = pretty.RepresentationPrinter(stream, self.verbose,
701 self.max_width, self.newline,
702 max_seq_length=self.max_seq_length,
703 singleton_pprinters=self.singleton_printers,
704 type_pprinters=self.type_printers,
705 deferred_pprinters=self.deferred_printers)
--> 706 printer.pretty(obj)
707 printer.flush()
708 return stream.getvalue()

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
407 return meth(obj, self, cycle)
408 if cls is not object
409 and callable(cls.__dict__.get('__repr__')):
--> 410 return _repr_pprint(obj, self, cycle)
412 return _default_pprint(obj, self, cycle)
413 finally:

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
776 """A pprint that just redirects to the normal repr function."""
777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
779 lines = output.splitlines()
780 with p.group():

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/core/series.py:1550, in Series.__repr__(self)
1548 # pylint: disable=invalid-repr-returned
1549 repr_params = fmt.get_series_repr_params()
-> 1550 return self.to_string(**repr_params)

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/core/series.py:1643, in Series.to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
1597 """
1598 Render a string representation of the Series.
1599
(...)
1629 String representation of Series if buf=None, otherwise None.
1630 """
1631 formatter = fmt.SeriesFormatter(
1632 self,
1633 name=name,
(...)
1641 max_rows=max_rows,
1642 )
-> 1643 result = formatter.to_string()
1645 # catch contract violations
1646 if not isinstance(result, str):

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:393, in SeriesFormatter.to_string(self)
390 return f"{type(self.series).__name__}([], {footer})"
392 fmt_index, have_header = self._get_formatted_index()
--> 393 fmt_values = self._get_formatted_values()
395 if self.is_truncated_vertically:
396 n_header_rows = 0

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:377, in SeriesFormatter._get_formatted_values(self)
376 def _get_formatted_values(self) -> list[str]:
--> 377 return format_array(
378 self.tr_series._values,
379 None,
380 float_format=self.float_format,
381 na_rep=self.na_rep,
382 leading_space=self.index,
383 )

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1326, in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
1311 digits = get_option("display.precision")
1313 fmt_obj = fmt_klass(
1314 values,
1315 digits=digits,
(...)
1323 quoting=quoting,
1324 )
-> 1326 return fmt_obj.get_result()

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1357, in GenericArrayFormatter.get_result(self)
1356 def get_result(self) -> list[str]:
-> 1357 fmt_values = self._format_strings()
1358 return _make_fixed_width(fmt_values, self.justify)

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1658, in ExtensionArrayFormatter._format_strings(self)
1656 array = values._internal_get_values()
1657 else:
-> 1658 array = np.asarray(values)
1660 fmt_values = format_array(
1661 array,
1662 formatter,
(...)
1670 quoting=self.quoting,
1671 )
1672 return fmt_values

ValueError: object __array__ method not producing an array
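
For context, the last frame of the traceback is a plain `np.asarray(values)` call, and the message is the one NumPy raises when an object's `__array__` method returns something other than an `ndarray`. The sketch below triggers the same exception type with a hypothetical `NotAnArray` class. It is only an illustration of the NumPy-level condition the message describes, not a reproduction of the pandas bug, and `asarray_error` is a helper made up for this example.

```python
import numpy as np


class NotAnArray:
    """Hypothetical class whose __array__ breaks the NumPy array protocol."""

    def __array__(self, dtype=None, copy=None):
        # The protocol expects an ndarray; returning a list violates it.
        return [1, 2, 3]


def asarray_error(obj):
    """Return the ValueError np.asarray raises for obj, or None on success."""
    try:
        np.asarray(obj)
    except ValueError as exc:
        return exc
    return None


err = asarray_error(NotAnArray())
print(type(err).__name__, err)
```

Note that in the report the exception surfaces while IPython builds the repr of the result (`Series.__repr__` is on the stack), so the traceback alone does not show whether `astype` itself failed.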

Expected Behavior

I expect it to "just work", similar to providing a fill value which does exist in the series, or how it works with other dtypes:

import pandas as pd
from pandas import SparseDtype
df = pd.DataFrame([["a", 0],["b", 1], ["b", 2]], columns=["A","B"])

# works, since "a" is a value present in the series
df["A"].astype(SparseDtype("category", fill_value="a"))  

# also works, despite -1 not being present in the series
df["B"].astype(SparseDtype(int, fill_value=-1))  
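
As a rough check of what "just works" means for the integer case, the sketch below inspects the result through the `.sparse` accessor. It assumes only the column `B` from the example above. Because `-1` never occurs in the data, every value has to be stored explicitly, and converting back to dense returns the original values.

```python
import pandas as pd

s = pd.Series([0, 1, 2], name="B")

# -1 is absent from the data, so nothing can be dropped as "fill".
sparse = s.astype(pd.SparseDtype(int, fill_value=-1))

print(sparse.sparse.fill_value)  # the fill value that was requested
print(sparse.sparse.npoints)     # number of explicitly stored values
print(sparse.sparse.density)     # stored values / total length

# Converting back to dense should give the original values.
dense_again = sparse.sparse.to_dense()
print(dense_again.tolist())
```

The same round trip is what I would expect from the categorical column.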

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.3.1
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@PGijsbers added the Bug and Needs Triage labels Dec 1, 2022
@rhshadrach
Member

Thanks for the report! Agreed the case where fill_value is specified but does not exist looks like a bug. For

df["A"].astype(SparseDtype("category"))

on the other hand, what's the expectation on unspecified values? While it's coming from a dense object so there are no unspecified values, I would think we still need a default.
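
To make the role of the fill value concrete, here is a toy model of sparse storage in plain Python. It is not how pandas implements `SparseArray`; `is_fill`, `to_sparse` and `to_dense` are hypothetical helpers written for this illustration. It shows that a dense source never needs "unspecified" positions: when the fill value is absent from the data, every element is simply stored explicitly.

```python
import math


def is_fill(value, fill_value):
    """True if value counts as the fill value (NaN is matched to NaN)."""
    both_nan = (
        isinstance(value, float)
        and isinstance(fill_value, float)
        and math.isnan(value)
        and math.isnan(fill_value)
    )
    return both_nan or value == fill_value


def to_sparse(values, fill_value):
    """Return (length, positions, stored), keeping only non-fill entries."""
    kept = [(i, v) for i, v in enumerate(values) if not is_fill(v, fill_value)]
    positions = [i for i, _ in kept]
    stored = [v for _, v in kept]
    return len(values), positions, stored


def to_dense(length, positions, stored, fill_value):
    """Rebuild the dense list: fill everywhere, then restore stored entries."""
    dense = [fill_value] * length
    for i, v in zip(positions, stored):
        dense[i] = v
    return dense


data = ["a", "b", "b"]
for fill in ("b", "not_in_series", float("nan")):
    length, positions, stored = to_sparse(data, fill)
    # The round trip holds whether or not the fill value occurs in the data.
    assert to_dense(length, positions, stored, fill) == data
    print(repr(fill), positions, stored)
```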

@rhshadrach added the Sparse and Categorical labels Dec 3, 2022
@PGijsbers
Author

The current default of fill_value is float("nan"), which seems a reasonable default to me given that's how None is represented in a dense categorical series. The problem is that if no such None/nan value exists in the data, an error gets raised (just as when the fill_value is set explicitly to a value not present in the data). I would even assume that if the bug is fixed for an explicitly set fill_value, the same patch also fixes the error when it is left at the default.

For an alternative default, the mode of the data also seems reasonable to me as it compresses the data, but that is not in line with the default for other types (e.g., integers default to fill_value=0 regardless of data). For that reason, I would recommend sticking with float("nan") as default.
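
A small sketch of that trade-off, using made-up integer data rather than the categorical column from the report: the default fill value comes from the subtype alone, while using the mode as fill value stores fewer points. The series `s` and the variable names are only for illustration.

```python
import math

import pandas as pd

# Default fill values depend on the subtype, not on the data.
int_default = pd.SparseDtype(int).fill_value
bool_default = pd.SparseDtype(bool).fill_value
float_default = pd.SparseDtype(float).fill_value
print(int_default, bool_default, float_default)

# Hypothetical data in which 1 is the most frequent value.
s = pd.Series([1, 1, 1, 2])
mode = int(s.mode().iloc[0])

by_mode = s.astype(pd.SparseDtype(int, fill_value=mode))
by_default = s.astype(pd.SparseDtype(int))  # default fill, absent from s

# Density is the share of values that are stored explicitly.
print(by_mode.sparse.density, by_default.sparse.density)
```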

@rhshadrach
Member

rhshadrach commented Dec 6, 2022

Thanks - makes sense. Further investigations and PRs to fix are most welcome!

@rhshadrach removed the Needs Triage label Dec 6, 2022
@PGijsbers
Author

At this time I won't commit to that, unfortunately. So if anyone wants to work on this - go ahead! :) If there's no progress by the time I do have time, I'll leave a message.

@yanxiaole

It seems this issue no longer exists on the latest main branch. @PGijsbers, could you confirm?

@PGijsbers
Author

Just re-ran the example myself on latest, and it looks like it works now 🥳 thanks!

@rhshadrach reopened this Jan 2, 2023
@rhshadrach
Member

Unless tests have already been added in the PR that (inadvertently?) fixed this, this needs tests before the issue is closed.

@rhshadrach added the Needs Tests label Jan 2, 2023
@luke396
Contributor

luke396 commented Jan 14, 2023

take
