BUG: ValueError converting dense categorical series to sparse when `fill_value` not in series #49987

PGijsbers · 2022-12-01T13:27:56Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import SparseDtype

df = pd.DataFrame([["a", 0],["b", 1], ["b", 2]], columns=["A","B"])
df["A"].astype(SparseDtype("category"))
# or: df["A"].astype(SparseDtype("category", fill_value="not_in_series"))

Issue Description

I am unable to convert a dense categorical series to a sparse one when I leave fill_value at its default or set it to a value that does not exist in the series.

Stacktrace:

ValueError Traceback (most recent call last)
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/core/formatters.py:706, in PlainTextFormatter.__call__(self, obj)
699 stream = StringIO()
700 printer = pretty.RepresentationPrinter(stream, self.verbose,
701 self.max_width, self.newline,
702 max_seq_length=self.max_seq_length,
703 singleton_pprinters=self.singleton_printers,
704 type_pprinters=self.type_printers,
705 deferred_pprinters=self.deferred_printers)
--> 706 printer.pretty(obj)
707 printer.flush()
708 return stream.getvalue()

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
407 return meth(obj, self, cycle)
408 if cls is not object
409 and callable(cls.__dict__.get('__repr__')):
--> 410 return _repr_pprint(obj, self, cycle)
412 return _default_pprint(obj, self, cycle)
413 finally:

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
776 """A pprint that just redirects to the normal repr function."""
777 # Find newlines and replace them with p.break()
--> 778 output = repr(obj)
779 lines = output.splitlines()
780 with p.group():

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/core/series.py:1550, in Series.__repr__(self)
1548 # pylint: disable=invalid-repr-returned
1549 repr_params = fmt.get_series_repr_params()
-> 1550 return self.to_string(**repr_params)

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/core/series.py:1643, in Series.to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
1597 """
1598 Render a string representation of the Series.
1599
(...)
1629 String representation of Series if buf=None, otherwise None.
1630 """
1631 formatter = fmt.SeriesFormatter(
1632 self,
1633 name=name,
(...)
1641 max_rows=max_rows,
1642 )
-> 1643 result = formatter.to_string()
1645 # catch contract violations
1646 if not isinstance(result, str):

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:393, in SeriesFormatter.to_string(self)
390 return f"{type(self.series).__name__}([], {footer})"
392 fmt_index, have_header = self._get_formatted_index()
--> 393 fmt_values = self._get_formatted_values()
395 if self.is_truncated_vertically:
396 n_header_rows = 0

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:377, in SeriesFormatter._get_formatted_values(self)
376 def _get_formatted_values(self) -> list[str]:
--> 377 return format_array(
378 self.tr_series._values,
379 None,
380 float_format=self.float_format,
381 na_rep=self.na_rep,
382 leading_space=self.index,
383 )

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1326, in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
1311 digits = get_option("display.precision")
1313 fmt_obj = fmt_klass(
1314 values,
1315 digits=digits,
(...)
1323 quoting=quoting,
1324 )
-> 1326 return fmt_obj.get_result()

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1357, in GenericArrayFormatter.get_result(self)
1356 def get_result(self) -> list[str]:
-> 1357 fmt_values = self._format_strings()
1358 return _make_fixed_width(fmt_values, self.justify)

File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1658, in ExtensionArrayFormatter._format_strings(self)
1656 array = values._internal_get_values()
1657 else:
-> 1658 array = np.asarray(values)
1660 fmt_values = format_array(
1661 array,
1662 formatter,
(...)
1670 quoting=self.quoting,
1671 )
1672 return fmt_values

ValueError: object __array__ method not producing an array

Expected Behavior

I expect it to "just work", similar to providing a fill value which does exist in the series, or how it works with other dtypes:

import pandas as pd
from pandas import SparseDtype
df = pd.DataFrame([["a", 0],["b", 1], ["b", 2]], columns=["A","B"])

# works, since "a" is a value present in the series
df["A"].astype(SparseDtype("category", fill_value="a"))  

# also works, despite -1 not being present in the series
df["B"].astype(SparseDtype(int, fill_value=-1))

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.3.1
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

rhshadrach · 2022-12-03T13:04:45Z

Thanks for the report! Agreed the case where fill_value is specified but does not exist looks like a bug. For

df["A"].astype(SparseDtype("category"))

on the other hand, what's the expectation on unspecified values? While it's coming from a dense object so there are no unspecified values, I would think we still need a default.

PGijsbers · 2022-12-05T09:32:14Z

The current default of fill_value is float("nan"), which seems a reasonable default to me given that's how None is represented in a dense categorical series. The problem is that if no such None/nan value exists in the data, an error gets raised (just as when fill_value is explicitly set to a value not present in the data). I would even assume that if the bug is fixed for an explicitly set fill_value, the same patch also fixes the error when it is left at the default.

For an alternative default, the mode of the data also seems reasonable to me as it compresses the data, but that is not in line with the default for other types (e.g., integers default to fill_value=0 regardless of data). For that reason, I would recommend sticking with float("nan") as default.
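The trade-off between a fixed default and a data-dependent one can be sketched as follows. This is a minimal illustration, not pandas internals: it uses integer data rather than categoricals so that it does not depend on the behavior reported in this issue, and the sample values are made up for the example.

```python
import numpy as np
import pandas as pd
from pandas import SparseDtype

# Defaults depend only on the subtype, never on the data.
int_default = SparseDtype(int).fill_value      # 0
float_default = SparseDtype(float).fill_value  # nan

# Illustrative data (not from the report): 1 is the most frequent value.
values = pd.Series([0, 1, 1, 1])

# With the default fill value (0), every 1 has to be stored explicitly.
default_sparse = values.astype(SparseDtype(int))

# With the mode as fill value, only the single 0 is stored explicitly.
mode_value = values.mode().iloc[0]
mode_sparse = values.astype(SparseDtype(int, fill_value=mode_value))

print(default_sparse.sparse.npoints, mode_sparse.sparse.npoints)
```

Here `sparse.npoints` counts the entries that differ from the fill value, so a smaller number means better compression. The mode gives the smaller count, but only a fixed default is predictable without looking at the data.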

rhshadrach · 2022-12-06T21:19:56Z

Thanks - makes sense. Further investigations and PRs to fix are most welcome!

PGijsbers · 2022-12-07T10:43:33Z

At this time I won't commit to that, unfortunately. So if anyone wants to work on this - go ahead! :) If there's no progress by the time I do have time, I'll leave a message.

yanxiaole · 2022-12-30T13:45:33Z

It seems this issue no longer exists on the latest main branch. @PGijsbers, could you confirm?

PGijsbers · 2023-01-02T13:47:01Z

Just re-ran the example myself on latest, and it looks like it works now 🥳 thanks!

rhshadrach · 2023-01-02T13:48:28Z

Unless tests have already been added in the PR that (inadvertently?) fixed this, this needs tests before the issue is closed.
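A regression check along these lines might look like the sketch below. It is not the test that was eventually merged; `check_sparse_round_trip` is a hypothetical helper, and it exercises an integer series instead of a categorical one, on the assumption (suggested by the later linked change #53160) that newer pandas versions expect a NumPy dtype as the sparse subtype.

```python
import pandas as pd
from pandas import SparseDtype


def check_sparse_round_trip(values, subtype, fill_value):
    """Convert to sparse with the given fill value, render it, and densify again.

    Hypothetical helper for illustration; not part of pandas.
    """
    dense = pd.Series(values)
    sparse = dense.astype(SparseDtype(subtype, fill_value=fill_value))
    # The reported failure surfaced while rendering, so repr() is part of the check.
    rendered = repr(sparse)
    restored = sparse.sparse.to_dense()
    return sparse, rendered, restored


# The fill value -1 does not occur in the data, mirroring the reported scenario.
sparse, rendered, restored = check_sparse_round_trip([0, 1, 2], int, fill_value=-1)
assert sparse.sparse.fill_value == -1
assert restored.tolist() == [0, 1, 2]
```

Including `repr()` matters because the conversion in the original report did not fail in `astype` itself; the traceback shows the error being raised only when the result was displayed.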

luke396 · 2023-01-14T07:49:54Z

take

PGijsbers added the Bug and Needs Triage labels Dec 1, 2022

rhshadrach added the Sparse and Categorical labels Dec 3, 2022

rhshadrach removed the Needs Triage label Dec 6, 2022

PGijsbers closed this as completed Jan 2, 2023

rhshadrach reopened this Jan 2, 2023

rhshadrach added the Needs Tests label Jan 2, 2023

phofl mentioned this issue Jan 10, 2023

TST: Fixed issues that need tests noatamir/pyladies-berlin-sprints#3

Open

17 tasks

github-actions bot assigned luke396 Jan 14, 2023

luke396 mentioned this issue Jan 14, 2023

TST: Test dtype sparseDtype with specificd fill value #50743

Merged

5 tasks

phofl closed this as completed in #50743 Jan 16, 2023

mroeschke mentioned this issue May 10, 2023

BUG: SparseDtype requires numpy dtype #53160

Merged

5 tasks
