QUERY: difference between namedtuples and objects produced by _make_tuple_bunch? #22450

Closed
deanm0000 opened this issue Jan 31, 2025 · 9 comments · Fixed by #22494
Labels
query A question or suggestion that requires further information scipy.stats
Milestone

Comments

@deanm0000

Describe your issue.

shapiro returns a namedtuple:

ShapiroResult = namedtuple('ShapiroResult', ('statistic', 'pvalue'))

whereas wilcoxon does this:

def wilcoxon_result_unpacker(res):
    if hasattr(res, 'zstatistic'):
        return res.statistic, res.pvalue, res.zstatistic
    else:
        return res.statistic, res.pvalue

which happens after the result has gone through the process of being made a namedtuple in WilcoxonResult.

This issue came up from a polars user because polars treats a namedtuple differently than a regular tuple, leading to some confusion.

Is there a reason I'm not seeing for this inconsistency or was it just unintentional? (I'm guessing someone who doesn't use namedtuples did the unpacking)

Reproducing Code Example

NA

Error message

NA

SciPy/NumPy/Python version and system information

e3c6a05 branch
@deanm0000 deanm0000 added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Jan 31, 2025
@mdhaber
Contributor

mdhaber commented Jan 31, 2025

Like shapiro, wilcoxon also returns a namedtuple-like object.

from scipy.stats import wilcoxon
res = wilcoxon([1, 2, 3])
res
# WilcoxonResult(statistic=0.0, pvalue=0.25)
res.statistic  # 0.0 
res.pvalue  # 0.25
statistic, pvalue = res
statistic  # 0.0 
pvalue  # 0.25

See _make_tuple_bunch. This was needed to add an additional attribute (zstatistic; see gh-2625 and gh-15632) in a way that would not break backward compatibility.

The manual unpacking that you showed (wilcoxon_result_unpacker) is separate, and needed during the process of dealing with nan_policy='omit', etc. It does not affect the type of the object returned by the public function.
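To make the idea behind _make_tuple_bunch concrete, here is a hand-written sketch. The class name MiniResult and the example values are hypothetical, and this is a simplification for illustration rather than SciPy's implementation: a tuple subclass whose tuple part has exactly two elements, with an extra attribute stored outside the tuple.

```python
class MiniResult(tuple):
    """Hypothetical, simplified stand-in for a tuple bunch (not SciPy code)."""

    _fields = ("statistic", "pvalue")

    def __new__(cls, statistic, pvalue, **extra):
        # Only the two positional fields become part of the tuple itself.
        return super().__new__(cls, (statistic, pvalue))

    def __init__(self, statistic, pvalue, **extra):
        # Extra values live in the instance __dict__, outside the tuple.
        self.__dict__.update(extra)

    @property
    def statistic(self):
        return self[0]

    @property
    def pvalue(self):
        return self[1]


res = MiniResult(0.0, 0.25, zstatistic=-1.6)
statistic, pvalue = res          # two-element unpacking still works
print(len(res), res.zstatistic)  # length of the tuple part, then the extra attribute
```

Because the extra attribute never enters the tuple, code that unpacks or indexes the result sees the same two values as before.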

Please provide a MRE of the problem you're experiencing.

@mdhaber mdhaber changed the title BUG: Inconsistent return type between shapiro and wilcoxon QUERY: different details in the implementations of shapiro and wilcoxon? Jan 31, 2025
@mdhaber mdhaber changed the title QUERY: different details in the implementations of shapiro and wilcoxon? QUERY: difference between nametuples and objects produced by _make_tuple_bunch? Jan 31, 2025
@mdhaber mdhaber changed the title QUERY: difference between nametuples and objects produced by _make_tuple_bunch? QUERY: difference between namedtuples and objects produced by _make_tuple_bunch? Jan 31, 2025
@mdhaber mdhaber removed the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Jan 31, 2025
@phobson

phobson commented Jan 31, 2025

Hey @mdhaber, this originates from an issue I brought up in the polars discord where namedtuples from scipy.stats and the Wilcoxon "tuple bunches" get unpacked differently.

This is certainly something that I can work around, but here's an MRE that illustrates what I'm seeing via polars.

I generated this example with scipy 1.15.1:

import numpy
import polars as pl
import polars.selectors as cs
from scipy import stats

# setup dataframe of fake water quality data
N = 300
rs = numpy.random.RandomState(37)
df = pl.DataFrame({
    "state": rs.choice(["OR", "WA"], size=N),
    "landuse": rs.choice(["RES", "COM"], size=N),
    "pollutant": rs.choice(["Cu", "Pb"], size=N),
    "infl": rs.lognormal( 0.0, 1.25, size=N),
    "effl": rs.lognormal(-0.5, 2.00, size=N),
})

shapiro = (
    df.with_columns(obs=pl.struct("infl", "effl"))
      .group_by(pl.col("state"), pl.col("landuse"), pl.col("pollutant"))
      .agg(cs.by_name("infl", "effl").map_batches(stats.shapiro, returns_scalar=True))
)

And that gives something that looks like this (truncated -- note the struct columns):

┌───────┬─────────┬───────────┬──────────────────────┬───────────────────────┐
│ state ┆ landuse ┆ pollutant ┆ infl                 ┆ effl                  │
│ ---   ┆ ---     ┆ ---       ┆ ---                  ┆ ---                   │
│ str   ┆ str     ┆ str       ┆ struct[2]            ┆ struct[2]             │
╞═══════╪═════════╪═══════════╪══════════════════════╪═══════════════════════╡
│ OR    ┆ RES     ┆ Cu        ┆ {0.817417,0.00003}   ┆ {0.655478,4.5343e-8}  │
│ OR    ┆ RES     ┆ Pb        ┆ {0.62145,3.7383e-8}  ┆ {0.569183,8.0773e-9}  │
└───────┴─────────┴───────────┴──────────────────────┴───────────────────────┘

Compare that with the Wilcoxon test:

wilcoxon = (
    df.with_columns(obs=pl.struct("infl", "effl"))
      .group_by(pl.col("state"), pl.col("landuse"), pl.col("pollutant"))
      .agg(
         stat=pl.col("obs").map_batches(
              lambda g: stats.wilcoxon(g.struct.field("infl"), g.struct.field("effl")),
              returns_scalar=True
         )
      )
)

Here note the list column

┌───────┬─────────┬───────────┬───────────────────┐
│ state ┆ landuse ┆ pollutant ┆ stat              │
│ ---   ┆ ---     ┆ ---       ┆ ---               │
│ str   ┆ str     ┆ str       ┆ list[f64]         │
╞═══════╪═════════╪═══════════╪═══════════════════╡
│ OR    ┆ RES     ┆ Cu        ┆ [319.0, 0.633098] │
│ OR    ┆ RES     ┆ Pb        ┆ [274.0, 0.697783] │
└───────┴─────────┴───────────┴───────────────────┘

That difference might not seem like much, but with namedtuples getting converted to structs via polars, you can do some really neat things:

shapiro.select(pl.all(), pl.col("infl").struct.unnest())
┌───────┬─────────┬───────────┬─────────────────────┬──────────────────────┬───────────┬───────────┐
│ state ┆ landuse ┆ pollutant ┆ infl                ┆ effl                 ┆ statistic ┆ pvalue    │
│ ---   ┆ ---     ┆ ---       ┆ ---                 ┆ ---                  ┆ ---       ┆ ---       │
│ str   ┆ str     ┆ str       ┆ struct[2]           ┆ struct[2]            ┆ f64       ┆ f64       │
╞═══════╪═════════╪═══════════╪═════════════════════╪══════════════════════╪═══════════╪═══════════╡
│ OR    ┆ RES     ┆ Cu        ┆ {0.817417,0.00003}  ┆ {0.655478,4.5343e-8} ┆ 0.817417  ┆ 0.00003   │
│ OR    ┆ RES     ┆ Pb        ┆ {0.62145,3.7383e-8} ┆ {0.569183,8.0773e-9} ┆ 0.62145   ┆ 3.7383e-8 │
└───────┴─────────┴───────────┴─────────────────────┴──────────────────────┴───────────┴───────────┘

or

shapiro.filter(pl.col("infl").struct.field("statistic") > 0.7)
┌───────┬─────────┬───────────┬──────────────────────┬───────────────────────┐
│ state ┆ landuse ┆ pollutant ┆ infl                 ┆ effl                  │
│ ---   ┆ ---     ┆ ---       ┆ ---                  ┆ ---                   │
│ str   ┆ str     ┆ str       ┆ struct[2]            ┆ struct[2]             │
╞═══════╪═════════╪═══════════╪══════════════════════╪═══════════════════════╡
│ OR    ┆ RES     ┆ Cu        ┆ {0.817417,0.00003}   ┆ {0.655478,4.5343e-8}  │
│ WA    ┆ RES     ┆ Cu        ┆ {0.75436,0.000001}   ┆ {0.463176,1.2552e-10} │
│ WA    ┆ COM     ┆ Pb        ┆ {0.708217,1.3313e-7} ┆ {0.539097,5.0447e-10} │
└───────┴─────────┴───────────┴──────────────────────┴───────────────────────┘

When the Wilcoxon results get unpacked as a list, the equivalent queries aren't as tidy/readable, IMO.

To be clear, this isn't a huge deal. I can write a little wrapper for this. But it did catch me off guard that Wilcoxon was the only scipy.stats function that behaved this way (that I've come across so far).

My first & very naive thought would be that if the z-statistic is not generated, it exists in a standard namedtuple as None. I'm sure I'm missing something as to why that's trickier than it sounds.

@mdhaber
Contributor

mdhaber commented Jan 31, 2025

When wilcoxon was created, it returned a two-arg tuple without zstatistic. We can't add a third element to a standard tuple (named or not) without a backward-incompatible change.
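A small sketch of why a third tuple element is backward-incompatible, even when it is None. The names OldResult, NewResult, and legacy_caller are hypothetical and only stand in for the two-element and a would-be three-element return type:

```python
from collections import namedtuple

OldResult = namedtuple("OldResult", ["statistic", "pvalue"])
# Hypothetical "extended" result with zstatistic as a third tuple element.
NewResult = namedtuple("NewResult", ["statistic", "pvalue", "zstatistic"])


def legacy_caller(res):
    # Typical downstream code written against the two-element return value.
    statistic, pvalue = res
    return statistic, pvalue


legacy_caller(OldResult(0.0, 0.25))  # works as before

try:
    legacy_caller(NewResult(0.0, 0.25, None))
    broke = False
except ValueError:
    # "too many values to unpack": the length changed, regardless of the value.
    broke = True

print(broke)
```

Any existing user code that unpacks or checks the length of the result would fail in the same way, which is why the extra value has to live outside the tuple.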

IIUC (from what you described), polars has special handling for genuine namedtuples that allows the results to look better.

  • If polars is looking at whether objects are namedtuples, can it look at whether the objects behave like namedtuples instead?
  • If polars is already looking at whether the objects behave like namedtuples, what attribute/method are our Bunches missing?

_make_tuple_bunch is used in many places, though, so it's puzzling that you have only found it in wilcoxon (if you have tried several stats functions). What happens with ttest_1samp? That also uses _make_tuple_bunch.

@nickodell
Member

If polars is already looking at whether the objects behave like namedtuples, what attribute/method are our Bunches missing?

If I'm understanding the Polars code correctly, it's missing _field_defaults and _replace.

Here's where the feature was added: pola-rs/polars#5057

Here's where this check is defined. The check has changed slightly, but none of the new checks implicate _make_tuple_bunch. https://github.com/pola-rs/polars/blob/b6e7ef8c1f26693346117785ca3e4cd8a52a394a/py-polars/polars/_utils/construction/utils.py#L50
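For readers who don't want to follow the link, a duck-typed check of this kind can be sketched roughly as below. This is an approximation based on the attributes named above, not a copy of the Polars code; the function name looks_like_namedtuple and the class BareBunch are hypothetical.

```python
from collections import namedtuple


def looks_like_namedtuple(cls):
    """Hypothetical duck-typed check; an approximation, not Polars' code."""
    required = ("_fields", "_field_defaults", "_replace")
    if not all(hasattr(cls, name) for name in required):
        return False
    # Field names of a namedtuple are always strings.
    return all(isinstance(name, str) for name in cls._fields)


Point = namedtuple("Point", ["x", "y"])


class BareBunch(tuple):
    # Has _fields, but lacks _field_defaults and _replace.
    _fields = ("x", "y")


print(looks_like_namedtuple(Point))      # True
print(looks_like_namedtuple(BareBunch))  # False
```

A check like this only tests for the presence of the attributes, so a class that merely defines them is accepted even if they are never called.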

As an experiment, I tried adding _field_defaults and _replace to the class dict in _make_tuple_bunch(), with both set to None, and this causes Polars to detect them as namedtuples.

# Note: This code is adapted from CPython:Lib/collections/__init__.py
def _make_tuple_bunch(typename, field_names, extra_field_names=None,
                      module=None):
    """
    Create a namedtuple-like class with additional attributes.

    This function creates a subclass of tuple that acts like a namedtuple
    and that has additional attributes.

    The additional attributes are listed in `extra_field_names`.  The
    values assigned to these attributes are not part of the tuple.

    The reason this function exists is to allow functions in SciPy
    that currently return a tuple or a namedtuple to return objects
    that have additional attributes, while maintaining backwards
    compatibility.

    This should only be used to enhance *existing* functions in SciPy.
    New functions are free to create objects as return values without
    having to maintain backwards compatibility with an old tuple or
    namedtuple return value.

    Parameters
    ----------
    typename : str
        The name of the type.
    field_names : list of str
        List of names of the values to be stored in the tuple. These names
        will also be attributes of instances, so the values in the tuple
        can be accessed by indexing or as attributes.  At least one name
        is required.  See the Notes for additional restrictions.
    extra_field_names : list of str, optional
        List of names of values that will be stored as attributes of the
        object.  See the notes for additional restrictions.

    Returns
    -------
    cls : type
        The new class.

    Notes
    -----
    There are restrictions on the names that may be used in `field_names`
    and `extra_field_names`:

    * The names must be unique--no duplicates allowed.
    * The names must be valid Python identifiers, and must not begin with
      an underscore.
    * The names must not be Python keywords (e.g. 'def', 'and', etc., are
      not allowed).

    Examples
    --------
    >>> from scipy._lib._bunch import _make_tuple_bunch

    Create a class that acts like a namedtuple with length 2 (with field
    names `x` and `y`) that will also have the attributes `w` and `beta`:

    >>> Result = _make_tuple_bunch('Result', ['x', 'y'], ['w', 'beta'])

    `Result` is the new class.  We call it with keyword arguments to create
    a new instance with given values.

    >>> result1 = Result(x=1, y=2, w=99, beta=0.5)
    >>> result1
    Result(x=1, y=2, w=99, beta=0.5)

    `result1` acts like a tuple of length 2:

    >>> len(result1)
    2
    >>> result1[:]
    (1, 2)

    The values assigned when the instance was created are available as
    attributes:

    >>> result1.y
    2
    >>> result1.beta
    0.5
    """
    if len(field_names) == 0:
        raise ValueError('field_names must contain at least one name')

    if extra_field_names is None:
        extra_field_names = []
    _validate_names(typename, field_names, extra_field_names)

    typename = _sys.intern(str(typename))
    field_names = tuple(map(_sys.intern, field_names))
    extra_field_names = tuple(map(_sys.intern, extra_field_names))

    all_names = field_names + extra_field_names
    arg_list = ', '.join(field_names)
    full_list = ', '.join(all_names)
    repr_fmt = ''.join(('(',
                        ', '.join(f'{name}=%({name})r' for name in all_names),
                        ')'))
    tuple_new = tuple.__new__
    _dict, _tuple, _zip = dict, tuple, zip

    # Create all the named tuple methods to be added to the class namespace

    s = f"""\
def __new__(_cls, {arg_list}, **extra_fields):
    return _tuple_new(_cls, ({arg_list},))

def __init__(self, {arg_list}, **extra_fields):
    for key in self._extra_fields:
        if key not in extra_fields:
            raise TypeError("missing keyword argument '%s'" % (key,))
    for key, val in extra_fields.items():
        if key not in self._extra_fields:
            raise TypeError("unexpected keyword argument '%s'" % (key,))
        self.__dict__[key] = val

def __setattr__(self, key, val):
    if key in {repr(field_names)}:
        raise AttributeError("can't set attribute %r of class %r"
                             % (key, self.__class__.__name__))
    else:
        self.__dict__[key] = val
"""
    del arg_list
    namespace = {'_tuple_new': tuple_new,
                 '__builtins__': dict(TypeError=TypeError,
                                      AttributeError=AttributeError),
                 '__name__': f'namedtuple_{typename}'}
    exec(s, namespace)
    __new__ = namespace['__new__']
    __new__.__doc__ = f'Create new instance of {typename}({full_list})'
    __init__ = namespace['__init__']
    __init__.__doc__ = f'Instantiate instance of {typename}({full_list})'
    __setattr__ = namespace['__setattr__']

    def __repr__(self):
        'Return a nicely formatted representation string'
        return self.__class__.__name__ + repr_fmt % self._asdict()

    def _asdict(self):
        'Return a new dict which maps field names to their values.'
        out = _dict(_zip(self._fields, self))
        out.update(self.__dict__)
        return out

    def __getnewargs_ex__(self):
        'Return self as a plain tuple.  Used by copy and pickle.'
        return _tuple(self), self.__dict__

    # Modify function metadata to help with introspection and debugging
    for method in (__new__, __repr__, _asdict, __getnewargs_ex__):
        method.__qualname__ = f'{typename}.{method.__name__}'

    # Build-up the class namespace dictionary
    # and use type() to build the result class
    class_namespace = {
        '__doc__': f'{typename}({full_list})',
        '_fields': field_names,
        '__new__': __new__,
        '__init__': __init__,
        '__repr__': __repr__,
        '__setattr__': __setattr__,
        '_asdict': _asdict,
        '_extra_fields': extra_field_names,
        '__getnewargs_ex__': __getnewargs_ex__,
        '_field_defaults': None,
        '_replace': None,
    }
    for index, name in enumerate(field_names):

        def _get(self, index=index):
            return self[index]
        class_namespace[name] = property(_get)
    for name in extra_field_names:

        def _get(self, name=name):
            return self.__dict__[name]
        class_namespace[name] = property(_get)

    result = type(typename, (tuple,), class_namespace)

    # For pickling to work, the __module__ variable needs to be set to the
    # frame where the named tuple is created.  Bypass this step in environments
    # where sys._getframe is not defined (Jython for example) or sys._getframe
    # is not defined for arguments greater than 0 (IronPython), or where the
    # user has specified a particular module.
    if module is None:
        try:
            module = _sys._getframe(1).f_globals.get('__name__', '__main__')
        except (AttributeError, ValueError):
            pass
    if module is not None:
        result.__module__ = module
        __new__.__module__ = module

    return result

>>> WilcoxonResult = _make_tuple_bunch('WilcoxonResult', ['statistic', 'pvalue'])
>>> pl._utils.construction.utils.is_namedtuple(WilcoxonResult)
True

Polars also unpacks this, if you write a wrapper that converts this tuple bunch:

def statswrapper(func):
    def inner(*args, **kwargs):
        res = func(*args, **kwargs)
        return WilcoxonResult(res.statistic, res.pvalue)
    return inner

wilcoxon_wrapped = statswrapper(stats.wilcoxon)

wilcoxon = (
    df.with_columns(obs=pl.struct("infl", "effl"))
      .group_by(pl.col("state"), pl.col("landuse"), pl.col("pollutant"))
      .agg(
         stat=pl.col("obs").map_batches(
#               lambda g: stats.wilcoxon(g.struct.field("infl"), g.struct.field("effl")),
              lambda g: wilcoxon_wrapped(g.struct.field("infl"), g.struct.field("effl")),
              returns_scalar=True
         )
      )
)
wilcoxon

Output:

shape: (8, 4)
┌───────┬─────────┬───────────┬──────────────────┐
│ state ┆ landuse ┆ pollutant ┆ stat             │
│ ---   ┆ ---     ┆ ---       ┆ ---              │
│ str   ┆ str     ┆ str       ┆ struct[2]        │
╞═══════╪═════════╪═══════════╪══════════════════╡
│ WA    ┆ COM     ┆ Cu        ┆ {461.0,0.5309}   │
│ WA    ┆ RES     ┆ Pb        ┆ {201.0,0.160046} │
│ WA    ┆ COM     ┆ Pb        ┆ {408.0,0.984094} │
│ WA    ┆ RES     ┆ Cu        ┆ {269.0,0.144384} │
│ OR    ┆ RES     ┆ Cu        ┆ {319.0,0.633098} │
│ OR    ┆ COM     ┆ Pb        ┆ {331.0,0.4185}   │
│ OR    ┆ COM     ┆ Cu        ┆ {237.0,0.309202} │
│ OR    ┆ RES     ┆ Pb        ┆ {274.0,0.697783} │
└───────┴─────────┴───────────┴──────────────────┘

@mdhaber
Contributor

mdhaber commented Jan 31, 2025

I don't have a problem with adding these attributes if it helps, but I don't understand how it's OK for them to be None if polars is going to look for their presence. Does it really just look and not attempt to use them?

@deanm0000
Author

I would assume it's because neither isinstance(x, namedtuple) nor issubclass(x, namedtuple) works (namedtuple is a factory function, not a class), so they went back to some spec that says a namedtuple will have all of those, even if polars isn't going to use them.
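A quick way to see this with the standard library alone (Point is just an example type):

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

# namedtuple is a plain function, so it cannot be the second argument
# of isinstance (or issubclass).
try:
    isinstance(p, namedtuple)
    raised = False
except TypeError:
    raised = True

# The generated class derives directly from tuple; there is no shared
# "namedtuple" base class to test against, only attributes such as _fields.
print(raised, Point.__mro__, hasattr(Point, "_fields"))
```

With no common base class available, checking for the presence of namedtuple-specific attributes is the remaining option.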

@nickodell
Member

I don't have a problem with adding these attributes if it helps, but I don't understand how it's OK for them to be None if polars is going to look for their presence.

Oh, I wasn't suggesting that we actually set them to None. I was thinking we'd set them to some sensible value. (Who knows who else is inspecting _field_defaults? :) ) I'm just trying to find the very minimum thing that Polars considers a namedtuple.
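As a rough idea of what "sensible" values could look like, here is a sketch on a hand-written tuple subclass. The class BunchWithReplace is hypothetical, and this is not necessarily what SciPy ended up doing; it only assumes that _field_defaults should be a dict (as in collections.namedtuple) and that _replace should return a new instance.

```python
class BunchWithReplace(tuple):
    """Hypothetical tuple bunch with namedtuple-style helpers (not SciPy code)."""

    _fields = ("statistic", "pvalue")
    _extra_fields = ("zstatistic",)
    # collections.namedtuple stores a dict of field defaults here;
    # an empty dict means "no defaults".
    _field_defaults = {}

    def __new__(cls, statistic, pvalue, **extra):
        return super().__new__(cls, (statistic, pvalue))

    def __init__(self, statistic, pvalue, **extra):
        self.__dict__.update(extra)

    def _asdict(self):
        out = dict(zip(self._fields, self))
        out.update(self.__dict__)
        return out

    def _replace(self, **kwargs):
        # Reject names that are neither tuple fields nor extra fields.
        unknown = set(kwargs) - set(self._fields) - set(self._extra_fields)
        if unknown:
            raise ValueError(f"unexpected field names: {sorted(unknown)}")
        # Build a new instance from the current values, overridden by kwargs.
        values = self._asdict()
        values.update(kwargs)
        return type(self)(**values)


res = BunchWithReplace(319.0, 0.63, zstatistic=None)
new = res._replace(pvalue=0.05)
print(tuple(res), tuple(new), new.zstatistic)
```

The original instance is left untouched, which matches how _replace behaves on a genuine namedtuple.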

Does it really just look and not attempt to use them?

Experimentally, it doesn't seem to.

Also, I searched their codebase for _field_defaults and _replace. The namedtuple check is the only place that uses _field_defaults. The only place in their codebase that uses namedtuple._replace() is an unrelated piece of code that calls dis.Instruction._replace(). Searches: 1 2


By the way, another option, besides pretending to be a namedtuple, would be to pretend to be a dataclass, as those get similar treatment from Polars. Source
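To sketch what the dataclass route might involve: the standard library can detect dataclasses with dataclasses.is_dataclass, though whether Polars uses exactly that function is an assumption here. DataclassResult is a hypothetical type, not an existing SciPy class.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass(frozen=True)
class DataclassResult:
    """Hypothetical result type; not an existing SciPy class."""

    statistic: float
    pvalue: float

    def __iter__(self):
        # Keeps `statistic, pvalue = res` working. Indexing and
        # isinstance(res, tuple) would still differ from a real tuple.
        return iter((self.statistic, self.pvalue))


res = DataclassResult(319.0, 0.63)
statistic, pvalue = res
print(is_dataclass(res), [f.name for f in fields(res)])
```

The trade-off is visible in the comment: a pure dataclass is not a tuple, so a real implementation would have to combine this with the tuple behaviour that existing callers rely on.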

@mdhaber
Contributor

mdhaber commented Jan 31, 2025

Oh, I wasn't suggesting that we actually set them to None.

Sure, I was just surprised that None worked, and yeah, I guess that's because it's not actually being used.

Well, I wouldn't mind it if these looked more like either dataclasses or namedtuples. Hopefully this would only take a short, non-invasive PR, in which case I'd be happy to review it.

@lucascolley lucascolley added the query A question or suggestion that requires further information label Feb 1, 2025
@mdhaber
Contributor

mdhaber commented Feb 9, 2025

Closed by gh-22494.

@mdhaber mdhaber closed this as completed Feb 9, 2025
@lucascolley lucascolley added this to the 1.16.0 milestone Feb 9, 2025