API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

jbrockmendel · 2021-01-02T02:09:29Z

xref #33457 which is about similar issue but goes through different code paths.

In Block.setitem in cases where we are setting all the values for this block we have:

        elif exact_match and is_categorical_dtype(arr_value.dtype):
            # GH25495 - If the current dtype is not categorical, we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))

        elif exact_match and is_ea_value:
            # GH#32395 if we're going to replace the values entirely, just substitute in the new array
            return self.make_block(arr_value)

        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value

            values = values.astype(arr_value.dtype, copy=False)

So we overwrite the existing values for categorical value or non-EA value. Example:

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": pd.array([1, 2, None], dtype="Int64"),
    "C": ["a", "b", "c"]
})
orig_df = df[:]

arr_np = df["A"]._values
arr_ea = df["B"]._values
cat = pd.Categorical(df["C"])

# Note: there are many equivalent-looking ways of doing this setitem operation but few of them go through this code path.
df.loc[range(3), "A"] = arr_np[::-1]
df.loc[range(3), "B"] = arr_ea[::-1]
df.loc[range(3), "C"] = cat[::-1]

>>> df
     A     B  C
0  0.3  <NA>  c
1  0.2     2  b
2  0.1     1  a

>>> df.dtypes
A     float64
B       Int64
C    category
dtype: object

>>> orig_df
     A     B  C
0  0.3     1  a
1  0.2     2  b
2  0.1  <NA>  c

>>> orig_df.dtypes
A    float64
B      Int64
C     object
dtype: object

The categorical behavior we implemented in #23393 and AFAICT the over-writing behavior was not discussed/intentional. Similarly the other EA behavior was implemented in #32479 and I don't see anything about the overwrite-or-not. I haven't tracked down the origin of the non-EA behavior.

I think all three cases should have the same behavior. We should also have the same behavior for should-be-equivalent setters, e.g. if we used iloc instead of loc, or [:, "A"] instead of [range(3), "A"].

I think I agree with @TomAugspurger's comment that these should always be in-place, but not sure ATM if that can be done without breaking consistency elsewhere.

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2021-01-04T18:43:17Z

cc @jorisvandenbossche I noticed a funky behavior with IntegerArray/FloatingArray that I'd like to get your thoughts on.

arr = pd.array([4, pd.NA, 6])
arr2 = pd.array([4, 5, 6])
df = pd.DataFrame(np.arange(3))
blk = df._mgr.blocks[0]

>>> blk._can_hold_element(arr)
True

in this case the _can_hold_element is pretty clearly wrong. It ends up not breaking anything because, as in the OP, the call to .setitem swaps in arr without ever trying to set the values in-place.

My question for you is: what about when we have a IntegerArray/Floating array without NAs, so can_hold_element=True is correct? Would you expect it to set the values inplace or upcast to the nullable dtype?

jorisvandenbossche · 2021-01-05T22:04:35Z

My question for you is: what about when we have a IntegerArray/Floating array without NAs, so can_hold_element=True is correct? Would you expect it to set the values inplace or upcast to the nullable dtype?

When is can_hold_element exactly used? (i.e. which user example covers your question? A .loc setitem as in the OP df.loc[range(3), "A"] = arr where the original column is int64 and the RHS arr an Int64 without NAs?)

In [9]: df = pd.DataFrame({
   ...:     "A": [1, 2, 3],
   ...:     "B": [4, 5, 6],
   ...: })

In [10]: df.loc[:, 'A'] = pd.array([1, 2, 3])

In [11]: df.dtypes
Out[11]: 
A    Int64
B    int64
dtype: object

So currently we indeed change the dtype, but so you are saying that in case there are no nulls, the result could preserve the "int64" dtype? Since the rule for loc[:, "col"] is to update in place if possible, that sounds good to me.

jbrockmendel · 2021-01-16T19:35:01Z

Gentle ping @jreback @TomAugspurger I think Joris and I are in agreement, is there concensus on the propositions that

a) frame[key] = value should not be operating inplace

b) frame.loc[:, key] and frame.iloc[:, num] should try to operate inplace before falling back to casting

TomAugspurger · 2021-01-16T19:51:22Z

I think those rules are sensible.

jreback · 2021-01-18T15:14:25Z

yep i agree here, these are the de-facto rules that we have had for quite a long time (at least implicitly). can we also write these down explicity in the docs.

jbrockmendel · 2021-01-19T16:34:15Z

Great. #39044, #39163, #39266 all push in this direction. After those (and possibly #38709) we can consolidate a bunch of Series.__setitem__/iloc.__setitem__ code and de-special case 1-block vs multiple-block paths.

jorisvandenbossche · 2021-02-12T18:57:51Z

I ran into this problem when implementing it for ArrayManager, and I see now that Brock also noted it at #39163 (comment): one problem with this general rule is that you then can't set a new column positionally any longer using df.iloc[:, N] = ....

I think there should at least be some way to do this in all cases.

The general recommendation is then that you can lookup the name with the position, like df[df.columns[N]] = ..., but that doesn't work with duplicate labels (which is exactly a case where you might need to resort to index positionally).

jbrockmendel · 2021-02-12T21:16:09Z

I think there should at least be some way to do this in all cases.

Agreed. Brainstorming:

frame.iloc(axis=1)[2] = foo? (usually i find the loc/iloc axis attribute more trouble than its worth, but this may end up being a compelling use case)

frame.iloc[pd.IndexSlice(inplace=False), 2] = foo (or some sentinel other than IndexSlice)

jbrockmendel · 2021-03-03T19:24:31Z

Option 3: DataFrame.insert should work even with non-unique columns

jorisvandenbossche · 2021-03-05T21:08:02Z

(the loc/iloc part is not yet resolved I think)

simonjayhawkins · 2021-06-11T12:09:42Z

@jorisvandenbossche we've got the 1.3 milestone on this one (added for the merged PR). do we need a new issue/remove the milestone?

simonjayhawkins · 2021-06-11T18:30:55Z

i'll remove the milestone for now.

jbrockmendel · 2022-01-12T23:20:57Z

(the loc/iloc part is not yet resolved I think)

@jorisvandenbossche i think the remaining aspect of this is better covered in #44353 and that this is closable. pls confirm.

mroeschke · 2022-07-05T23:46:50Z

Looks like the discussion stalled here if there even was remaining items. If there are follow ups probably best to open a new issue to pair down the discussion. Closing

jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 2, 2021

arw2019 added Copy / view semantics ExtensionArray Extending pandas with custom dtypes or arrays. Categorical Categorical Data Type API - Consistency Internal Consistency of API/Behavior and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Jan 2, 2021

This was referenced Jan 8, 2021

BUG/API: make setitem-inplace preserve dtype when possible with PandasArray, IntegerArray, FloatingArray #39044

Closed

API/BUG: always try to operate inplace when setting with loc/iloc[foo, bar] #39163

Merged

This was referenced Jan 20, 2021

WIP/REF: BlockManager.setitem_blockwise #39302

Closed

Clean up DataFrame.setitem behavior for duplicate columns #39403

Merged

jbrockmendel mentioned this issue Jan 31, 2021

REF/API: DataFrame.__setitem__ never operate in-place #39510

Merged

4 tasks

jorisvandenbossche mentioned this issue Feb 4, 2021

DISCUSS/API: setitem-like operations should only update inplace and never fallback with upcast (i.e never change the dtype) #39584

Closed

jreback added this to the 1.3 milestone Feb 7, 2021

jreback closed this as completed in #39163 Mar 5, 2021

jorisvandenbossche reopened this Mar 5, 2021

simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021

mroeschke added the Bug label Aug 14, 2021

jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jan 12, 2022

mroeschke closed this as completed Jul 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

jbrockmendel commented Jan 2, 2021

jbrockmendel commented Jan 4, 2021

jorisvandenbossche commented Jan 5, 2021

jbrockmendel commented Jan 16, 2021

TomAugspurger commented Jan 16, 2021

jreback commented Jan 18, 2021

jbrockmendel commented Jan 19, 2021

jorisvandenbossche commented Feb 12, 2021

jbrockmendel commented Feb 12, 2021

jbrockmendel commented Mar 3, 2021

jorisvandenbossche commented Mar 5, 2021

simonjayhawkins commented Jun 11, 2021

simonjayhawkins commented Jun 11, 2021

jbrockmendel commented Jan 12, 2022

mroeschke commented Jul 5, 2022

API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

Comments

jbrockmendel commented Jan 2, 2021

jbrockmendel commented Jan 4, 2021

jorisvandenbossche commented Jan 5, 2021

jbrockmendel commented Jan 16, 2021

TomAugspurger commented Jan 16, 2021

jreback commented Jan 18, 2021

jbrockmendel commented Jan 19, 2021

jorisvandenbossche commented Feb 12, 2021

jbrockmendel commented Feb 12, 2021

jbrockmendel commented Mar 3, 2021

jorisvandenbossche commented Mar 5, 2021

simonjayhawkins commented Jun 11, 2021

simonjayhawkins commented Jun 11, 2021

jbrockmendel commented Jan 12, 2022

mroeschke commented Jul 5, 2022