Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

Closed
jbrockmendel opened this issue Jan 2, 2021 · 14 comments · Fixed by #39163
Closed

API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

jbrockmendel opened this issue Jan 2, 2021 · 14 comments · Fixed by #39163
Labels
API - Consistency Internal Consistency of API/Behavior Bug Categorical Categorical Data Type Closing Candidate May be closeable, needs more eyeballs Copy / view semantics ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@jbrockmendel
Copy link
Member

xref #33457 which is about similar issue but goes through different code paths.

In Block.setitem in cases where we are setting all the values for this block we have:

        elif exact_match and is_categorical_dtype(arr_value.dtype):
            # GH25495 - If the current dtype is not categorical, we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))

        elif exact_match and is_ea_value:
            # GH#32395 if we're going to replace the values entirely, just substitute in the new array
            return self.make_block(arr_value)

        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value

            values = values.astype(arr_value.dtype, copy=False)

So we overwrite the existing values for categorical value or non-EA value. Example:

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": pd.array([1, 2, None], dtype="Int64"),
    "C": ["a", "b", "c"]
})
orig_df = df[:]

arr_np = df["A"]._values
arr_ea = df["B"]._values
cat = pd.Categorical(df["C"])

# Note: there are many equivalent-looking ways of doing this setitem operation but few of them go through this code path.
df.loc[range(3), "A"] = arr_np[::-1]
df.loc[range(3), "B"] = arr_ea[::-1]
df.loc[range(3), "C"] = cat[::-1]

>>> df
     A     B  C
0  0.3  <NA>  c
1  0.2     2  b
2  0.1     1  a

>>> df.dtypes
A     float64
B       Int64
C    category
dtype: object

>>> orig_df
     A     B  C
0  0.3     1  a
1  0.2     2  b
2  0.1  <NA>  c

>>> orig_df.dtypes
A    float64
B      Int64
C     object
dtype: object

The categorical behavior we implemented in #23393 and AFAICT the over-writing behavior was not discussed/intentional. Similarly the other EA behavior was implemented in #32479 and I don't see anything about the overwrite-or-not. I haven't tracked down the origin of the non-EA behavior.

I think all three cases should have the same behavior. We should also have the same behavior for should-be-equivalent setters, e.g. if we used iloc instead of loc, or [:, "A"] instead of [range(3), "A"].

I think I agree with @TomAugspurger's comment that these should always be in-place, but not sure ATM if that can be done without breaking consistency elsewhere.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 2, 2021
@arw2019 arw2019 added Copy / view semantics ExtensionArray Extending pandas with custom dtypes or arrays. Categorical Categorical Data Type API - Consistency Internal Consistency of API/Behavior and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Jan 2, 2021
@jbrockmendel
Copy link
Member Author

cc @jorisvandenbossche I noticed a funky behavior with IntegerArray/FloatingArray that I'd like to get your thoughts on.

arr = pd.array([4, pd.NA, 6])
arr2 = pd.array([4, 5, 6])
df = pd.DataFrame(np.arange(3))
blk = df._mgr.blocks[0]

>>> blk._can_hold_element(arr)
True

in this case the _can_hold_element is pretty clearly wrong. It ends up not breaking anything because, as in the OP, the call to .setitem swaps in arr without ever trying to set the values in-place.

My question for you is: what about when we have a IntegerArray/Floating array without NAs, so can_hold_element=True is correct? Would you expect it to set the values inplace or upcast to the nullable dtype?

@jorisvandenbossche
Copy link
Member

My question for you is: what about when we have a IntegerArray/Floating array without NAs, so can_hold_element=True is correct? Would you expect it to set the values inplace or upcast to the nullable dtype?

When is can_hold_element exactly used? (i.e. which user example covers your question? A .loc setitem as in the OP df.loc[range(3), "A"] = arr where the original column is int64 and the RHS arr an Int64 without NAs?)

In [9]: df = pd.DataFrame({
   ...:     "A": [1, 2, 3],
   ...:     "B": [4, 5, 6],
   ...: })

In [10]: df.loc[:, 'A'] = pd.array([1, 2, 3])

In [11]: df.dtypes
Out[11]: 
A    Int64
B    int64
dtype: object

So currently we indeed change the dtype, but so you are saying that in case there are no nulls, the result could preserve the "int64" dtype? Since the rule for loc[:, "col"] is to update in place if possible, that sounds good to me.

@jbrockmendel
Copy link
Member Author

Gentle ping @jreback @TomAugspurger I think Joris and I are in agreement, is there concensus on the propositions that

a) frame[key] = value should not be operating inplace

b) frame.loc[:, key] and frame.iloc[:, num] should try to operate inplace before falling back to casting

@TomAugspurger
Copy link
Contributor

I think those rules are sensible.

@jreback
Copy link
Contributor

jreback commented Jan 18, 2021

yep i agree here, these are the de-facto rules that we have had for quite a long time (at least implicitly). can we also write these down explicity in the docs.

@jbrockmendel
Copy link
Member Author

Great. #39044, #39163, #39266 all push in this direction. After those (and possibly #38709) we can consolidate a bunch of Series.__setitem__/iloc.__setitem__ code and de-special case 1-block vs multiple-block paths.

@jorisvandenbossche
Copy link
Member

I ran into this problem when implementing it for ArrayManager, and I see now that Brock also noted it at #39163 (comment): one problem with this general rule is that you then can't set a new column positionally any longer using df.iloc[:, N] = ....

I think there should at least be some way to do this in all cases.

The general recommendation is then that you can lookup the name with the position, like df[df.columns[N]] = ..., but that doesn't work with duplicate labels (which is exactly a case where you might need to resort to index positionally).

@jbrockmendel
Copy link
Member Author

I think there should at least be some way to do this in all cases.

Agreed. Brainstorming:

frame.iloc(axis=1)[2] = foo? (usually i find the loc/iloc axis attribute more trouble than its worth, but this may end up being a compelling use case)

frame.iloc[pd.IndexSlice(inplace=False), 2] = foo (or some sentinel other than IndexSlice)

@jbrockmendel
Copy link
Member Author

Option 3: DataFrame.insert should work even with non-unique columns

@jorisvandenbossche
Copy link
Member

(the loc/iloc part is not yet resolved I think)

@simonjayhawkins
Copy link
Member

@jorisvandenbossche we've got the 1.3 milestone on this one (added for the merged PR). do we need a new issue/remove the milestone?

@simonjayhawkins
Copy link
Member

i'll remove the milestone for now.

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@mroeschke mroeschke added the Bug label Aug 14, 2021
@jbrockmendel
Copy link
Member Author

(the loc/iloc part is not yet resolved I think)

@jorisvandenbossche i think the remaining aspect of this is better covered in #44353 and that this is closable. pls confirm.

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jan 12, 2022
@mroeschke
Copy link
Member

Looks like the discussion stalled here if there even was remaining items. If there are follow ups probably best to open a new issue to pair down the discussion. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Bug Categorical Categorical Data Type Closing Candidate May be closeable, needs more eyeballs Copy / view semantics ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants