
BUG: DataFrame.append with timedelta64 #39574

Merged 16 commits into pandas-dev:master on Feb 12, 2021

Conversation

jbrockmendel (Member)

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

AFAICT several of the existing tests are just wrong, xref #39122 cc @jorisvandenbossche

Fixes (but haven't yet added a test for) #39037 (comment)

@@ -97,7 +97,7 @@ def _cast_to_common_type(arr: ArrayLike, dtype: DtypeObj) -> ArrayLike:
return arr.astype(dtype, copy=False)


-def concat_compat(to_concat, axis: int = 0):
+def concat_compat(to_concat, axis: int = 0, pretend_axis1: bool = False):
Contributor

why don't you just allow axis=None and handle that case? this is very odd naming here (I appreciate the de-duplication that this allows); so I guess a better question is: is this temporary?

Member Author

> why don't you just allow axis=None and handle that case? this is very odd naming here

I want the naming to be very clear that this is a don't-try-this-at-home kludge (I need to update the docstring to that effect).

> so I guess a better question is: is this temporary?

Inasmuch as it won't be needed once we have 2D EAs, yes.

Contributor

OK, to the extent this is not removed anytime soon, can you come up with a better argument name or another way of doing this?

Member Author

renamed + green

@jreback added the "Dtype Conversions" (Unexpected or buggy dtype conversions) and "Reshaping" (Concat, Merge/Join, Stack/Unstack, Explode) labels on Feb 3, 2021
@jorisvandenbossche (Member)

Can you summarize the behaviour that is changed / bug that is fixed?

@jbrockmendel (Member Author)

> Can you summarize the behaviour that is changed / bug that is fixed?

df = DataFrame({"a": pd.array([1], dtype="Int64")})
other = DataFrame({"a": [np.timedelta64("NaT", "ns")]})
result = df.append(other, ignore_index=True)

>>> result
      a
0     1
1  <NA>
>>> result.dtypes
a    Int64
dtype: object

df = DataFrame(columns=["a"]).astype("Int64")
other = DataFrame({"a": [np.timedelta64(1, "ns")]})

>>> result = df.append(other, ignore_index=True)
ValueError: Wrong number of dimensions. values.ndim != ndim [1 != 2]

@jorisvandenbossche left a comment (Member)

I am personally a bit hesitant to change the all-NaN special case. That's longstanding behaviour (kept for practical reasons as well), and changing it will break code, I think.

@@ -72,6 +72,8 @@ def concat_compat(to_concat, axis: int = 0):
----------
to_concat : array of arrays
axis : axis to provide concatenation
ea_compat_axis : bool, default False
For ExtensionArray compat, behave as if axis == 1
Member

Can you clarify what it means in practice to behave "as if axis == 1"? Because I assume the arrays are still concatenated along axis=0?

Member Author

just matters for the dropping-empty check; will edit to clarify
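
For readers following the thread, here is a minimal sketch of how a flag like ea_compat_axis could gate the dropping-empty step in concat_compat. The flag name comes from the docstring diff above; the filtering logic and the np.concatenate stand-in are assumptions for illustration, not the actual pandas internals.

import numpy as np

def concat_compat_sketch(to_concat, axis: int = 0, ea_compat_axis: bool = False):
    """Sketch only: concatenate arrays, optionally dropping length-0 pieces."""

    def is_nonempty(arr) -> bool:
        # Arrays with fewer dims than `axis` skip the length check entirely.
        if arr.ndim <= axis:
            return True
        return arr.shape[axis] > 0

    # The "dropping-empty check": empty pieces are filtered out only for
    # axis=0 concatenation, and only when the caller did NOT ask for
    # axis==1-like (ExtensionArray-compat) behaviour.
    if axis == 0 and not ea_compat_axis:
        non_empties = [arr for arr in to_concat if is_nonempty(arr)]
        if non_empties:
            to_concat = non_empties

    # np.concatenate is a stand-in for the dtype-reconciliation machinery
    # that the real concat_compat performs.
    return np.concatenate(to_concat, axis=axis)

Per the comment above, this filter is the only place the flag matters: keeping or dropping the empty pieces decides whether they take part in the rest of the concatenation.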

@jbrockmendel (Member Author)

@jorisvandenbossche helpful comments, thank you. I found a nice way to retain the None/NaN behavior while fixing the original bug: supplementing JoinUnit.is_na with a check that the value is a compatible NA.
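
To make "a compatible NA" concrete: the idea is that None and float NaN are dtype-agnostic missing values, while a NaT carries a unit family, so a timedelta64 NaT should not count as missing data for, say, an Int64 column. The PR does this via is_valid_nat_for_dtype; the helper below is only a hypothetical illustration of the same distinction, not the pandas implementation.

import numpy as np

def is_compatible_na_sketch(value, target_dtype: np.dtype) -> bool:
    # Hypothetical helper, not the pandas implementation.
    # None and float NaN are "generic" NAs: compatible with any target dtype.
    if value is None or (isinstance(value, float) and np.isnan(value)):
        return True
    # NaT values only match their own unit family:
    # datetime64 NaT -> datetime64 targets, timedelta64 NaT -> timedelta64 targets.
    if isinstance(value, np.datetime64) and np.isnat(value):
        return target_dtype.kind == "M"
    if isinstance(value, np.timedelta64) and np.isnat(value):
        return target_dtype.kind == "m"
    return False

# The case from the bug report: timedelta64("NaT") is not a compatible NA
# for an integer column, so its dtype can no longer be silently ignored.
print(is_compatible_na_sketch(np.timedelta64("NaT", "ns"), np.dtype("int64")))  # False
print(is_compatible_na_sketch(np.nan, np.dtype("int64")))                       # True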

Comment on lines 239 to 243
if self.dtype == object:
values = self.block.values
return all(
is_valid_nat_for_dtype(x, dtype) for x in values.ravel(order="K")
)
@jorisvandenbossche (Member), Feb 10, 2021

This is not only needed for object dtype, I think. A float NaN is also considered "all NaN" when it comes to ignoring the dtype while concatenating DataFrames (and other dtypes as well, I think):

In [39]: pd.concat([pd.DataFrame({'a': [np.nan, np.nan]}), pd.DataFrame({'a': [pd.Timestamp("2012-01-01")]})])
Out[39]: 
           a
0        NaT
1        NaT
0 2012-01-01

Member Author

the non-object case is handled below on L245-246. or do you have something else in mind?

Member

Does my snippet above work with this PR?
(if so, then I don't fully understand why the changes to test_append_empty_frame_to_series_with_dateutil_tz are needed)

Member Author

> Does my snippet above work with this PR?

Yes, it does.

> (if so, then I don't fully understand why the changes to test_append_empty_frame_to_series_with_dateutil_tz are needed)

I think that's driven by something sketchy-looking in get_reindexed_values, will see if that can be addressed.

Member Author

better?

@jreback added this to the 1.3 milestone on Feb 10, 2021
@jbrockmendel (Member Author)

@jorisvandenbossche I think your comments have all been addressed; let me know if I missed anything.

@jbrockmendel (Member Author)

@jorisvandenbossche gentle ping before your day ends

@jorisvandenbossche (Member)

Can you add a whatsnew?

@jbrockmendel (Member Author)

will do, thanks

Co-authored-by: Joris Van den Bossche <[email protected]>
tm.assert_frame_equal(result_b, expected)

# column order is different
expected = expected[["c", "d", "date", "a", "b"]]
result = df.append([s, s], ignore_index=True)
dtype = Series([date]).dtype
expected["date"] = expected["date"].astype(dtype)
Member

Is this still needed? (might be a left-over from astyping it to object before)

Member Author

you're right, updated

@jorisvandenbossche left a comment (Member)

@jreback merged commit bcf2406 into pandas-dev:master on Feb 12, 2021
@jbrockmendel deleted the bug-concat-4 branch on February 12, 2021