-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
COMPAT: unique() should preserve the dtype of the input #27874
Conversation
21f7003
to
979843e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would need actual tests for non 64bit dtypes
422804c
to
7262d9e
Compare
@jreback I added the requested test cases and rebased. The CI is passing now. BTW, I had a problem with an unrelated test. It was expected to fail, but in this PR it magically started passing -- under Python 3.5 only. Probably somehow related to dict ordering. Anyway, I "fixed" the issue by simply permitting the test to FWIW, I'm not the only one who ran into that |
@@ -1489,7 +1490,8 @@ def test_value_counts_with_nan(self): | |||
"unicode_", | |||
"timedelta64[h]", | |||
pytest.param( | |||
"datetime64[D]", marks=pytest.mark.xfail(reason="GH#7996", strict=True) | |||
"datetime64[D]", | |||
marks=pytest.mark.xfail(reason="GH#7996", strict=not PY35), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this xfail. Typically we just reference open issues. What's causing this to fail on 3.5? What's the failure you get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's causing this to fail on 3.5?
Just to be clear: The problem is that this test doesn't fail in Python 3.5. But since it's marked with xfail(..., strict=True)
, that breaks the test suite.
You can see a failed build log here:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=15986
The relevant lines are:
=================================== FAILURES ===================================
TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[True-datetime64[D]]
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[False-datetime64[D]]
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[None-datetime64[D]]
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
-------- generated xml file: /home/vsts/work/1/s/test-data-multiple.xml --------
What's the failure you get?
In Python 3.7, the test xfails
as expected. Removing xfail
identifies this line as the problem. And here's what pytest
shows in that case:
> raise_assert_detail(obj, msg, lobj, robj)
E AssertionError: Series are different
E
E Series values are different (50.0 %)
E [left]: [False, True, True, True]
E [right]: [False, False, False, True]
pandas/_libs/testing.pyx:178: AssertionError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this xfail. Typically we just reference open issues.
The referenced issue seems to imply that the trouble is related to converting/comparing datetime[D]
to datetime[ns]
. In this test, the input is datetime[D]
, but it's implicitly converted to datetime[ns]
when it is loaded into a Categorical
.
One simple hack to make this test pass is to change datetime[D]
to datetime[ns]
. That doesn't seem appropriate, though.
Anyway, to get an idea of what is actually going wrong, here's what happens when I try the test's first three lines in my terminal. Note that the first item becomes NaT
for some reason.
In [190]: cat_array = np.array([1, 2, 3, 4, 5], dtype=np.dtype(dtype))
In [191]: input1 = np.array([1, 2, 3, 3], dtype=np.dtype(dtype))
In [192]: tc1 = Series(Categorical(input1, categories=cat_array, ordered=False))
In [193]: tc1
Out[193]:
0 NaT
1 1970-01-03
2 1970-01-04
3 1970-01-04
dtype: category
Categories (5, datetime64[ns]): [1970-01-02, 1970-01-03, 1970-01-04, 1970-01-05, 1970-01-06]
186a544
to
71c6d1b
Compare
@@ -187,7 +187,24 @@ def setup_method(self, method): | |||
types = ["bool", "int", "float", "dt", "dt_tz", "period", "string", "unicode"] | |||
self.indexes = [getattr(self, "{}_index".format(t)) for t in types] | |||
self.series = [getattr(self, "{}_series".format(t)) for t in types] | |||
self.objs = self.indexes + self.series | |||
|
|||
# To test narrow dtypes, we use narrower *data* elements, not *index* elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole thing badly needs parameterization.
can you pull out the unique tests and do that instead of repeating all of this?
can you merge master; update the release note to 1.0, will have a look after |
Sorry I haven't had time to look at this. I'll try to get to it next week. |
@stuarteberg looked reasonable, can you merge master and move the note to 1.0.0 |
71c6d1b
to
1aec960
Compare
Done. Sorry I haven't cleaned up the tests yet. If its sufficient as-is, great. If not, let me know. |
thanks @stuarteberg this looks great! Can you create an issue to parameterize test_base (where indicated)? A PR would be most appreciated for that as well (if you can). |
this did not have any issue (other than the PR itself) associated? can you do a quick search to see if this solves any open issues? |
OK, I couldn't find an issue for this in particular, but I just found #22824, which has a much broader scope. It does mention the return type issue:
But since this PR doesn't resolve everything mentioned in that issue, it should remain open. But I made a comment there referencing this PR.
OK, I have an issue draft ready to go (pasted into the details section below), but github noticed that #23877 is similar. Is that already sufficient, or would you like me to open the new issue anyway? Issue draft regarding test_baseThe tests in FWIW, this was noticed while reviewing #27874, so @jreback requested this issue to be opened. |
@stuarteberg no #23877 is ok |
thanks @stuarteberg |
The behavior of
pd.unique()
is surprising, because -- unlikenp.unique()
-- the result does not have the samedtype
as the input:This PR just casts the output array to match the input dtype. Supercedes #27869.
Update: Augmented the tests to cover narrow dtypes.
I added a new assertion intest_value_counts_unique_nunique()
, but it may not be sufficient. From what I can see, there isn't good coverage ofSeries
whose data is notint
/float
/ etc. There is only good coverage of various index types. Any advice concerning test coverage?black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff