Pandas 0.12 unexpected results when comparing dataframe to list or tuple #4576
Comments
These yield different shapes (which is confusing as you are using a 2x2 frame). A single list/tuple becomes a column, while a list-of-lists yields rows.
|
OK, I see my example was a bit shady because of the 2x2 size. Here's an example with a 3x2 dataframe:
In [10]: df=pd.DataFrame(np.arange(6).reshape((3,2)))
In [11]: df
Out[11]:
0 1
0 0 1
1 2 3
2 4 5
If a list/tuple really becomes a column, why doesn't a 1D array?
In [12]: df > [2, 2]
Out[12]:
0 1
0 True True
1 True True
2 True True
In [13]: df > np.array([2, 2])
Out[13]:
0 1
0 False False
1 False True
2 True True
I think both should give the same result, since both the list and the 1D array are... well, 1D objects. This would also match what happens with numpy 2D arrays:
In [17]: df.values > [2, 2]
Out[17]:
array([[False, False],
[False, True],
[ True, True]], dtype=bool)
A similar example that confuses me:
In [24]: row_vector = np.atleast_2d([2,2])
In [25]: df > row_vector
Out[25]:
0 1
0 True True
1 True True
2 True True
In [26]: df.values > row_vector
Out[26]:
array([[False, False],
[False, True],
[ True, True]], dtype=bool)
Wouldn't it be logical in this case for row_vector, with shape (1,2), to be broadcast to (3,2) before the comparison?
EDIT: these examples were with pandas 0.12 and numpy 1.7.1 |
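The broadcast asked about above can be spelled out with NumPy alone. This is a minimal sketch, assuming a reasonably current NumPy; `np.broadcast_to` is used only to make the implicit step visible, and the variable names `values`, `stretched`, `explicit` and `implicit` are illustrative, not taken from the thread.

```python
import numpy as np

values = np.arange(6).reshape((3, 2))   # same data as df.values above
row_vector = np.atleast_2d([2, 2])      # shape (1, 2)

# Stretch the single row to the frame's shape by hand ...
stretched = np.broadcast_to(row_vector, values.shape)   # shape (3, 2)

# ... and compare both ways: with the stretched copy and with the raw row.
explicit = values > stretched
implicit = values > row_vector

print(explicit.tolist())
print(np.array_equal(explicit, implicit))
```

Both comparisons give the element-wise result shown in Out[26], which is the behaviour the comment expects from the DataFrame as well.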
actually that's not correct a passed
This is exactly the same as numpy behavior. There isn't any implicit broadcasting,
You are just passing a list which is a column, that's it. Remember that since you are not passing an index/columns, pandas has to follow |
I would like to disagree about pandas following numpy behaviour here. Firstly, 1d numpy arrays do not have a defined direction - they are just 1d vectors. For example,
In [30]: a = np.array([1,2,3])
In [31]: a.shape
Out[31]: (3,)
In [32]: a
Out[32]: array([1, 2, 3])
In [33]: a.T
Out[33]: array([1, 2, 3])
In [34]: a==a.T
Out[34]: array([ True, True, True], dtype=bool)
Therefore, it is in my opinion rather dangerous to assume that lists or tuples don't have a shape but that 1d arrays do. I believe they should behave identically.
The second issue is that numpy does broadcast with comparison operators, just as ssalonen showed above. I guess it would be OK if pandas didn't, but that should be an explicit and documented deviation from numpy semantics.
Third, regardless of the broadcasts, I believe the comparison operators in pandas are quite broken at the moment:
In [49]: df = pd.DataFrame(np.arange(6).reshape((3,2)))
In [50]: b = np.array([2, 2])
In [51]: b_r = np.atleast_2d([2,2])
In [52]: b_c = b_r.T
In [53]: df > b
Out[53]:
0 1
0 False False
1 False True
2 True True
In [54]: df > b_r
Out[54]:
0 1
0 True True
1 True True
2 True True
In [55]: df > b_c
Out[55]:
0 1
0 False False
1 False True
2 True True
I don't quite understand the element-wise comparisons made in the example above. Some broadcasts must be taking place, but not in any logical fashion. Also, the equality operator should work with the same semantics as greater-than. However, it does not:
In [60]: df == b
Out[60]:
0 1
0 False False
1 True False
2 False False
In [61]: df == b_r
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[...]
TypeError: Could not compare [array([[2, 2]])] with block values
In [62]: df == b_c
Out[62]:
0 1
0 False False
1 True False
2 False False
To me it would seem that 1d and column vectors behave as broadcast row vectors in the comparison operators, while row vectors are more thoroughly broken. :-) |
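The point about 1d arrays having no direction, and about lists and tuples deserving the same treatment, can be checked on the NumPy side. This is a small sketch with illustrative names; it shows only what NumPy does and makes no claim about any pandas version.

```python
import numpy as np

a = np.array([1, 2, 3])
# Transposing a 1-D array is a no-op: there is only one axis to permute.
same_shape = a.T.shape == a.shape

values = np.arange(6).reshape((3, 2))
# NumPy converts a list or tuple to a 1-D array before comparing,
# so all three spellings of the right-hand side give the same result.
from_list = values > [2, 2]
from_tuple = values > (2, 2)
from_array = values > np.array([2, 2])

print(same_shape)
print(np.array_equal(from_list, from_array), np.array_equal(from_tuple, from_array))
```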
ok... so the results from
|
I don't think they should be the same. Rather, unless any pandas-specific index alignment is performed in the comparisons, the behaviour should follow that of numpy:
In [64]: df.values > b
Out[64]:
array([[False, False],
[False, True],
[ True, True]], dtype=bool)
In [65]: df.values > b_c
[...]
ValueError: operands could not be broadcast together with shapes (3,2) (2,1)
In [66]: df.values > b_r
Out[66]:
array([[False, False],
[False, True],
[ True, True]], dtype=bool)
The above would seem to indicate that numpy treats 1d vectors as row vectors, so please disregard anything I wrote about it earlier. ;-) If you maintain a strong opinion that broadcasts should be avoided, then exceptions should be raised for df>b and df>b_r, too. |
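The reason the (2,1) column fails while the 1d vector works is that NumPy lines shapes up from the trailing axis. This is a hedged sketch of that rule; `col3` and `per_row` are names made up for this example.

```python
import numpy as np

values = np.arange(6).reshape((3, 2))
b = np.array([2, 2])                # shape (2,): lines up with the 2 columns
b_c = b.reshape(-1, 1)              # shape (2, 1): 2 rows cannot match 3 rows
col3 = np.array([[0], [2], [4]])    # shape (3, 1): one threshold per row

try:
    values > b_c
    raised = False
except ValueError:
    # (3, 2) against (2, 1): the leading axes are 3 and 2, neither is 1.
    raised = True

# A column only broadcasts when its length equals the number of rows.
per_row = values > col3

print(raised)
print(per_row.tolist())
```

So a 1d vector acts like a row because its single axis is matched against the last axis of the 2d array, not because it carries a direction of its own.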
Some color. If the rhs side is a pure-numpy array, there is NO alignment done (as we would If I turn off the broadcast catching
I think you would like to see SEE THE PR...I updated to show what I suggested
|
@mairas did you take a look at the PR? I believe it solves all of the open questions.... |
To me it looks like we are going in the right direction. Since we should have numpy behaviour here (as no alignment is done), the expected results should be the same as with df.values > x?
df.values > b
Out[247]:
array([[False, False],
[False, True],
[ True, True]], dtype=bool)
df.values > b_r
Out[248]:
array([[False, False],
[False, True],
[ True, True]], dtype=bool)
Note the different result in the second example above. Incompatible shapes should result in an exception, since a broadcast is not possible. Numpy does not do an automatic transpose in this case, which I think is a good thing.
df.values > b_c
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-249-4e61d2a85a75> in <module>()
----> 1 df.values > b_c
ValueError: operands could not be broadcast together with shapes (3,2) (2,1)
The numpy implementation is a bit different for the == comparison; see the following examples:
df.values == b
Out[259]:
array([[False, False],
[ True, False],
[False, False]], dtype=bool)
df.values == b_r
Out[260]:
array([[False, False],
[ True, False],
[False, False]], dtype=bool)
df.values == b_c
Out[261]: False
The final example is especially interesting: no exception is raised, even though the inequality comparison raises one. Examples with numpy 1.6.1 |
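One way to get the same failure mode from == as from > is to check the shapes before comparing. The helper below is hypothetical (`strict_eq` is not part of NumPy or pandas) and is only a sketch of the idea. The scalar `False` above was observed with numpy 1.6.1; nothing is assumed here about what a bare == does in other versions.

```python
import numpy as np

def strict_eq(left, right):
    """Element-wise ==, raising ValueError on incompatible shapes.

    Hypothetical helper for illustration; not a NumPy or pandas API.
    """
    left, right = np.asarray(left), np.asarray(right)
    # np.broadcast raises ValueError when the shapes cannot be broadcast,
    # so the check does not depend on how == itself reacts to a mismatch.
    np.broadcast(left, right)
    return left == right

values = np.arange(6).reshape((3, 2))
ok = strict_eq(values, [2, 2])

try:
    strict_eq(values, np.array([[2], [2]]))   # shape (2, 1), like b_c
    raised = False
except ValueError:
    raised = True

print(ok.tolist())
print(raised)
```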
Numpy does weird things like this (bottom example), but we will raise
|
I agree that pandas should raise with the equality operator. The examples and pull request test cases did not include dataframe comparison to a list/tuple. I believe they behave the same way as a numpy 1D array, right? |
The examples are there now (in the PR page) |
Sorry for not replying earlier - saw shiny things elsewhere. :-) The semantics look great to me now! Thanks for bearing with us! :-) Cheers, ma. On Aug 23, 2013, at 17:17, jreback [email protected] wrote:
|
The behavior may be changed after #13637. Pls comment if any thoughts. |
It seems that when comparing a DataFrame to a list or tuple of values (with the same length as the DataFrame's columns), the resulting boolean DataFrame is incorrect.