Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected results for diagonal entries when using generic callable in corr #25726

Closed
fbnrst opened this issue Mar 14, 2019 · 14 comments · Fixed by #25732
Closed

Unexpected results for diagonal entries when using generic callable in corr #25726

fbnrst opened this issue Mar 14, 2019 · 14 comments · Fixed by #25732
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@fbnrst
Copy link
Contributor

fbnrst commented Mar 14, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({'A': [1,2,3], 'B': [2,5,6]})
print(df.corr(method=lambda x, y: pearsonr(x, y)[1]))

          A         B
A  1.000000  0.178912
B  0.178912  1.000000

Problem description

I want to use the method argument of corr to compute p-values. However, diagonal elements are set to 1. I would expect them to be 0. They are set to 1 here:

pandas/pandas/core/frame.py

Lines 7025 to 7026 in cb00deb

elif i == j:
c = 1.

Although I can see that for a 'normal' correlation 1 is expected, this is not the case in my example. Hence, I would suggest to remove these two lines from frame.py.

Expected Output

          A         B
A  0.000000  0.178912
B  0.178912  0.000000

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.165-81-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.1.1
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.3
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.3
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.4.0-b1
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@TomAugspurger
Copy link
Contributor

Why do you expect the self-correlation to be 0?

@fbnrst
Copy link
Contributor Author

fbnrst commented Mar 14, 2019

I am not expecting the correlation coefficient to be 0. What I compute in the example above is the p-value for the significance of the correlation. And I expect this p-value to be 0.

The following example calculates the correlation coefficient and the p-value for a self-correlation.

from scipy.stats import pearsonr
a = [1,2,3]
print(pearsonr(a, a))

(1.0, 0.0)

In a way, I am misusing corr, because I do not compute correlation coefficients. However, I expect that if corr takes a generic callable, the result of this callable should be calculated also for the diagonals.

@TomAugspurger
Copy link
Contributor

Ahh, understood. I'm not sure about changing this.

cc @shadiakiki1986 if you have thoughts.

@fbnrst
Copy link
Contributor Author

fbnrst commented Mar 14, 2019

In line of my request, there is a question on StackOverflow about calculating p-Values. Using a callable in corr seems the perfect answer to this question:
https://stackoverflow.com/questions/25571882/pandas-columns-correlation-with-statistical-significance

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 14, 2019 via email

@fbnrst
Copy link
Contributor Author

fbnrst commented Mar 14, 2019

I can see this. So what I am asking for is a more general method than corr that computes pairwise summary statistics of columns. Does such a method already exist for pandas or would this be a feature request?

In case you want to keep the current implementation for the diagonals: Would you agree, that it would be good to mention in the documentation that pandas expects the callable to return 1 for diagonal elements? At least, I did not expect this.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 14, 2019 via email

@fbnrst
Copy link
Contributor Author

fbnrst commented Mar 14, 2019

I would prefer if diagonal elements are computed by the supplied method as this is what I need.

@shadiakiki1986
Copy link
Contributor

As @fabianrost84 indicated earlier, the 1 in the diagonal is indeed hard-coded and not returned by the generic callable passed to the .corr function. The below callable would still generate 1's along the diagonal

import pandas as pd
import numpy as np
return_zero = lambda a, b: 0
df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)], columns=['dogs', 'cats'])
df.corr(method=return_zero)

It would be convenient for this particular p-value use case to have the callable calculate the diagonals too, but it opens the door to other changes such as the resultant matrix from corr not having to be symmetric.

For the p-value issue, simply subtracting the diagonal would do, e.g. df.corr(method=...) - np.eye(len(df.columns)). I'm all for documenting the behavior and keeping the implementation as is.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 14, 2019 via email

@fbnrst
Copy link
Contributor Author

fbnrst commented Mar 14, 2019

What about this?

"""
            * callable: callable with input two 1d ndarrays
                and returning a float. The callable is expected to be commutative 
                and to return 1.0 for two identical input ndarrays.
                .. versionadded:: 0.24.0

@shadiakiki1986
Copy link
Contributor

shadiakiki1986 commented Mar 14, 2019 via email

@fbnrst
Copy link
Contributor Author

fbnrst commented Mar 14, 2019

Should I add this to #25729 and make the PR a bit more general? Or should I open a new PR for this?

@fbnrst
Copy link
Contributor Author

fbnrst commented Mar 14, 2019

#25729 is already merged, so I'll open a new PR.

fbnrst added a commit to fbnrst/pandas that referenced this issue Mar 14, 2019
@gfyoung gfyoung added Numeric Operations Arithmetic, Comparison, and Logical operations Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 16, 2019
@jreback jreback added this to the 0.25.0 milestone Mar 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants