Specify which columns to keep #190

Closed
tqa236 opened this issue Jun 25, 2019 · 9 comments

Labels
feature request 💬 Requests for new features

Comments

@tqa236
Contributor

tqa236 commented Jun 25, 2019

Currently, I think pandas-profiling auto-rejects columns that are highly correlated with earlier columns. This is nice.

However, there are some important features that I would like to include in the final data frame (for the interpretability of a Machine Learning model). Is there a way for me to specify some important columns that I want to keep? That is, if a column I want to keep is highly correlated with an earlier column, the earlier column would be rejected instead.

@tqa236 tqa236 added the feature request 💬 Requests for new features label Jun 25, 2019
@adamrossnelson

adamrossnelson commented Jun 25, 2019

I had a similar request. For now, I hope this workaround is a welcome contribution to the conversation on this issue.

import pandas as pd
import pandas_profiling

# I am a fan of this data set.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# Length left out due to correlation with weight.
pandas_profiling.ProfileReport(df)

# Pass rejected variables to new profile report. 
rejected = pandas_profiling.ProfileReport(df).get_rejected_variables(threshold=0.9)
pandas_profiling.ProfileReport(df[rejected])

Edit 1: As explained in Issue 183 I have not yet updated to v2.0.0.

Edit 2: As I dug deeper (and updated to v2.0.0+), I found that this issue's feature is available. My example implementation:

import pandas as pd
import pandas_profiling

# I am a fan of this data set.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
profile = df.profile_report()

# Get list of rejected variables.
rejected = profile.get_rejected_variables()

# Pass rejected variables with the correlation_overrides option.
df.profile_report(title='Stata Auto.dta Pandas Profiled',
                  correlation_overrides=rejected)

Edit 3: Corrected syntax error. correlation_overrides=[rejected] --> correlation_overrides=rejected

@sbrugman
Collaborator

@adamrossnelson thank you for helping. There is indeed a correlation_overrides parameter. @tqa236, does this solve your problem?

@tqa236
Contributor Author

tqa236 commented Jun 25, 2019

Hello, thank you for the code. I just reinstalled version 2.0.0 of this package (which doesn't exist on Anaconda yet), and I think my problem is only half solved.

The part that is

  • solved: I can keep the columns I want to keep.
  • not solved: the other columns are not automatically rejected, even when they are correlated with the column I keep.

For example, columns 2, 4, and 5 are highly correlated in my data frame. By default, this package will eliminate columns 4 and 5.

df.profile_report()

However, when I ask to keep column 4, only column 5 is rejected. Column 2 is still kept.

df.profile_report(correlation_overrides=["Column 4"])

I would like to reject both column 2 and column 5 in this case, and I think that is better than rejecting only column 5.

What is your opinion?
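
The behaviour requested above can be sketched outside the package. The following is only an illustration of the idea, not pandas-profiling's implementation: `reject_correlated`, its `keep` argument, and the five-column correlation matrix are all hypothetical. It assumes a greedy pass over the columns in which protected columns are visited first, so that anything correlated with them is rejected instead.

```python
def reject_correlated(corr, keep=(), threshold=0.9):
    """Return the columns to reject, never rejecting a column listed in `keep`.

    `corr` is a nested mapping where corr[a][b] is the correlation between
    columns a and b (hypothetical input format for this sketch).
    """
    columns = list(corr)
    keep = [c for c in keep if c in corr]
    # Visit protected columns first so that they are accepted before any
    # column they are correlated with.
    ordered = keep + [c for c in columns if c not in keep]
    accepted, rejected = [], []
    for col in ordered:
        clashes = any(abs(corr[col][other]) > threshold for other in accepted)
        if clashes and col not in keep:
            rejected.append(col)
        else:
            accepted.append(col)
    return rejected


# Made-up matrix: columns 2, 4 and 5 are highly correlated with each other.
names = [f"Column {i}" for i in range(1, 6)]
high = {
    frozenset(pair)
    for pair in [
        ("Column 2", "Column 4"),
        ("Column 2", "Column 5"),
        ("Column 4", "Column 5"),
    ]
}
corr = {
    a: {
        b: 1.0 if a == b else (0.95 if frozenset((a, b)) in high else 0.1)
        for b in names
    }
    for a in names
}

print(reject_correlated(corr))                     # ['Column 4', 'Column 5']
print(reject_correlated(corr, keep=["Column 4"]))  # ['Column 2', 'Column 5']
```

With a real data frame, a matrix of this shape could be built from the output of `DataFrame.corr()`, for example by calling `.to_dict()` on it.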

@tobycheese

IMHO a simple 'reject nothing, but report high correlation warning anyway' option would be nice. With the solution above, I have to run the report twice, and I don't get the warning.
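
As a rough illustration of what "reject nothing, but warn" could look like, here is a minimal sketch that only lists the highly correlated pairs and leaves every column in place. It is not part of pandas-profiling: `high_correlation_pairs` is a hypothetical helper, and the correlation values are made up for the example.

```python
from itertools import combinations


def high_correlation_pairs(corr, threshold=0.9):
    """List (column_a, column_b, |correlation|) for every pair above the threshold."""
    pairs = []
    for a, b in combinations(list(corr), 2):
        value = abs(corr[a][b])
        if value > threshold:
            pairs.append((a, b, value))
    return pairs


# Illustrative values only, not computed from a real data set.
corr = {
    "weight": {"weight": 1.0, "length": 0.95, "mpg": -0.81},
    "length": {"weight": 0.95, "length": 1.0, "mpg": -0.80},
    "mpg": {"weight": -0.81, "length": -0.80, "mpg": 1.0},
}

# Warn about each pair, but do not drop any column.
for a, b, value in high_correlation_pairs(corr):
    print(f"warning: {a} and {b} are highly correlated ({value:.2f})")
```

Because nothing is dropped, the report would only need to run once, and the warning stays visible.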

@github-actions

Stale issue

@adamrossnelson

I'm a fan of this package. This enhancement would be pretty helpful. GitHub just marked it stale?

I might be able to contribute code for this enhancement in March. But I'd also like to ask the creator whether an enhancement is already in the works.

@sbrugman
Collaborator

This issue shouldn't have been closed. The GitHub Action does not remove the no-issue-activity label on comments (see actions/stale#21).

@adamrossnelson Thank you for picking this up. A PR is welcome :) The behaviour towards rejecting variables has changed since this issue was opened, so please check how the latest version handles this to avoid duplicate work.

@adamrossnelson

I am contributing three more workaround options. These were inspired by @tobycheese, who said: "IMHO a simple 'reject nothing, but report high correlation warning anyway' option would be nice. With the solution above, I have to run the report twice, and I don't get the warning." This comment got six thumbs up. For what it is worth, I think this problem has been solved by the addition of the check_correlation attribute (also demonstrated below).

The first new workaround is to leverage the existing correlation_overrides attribute. To do this, pass df.columns.tolist() to the attribute:

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, correlation_overrides=df.columns.tolist())

The second new workaround is to leverage the existing correlation_threshold attribute. To do this, pass the value 1 to the attribute:

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, correlation_threshold=1)

The third new workaround is to leverage the existing check_correlation attribute. To do this, pass False to the attribute:

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, check_correlation=False)

These solutions do not fully solve the issue as described by @tqa236. However, I would propose that the best way to pick and choose which columns are kept is to reorder the columns: place the column you want to keep ahead of the other columns. Reordering columns in Pandas is not a simple task, really. But here is a useful reference:

https://towardsdatascience.com/reordering-pandas-dataframe-columns-thumbs-down-on-standard-solutions-1ff0bc2941d5
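
A minimal sketch of the reordering idea, assuming (as described above) that a column placed earlier is the one that survives. `move_to_front` is a hypothetical helper, not a pandas or pandas-profiling function, and the column list is an illustrative subset of column names.

```python
def move_to_front(columns, keep):
    """Return the column names with those in `keep` first, the rest in original order."""
    columns = list(columns)
    keep = [c for c in keep if c in columns]
    return keep + [c for c in columns if c not in keep]


# Illustrative column names; with a real data frame pass df.columns instead.
columns = ["make", "price", "mpg", "weight", "length"]
print(move_to_front(columns, ["length"]))
# ['length', 'make', 'price', 'mpg', 'weight']

# With pandas, the reordered list can be used to select the columns:
# df = df[move_to_front(df.columns, ["length"])]
```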

I'm planning a PR that will elaborate on the descriptions of these attributes. Will reference this issue in the PR.

@github-actions

Stale issue


4 participants