Specify which columns to keep #190

tqa236 · 2019-06-25T10:11:30Z

Currently, I think pandas-profiling auto-rejects the columns that are highly correlated to previous columns. This is nice.

However, there are some important features that I would like to include in the final data frame (for the interpretability of a machine learning model). Is there a way to specify important columns that I want to keep? That is, if such a column is highly correlated with a previous column, the previous column would be rejected instead.


adamrossnelson · 2019-06-25T12:02:15Z

I had a similar request. But, for now, hopefully, this workaround will be a welcome contribution to the conversation on this issue?

import pandas as pd
import pandas_profiling

# I am a fan of this data set.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# Length left out due to correlation with weight.
pandas_profiling.ProfileReport(df)

# Pass rejected variables to new profile report. 
rejected = pandas_profiling.ProfileReport(df).get_rejected_variables(threshold=0.9)
pandas_profiling.ProfileReport(df[rejected])

Edit 1: As explained in Issue 183 I have not yet updated to v2.0.0.

Edit 2: As I dug deeper (and updated to v2.0.0+), I found that this issue's feature is available. My example implementation:

import pandas as pd
import pandas_profiling

# I am a fan of this data set.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
profile = df.profile_report()

# Get list of rejected variables.
rejected = profile.get_rejected_variables()

# Pass rejected variables with the correlation_overrides option.
df.profile_report(title='Stata Auto.dta Pandas Profiled',
                  correlation_overrides=rejected)

Edit 3: Corrected syntax error. correlation_overrides=[rejected] --> correlation_overrides=rejected

sbrugman · 2019-06-25T15:00:02Z

@adamrossnelson thank you for helping, there is indeed the correlation_overrides parameter, @tqa236 does this solve your problem?

tqa236 · 2019-06-25T15:58:13Z

Hello, thank you for the code. I just reinstalled version 2.0.0 of this package (which doesn't exist on Anaconda yet), and I think my problem is only half solved.

Solved: I can keep the columns I want to keep.
Not solved: an ordinary column is not automatically rejected even if it is correlated with a column I asked to keep.

For example, columns 2, 4 and 5 are highly correlated in my data frame. By default, this package will eliminate columns 4 and 5.

df.profile_report()

However, when I ask to keep column 4, only column 5 is rejected. Column 2 is still kept.

df.profile_report(correlation_overrides=["Column 4"])

I would like to reject both columns 2 and 5 in this case, and I think that is better than rejecting only column 5.

What is your opinion?
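To make the requested behaviour concrete, here is a minimal pandas-only sketch of an order-dependent rejection rule that visits the columns to keep first. It is not how pandas-profiling is implemented; the `reject_correlated` helper, its `keep` and `threshold` parameters, and the demo data are all hypothetical.

```python
import pandas as pd


def reject_correlated(df, keep=(), threshold=0.9):
    """Return the columns to reject, giving `keep` columns priority.

    Hypothetical helper for illustration, not part of pandas-profiling.
    """
    # Absolute pairwise Pearson correlation of the numeric columns only.
    corr = df.select_dtypes(include="number").corr().abs()

    # Visit the columns to keep first, then the rest in their original order.
    order = [c for c in keep if c in corr.columns]
    order += [c for c in corr.columns if c not in order]

    accepted, rejected = [], []
    for col in order:
        # Reject a column if it is highly correlated with any accepted column.
        if any(corr.loc[col, other] > threshold for other in accepted):
            rejected.append(col)
        else:
            accepted.append(col)
    return rejected


# Made-up data: columns 2, 4 and 5 are perfectly correlated with each other.
demo = pd.DataFrame({
    "Column 1": [1, 5, 2, 8, 3, 7],
    "Column 2": [1, 2, 3, 4, 5, 6],
    "Column 3": [6, 1, 4, 2, 5, 3],
    "Column 4": [2, 4, 6, 8, 10, 12],
    "Column 5": [4, 7, 10, 13, 16, 19],
})

print(reject_correlated(demo))                     # ['Column 4', 'Column 5']
print(reject_correlated(demo, keep=["Column 4"]))  # ['Column 2', 'Column 5']
```

With `keep=["Column 4"]`, column 2 is rejected instead of column 4, which is the outcome asked for above.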

tobycheese · 2019-07-10T10:23:14Z

IMHO a simple 'reject nothing, but report high correlation warning anyway' option would be nice. With the solution above, I have to run the report twice, and I don't get the warning.
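Until such an option exists, a "warn but reject nothing" check can be approximated outside the report with plain pandas. This is only a sketch under that assumption: `high_correlation_pairs`, its `threshold` parameter, and the demo data are hypothetical and not part of pandas-profiling.

```python
import pandas as pd


def high_correlation_pairs(df, threshold=0.9):
    """List (left, right, |r|) for numeric column pairs above the threshold.

    Hypothetical helper for illustration; nothing is dropped from `df`.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        # Only look at columns to the right so each pair is reported once.
        for right in cols[i + 1:]:
            value = float(corr.loc[left, right])
            if value > threshold:
                pairs.append((left, right, value))
    return pairs


# Made-up data: weight and length move together, price does not.
demo = pd.DataFrame({
    "weight": [1, 2, 3, 4, 5],
    "length": [2, 4, 6, 8, 11],
    "price": [5, 1, 4, 2, 3],
})

for left, right, value in high_correlation_pairs(demo):
    print(f"warning: {left} and {right} are highly correlated (|r| = {value:.3f})")
```

The data frame is left untouched, so the full set of columns can still be profiled in a single run.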

github-actions · 2020-02-16T00:01:14Z

Stale issue

adamrossnelson · 2020-02-16T01:31:31Z

I'm a fan of this package. This enhancement would be pretty helpful. GitHub just marked it stale?

I might be able to contribute code for this enhancement in March. But I'd also like to ask the creator whether an enhancement is already in the works?

sbrugman · 2020-02-26T15:13:55Z

This issue shouldn't have been closed. The GitHub Action does not remove the no-issue-activity label on comments (see actions/stale#21).

@adamrossnelson Thank you for picking this up. A PR is welcome :) The behaviour towards rejecting variables has changed since this issue was opened, so please check how the latest version handles these to avoid duplicate work.

adamrossnelson · 2020-03-29T22:08:40Z

I am contributing three more workaround options. These were inspired by @tobycheese, who said: "IMHO a simple 'reject nothing, but report high correlation warning anyway' option would be nice. With the solution above, I have to run the report twice, and I don't get the warning." This comment got six thumbs up. For what it is worth, I think this problem has been solved by the addition of the check_correlation attribute (also demonstrated below).

The first new workaround is to leverage the existing correlation_overrides attribute. To do this, pass df.columns.tolist() to the attribute:

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, correlation_overrides=df.columns.tolist())

The second new workaround is to leverage the existing correlation_threshold attribute. To do this, pass the value 1 to the attribute:

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, correlation_threshold=1)

The third new workaround is to leverage the existing check_correlation attribute. To do this, pass False to the attribute:

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, check_correlation=False)

These solutions do not fully solve the issue as described by @tqa236. However, I would propose that the best way to pick and choose which columns to include is to reorder the columns: place the column you want to keep ahead of the other columns. Reordering columns in Pandas is not a simple task, really. But here is a useful reference:

https://towardsdatascience.com/reordering-pandas-dataframe-columns-thumbs-down-on-standard-solutions-1ff0bc2941d5
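For a small data frame, the reordering itself can be sketched with plain pandas column selection. The `move_to_front` helper and the demo column names are hypothetical, and whether column order decides which column gets rejected depends on the pandas-profiling version in use.

```python
import pandas as pd


def move_to_front(df, priority):
    """Return a copy of `df` with the `priority` columns placed first.

    Hypothetical helper for illustration; the remaining columns keep
    their original relative order.
    """
    missing = [c for c in priority if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    rest = [c for c in df.columns if c not in priority]
    return df[list(priority) + rest]


demo = pd.DataFrame({"weight": [1, 2], "price": [3, 4], "length": [5, 6]})
reordered = move_to_front(demo, ["length"])
print(list(reordered.columns))  # ['length', 'weight', 'price']
```

The reordered frame can then be passed to the report in place of the original.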

I'm planning a PR that will elaborate on the descriptions of these attributes. Will reference this issue in the PR.

github-actions · 2020-05-29T00:01:39Z

Stale issue

tqa236 added the feature request 💬 Requests for new features label Jun 25, 2019

github-actions bot added the no-issue-activity label Feb 16, 2020

github-actions bot closed this as completed Feb 24, 2020

sbrugman reopened this Feb 26, 2020

sbrugman removed the no-issue-activity label Feb 26, 2020

github-actions bot added the no-issue-activity label May 29, 2020

github-actions bot closed this as completed Jun 5, 2020
