Specify which columns to keep #190
Comments
I had a similar request. For now, I hope this workaround is a welcome contribution to the conversation on this issue:
import pandas as pd
import pandas_profiling
# I am a fan of this data set.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
# Length left out due to correlation with weight.
pandas_profiling.ProfileReport(df)
# Pass rejected variables to new profile report.
rejected = pandas_profiling.ProfileReport(df).get_rejected_variables(threshold=0.9)
pandas_profiling.ProfileReport(df[rejected])
Edit 1: As explained in Issue 183, I have not yet updated to v2.0.0.
Edit 2: As I dug deeper (and updated to v2.0.0+), I found that the feature requested in this issue is available. My example implementation:
import pandas as pd
import pandas_profiling
# I am a fan of this data set.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
profile = df.profile_report()
# Get list of rejected variables.
rejected = profile.get_rejected_variables()
# Pass rejected variables with the correlation_overrides option.
df.profile_report(title='Stata Auto.dta Pandas Profiled',
                  correlation_overrides=rejected)
Edit 3: Corrected syntax error. |
@adamrossnelson thank you for helping, there is indeed the |
Hello, thank you for the code. I just reinstalled version 2.0.0 of this package (which doesn't exist on Anaconda yet) and I think my problem is only half solved. The part that is
For example, columns 2, 4 and 5 are highly correlated in my data frame. By default, this package will eliminate columns 4 and 5.
However, I want to keep column 4. In this case, only column 5 is rejected; column 2 is still kept.
I would like to reject both columns 2 and 5 in this case, which I think is better than rejecting only column 5. What is your opinion? |
IMHO a simple 'reject nothing, but report high correlation warning anyway' option would be nice. With the solution above, I have to run the report twice, and I don't get the warning. |
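To make the "reject nothing, but report the warning" idea concrete, here is a minimal sketch that flags highly correlated column pairs without dropping anything. It is plain Python and not part of pandas-profiling; the function names `pearson` and `high_correlation_warnings`, the 0.9 threshold default, and the data are all made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def high_correlation_warnings(columns, threshold=0.9):
    """Return (left, right, r) for every column pair with |r| above threshold.

    `columns` maps a column name to its list of values. No column is dropped.
    """
    flagged = []
    for left, right in combinations(columns, 2):
        r = pearson(columns[left], columns[right])
        if abs(r) > threshold:
            flagged.append((left, right, round(r, 3)))
    return flagged


# Made-up data: "length" is exactly twice "weight", "price" is unrelated.
data = {
    "weight": [1, 2, 3, 4, 5],
    "length": [2, 4, 6, 8, 10],
    "price": [5, 1, 4, 2, 3],
}

print(high_correlation_warnings(data))  # [('weight', 'length', 1.0)]
```

With a real data frame, `df.corr()` would supply the correlations; the pure-Python version above only keeps the example free of dependencies.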
Stale issue |
I'm a fan of this package. This enhancement would be pretty helpful. GitHub just marked it stale? I might be able to contribute code for this enhancement in March. But I'd also ask the creator whether an enhancement is in the works? |
This issue shouldn't have been closed. GitHub Actions does not remove the
@adamrossnelson Thank you for picking this up. A PR is welcome :) The behaviour towards rejecting variables has changed since this issue was opened, so please check how the latest version handles these to avoid duplicate work. |
I am contributing three more workaround options. These were inspired by @tobycheese, who said: "IMHO a simple 'reject nothing, but report high correlation warning anyway' option would be nice. With the solution above, I have to run the report twice, and I don't get the warning." This comment got six thumbs up. For what it is worth, I think this problem has been solved by the addition of the
The first new workaround is to leverage the existing
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, correlation_overrides=df.columns.tolist())
The second new workaround is to leverage the existing
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, correlation_threshold=1)
The third new workaround is to leverage the existing
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
ProfileReport(df, check_correlation=False)
These solutions do not fully solve the issue as described by @tqa236. However, I would propose that the best way to pick which columns to include is to reorder the columns: place the column you want to keep ahead of the other columns. Reordering columns in pandas is not a simple task, really. But here is a useful reference:
I'm planning a PR that will elaborate on the descriptions of these attributes. I will reference this issue in the PR. |
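For the reordering suggestion, a small helper along these lines may be enough. This is only a sketch: `move_to_front` is a hypothetical name, and the column names are loosely modelled on the auto data set rather than read from it.

```python
def move_to_front(columns, keep):
    """Return `columns` reordered so the names in `keep` come first.

    Names in `keep` appear in the order given; the remaining columns keep
    their original relative order. Names not present in `columns` are ignored.
    """
    present = set(columns)
    front = [name for name in keep if name in present]
    front_set = set(front)
    rest = [name for name in columns if name not in front_set]
    return front + rest


# Hypothetical column names for illustration.
columns = ["price", "mpg", "weight", "length"]
reordered = move_to_front(columns, ["length"])
print(reordered)  # ['length', 'price', 'mpg', 'weight']
```

On a data frame this would be applied as `df = df[move_to_front(list(df.columns), ["length"])]` before building the report.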
Stale issue |
Currently, I think pandas-profiling auto-rejects the columns that are highly correlated with previous columns. This is nice. However, there are some important features that I would like to include in the final data frame (for the interpretability of a machine learning model). Is there a way for me to specify some important columns that I want to keep? That is, if such a column is highly correlated with a previous column, the previous column would be rejected instead.
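The order dependence described here can be illustrated with a short simulation. This is not pandas-profiling's implementation, only a sketch of the behaviour as described in this thread, under the assumption that a column is rejected when it is highly correlated with an earlier column that was kept. The function name `rejected_columns` and the `col1`..`col5` names are hypothetical.

```python
def rejected_columns(order, correlated_pairs):
    """Return the columns rejected by an order-dependent rule.

    A column is rejected when it is highly correlated with any column that
    appears earlier in `order` and was itself kept. `correlated_pairs` is an
    iterable of 2-tuples of column names.
    """
    pairs = {frozenset(pair) for pair in correlated_pairs}
    kept, rejected = [], []
    for column in order:
        if any(frozenset((column, earlier)) in pairs for earlier in kept):
            rejected.append(column)
        else:
            kept.append(column)
    return rejected


# Hypothetical frame where col2, col4 and col5 are mutually correlated.
pairs = [("col2", "col4"), ("col2", "col5"), ("col4", "col5")]

default_order = ["col1", "col2", "col3", "col4", "col5"]
print(rejected_columns(default_order, pairs))  # ['col4', 'col5']

col4_first = ["col4", "col1", "col2", "col3", "col5"]
print(rejected_columns(col4_first, pairs))  # ['col2', 'col5']
```

Under that assumption, moving the column to keep ahead of the others produces the requested outcome: the important column survives and the columns correlated with it are rejected instead.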