-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Excel writer doesn't handle "cols" option correctly #5429
Conversation
@@ -941,6 +941,22 @@ def test_duplicated_columns(self): | |||
|
|||
tm.assert_frame_equal(write_frame, read_frame) | |||
|
|||
def test_swapped_columns(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you know there's an issue with dup columns, probably good to add a test case for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtratner There is a test for duplicate columns right above this one from the previous "fix".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it.
to_csv handles this exact issue I'll put a refenenece up |
You can do almost exactly the same things, essentially you iterate on the blocks (which are dtypes separated), and use the self.column_map to map the column (which handles all of the dup issues) https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L897 https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1176 |
@jreback I had a look at that already but it didn't look to be easily transferable since As an aside, row/column order handling was something that I was interested in adding. In general the |
maybe it was an issue of compatibility with one of the readers? |
Ok, I'm wrong, that does look applicable. I'll look at it again. |
ping me again when you feel this is ready |
Also, is the import pandas as pd
import pandas.util.testing as tm
series1 = pd.Series(['a', 'b', 'c'])
series2 = pd.Series(['d', 'e', 'f'])
tm.assert_series_equal(series1, series2)
# No exception. That sent me in circles for a while because a test that should have failed didn't. |
that should assert, looks like a bug in a particular part of the checking code (which was recently changed) |
@@ -1373,7 +1373,8 @@ def _format_regular_rows(self): | |||
yield ExcelCell(self.rowcounter + idx, 0, idxval, header_style) | |||
|
|||
for colidx in range(len(self.columns)): | |||
series = self.df.iloc[:, colidx] | |||
# Get series for column name accounting for dupes. | |||
series = self.df.loc[:, self.columns].iloc[:, colidx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can move the self.df.loc[:, self.columns]
out of the inner loop, right? It's not different each iteration and no point in recalculating...
@jmcnamara rebase on current master and the assert_series_equal will now work correctly....that was a bug! thanks! |
|
Ok guys, thanks for that. Update includes:
|
@jtratner looks fine, ok to merge |
@jmcnamara - can you clarify one thing? I thought you said there was a test case for dup columns specified by the user? Shouldn't that test case be failing without this PR? If not, can you add a test that covers specified dupe columns? Maybe I'm just missing something here. Aside from that, this looks good to me and I will merge after I hear back. |
Sorry for the confusion. Initially there was an issue reported on SO where the Excel output was failing with duplicate columns. I wrote a test and fix for that in issue #5235 and PR #5237 but inadvertently introduced a bug for swapped columns. This PR fixes the swapped columns bug without displacing the duplicate column fix. In fact it should fix any further issues of this kind (at least as robustly as to_csv()). |
@jmcnamara could you rebase this on top of master to make sure it merges well with your other PR (that was just merged into master)? |
Added fix to ensure that Series associated with column names are handled correctly even if names are duplicated or swapped. Issue #5427
Rebase complete. Good to merge. |
BUG: Excel writer doesn't handle "cols" option correctly
Thanks! |
closes #5427
Notes on this PR:
cols
optionThe proposed solution is to iterate through self.columns in the default or user supplied order. If a duplicate column name is encountered (i.e. if
df['B']
returns more than one series) then we select the first series and keep track of that index. If the duplicate column name is encountered again then we select the next available series fromdf['B']
up to the last series available. If the duplicate name is encountered again then we return the last series again.The patch is slightly kludgy. I didn't know how to check if
df[col]
contained more than one series so I usedlen(self.df[col_name].columns)
in atry:catch
. Hopefully there is something cleaner.The change is repeated twice in the code.
There is a new test for this issue and an existing test for the previous issue.
I don't think this needs a release note since it is a fix for an issue that was never released.