Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Excel writer doesn't handle "cols" option correctly #5429

Merged
merged 1 commit into from
Nov 6, 2013
Merged

BUG: Excel writer doesn't handle "cols" option correctly #5429

merged 1 commit into from
Nov 6, 2013

Conversation

jmcnamara
Copy link
Contributor

closes #5427

Notes on this PR:

# Case 1
df = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])

df.columns = ['A', 'B', 'B']
# Case 2
df = pd.DataFrame({'A': ['a', 'a', 'a'],
                   'B': ['b', 'b', 'b']})

df.to_excel('frame.xlsx', sheet_name='Sheet1', cols=['B', 'A'])

The proposed solution is to iterate through self.columns in the default or user supplied order. If a duplicate column name is encountered (i.e. if df['B'] returns more than one series) then we select the first series and keep track of that index. If the duplicate column name is encountered again then we select the next available series from df['B'] up to the last series available. If the duplicate name is encountered again then we return the last series again.

The patch is slightly kludgy. I didn't know how to check if df[col] contained more than one series so I used len(self.df[col_name].columns) in a try:catch. Hopefully there is something cleaner.

The change is repeated twice in the code.

There is a new test for this issue and an existing test for the previous issue.

I don't think this needs a release note since it is a fix for an issue that was never released.

@@ -941,6 +941,22 @@ def test_duplicated_columns(self):

tm.assert_frame_equal(write_frame, read_frame)

def test_swapped_columns(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you know there's an issue with dup columns, probably good to add a test case for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtratner There is a test for duplicate columns right above this one from the previous "fix".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

@jreback
Copy link
Contributor

jreback commented Nov 4, 2013

to_csv handles this exact issue

I'll put a refenenece up

@jreback
Copy link
Contributor

jreback commented Nov 4, 2013

You can do almost exactly the same things, essentially you iterate on the blocks (which are dtypes separated), and use the self.column_map to map the column (which handles all of the dup issues)

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L897

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1176

@jmcnamara
Copy link
Contributor Author

to_csv handles this exact issue

@jreback I had a look at that already but it didn't look to be easily transferable since to_csv is handling the data in a row column order. But maybe it is. I don't mind additional input.

As an aside, row/column order handling was something that I was interested in adding. In general the to_csv code is much simpler than the to_excel. I don't see any substantive reason for the Excel complexity.

@jtratner
Copy link
Contributor

jtratner commented Nov 4, 2013

maybe it was an issue of compatibility with one of the readers?

@jmcnamara
Copy link
Contributor Author

You can do almost exactly the same things, essentially you iterate on the blocks (which are dtypes separated), and use the self.column_map to map the column (which handles all of the dup issues)

Ok, I'm wrong, that does look applicable. I'll look at it again.

@jtratner
Copy link
Contributor

jtratner commented Nov 4, 2013

ping me again when you feel this is ready

@jmcnamara
Copy link
Contributor Author

@jreback, @jtratner

Updated with simpler fix based on the to_csv() implementation.

@jmcnamara
Copy link
Contributor Author

Also, is the tm.assert_series_equal() method supposed to act like this:

import pandas as pd
import pandas.util.testing as tm

series1 = pd.Series(['a', 'b', 'c'])
series2 = pd.Series(['d', 'e', 'f'])

tm.assert_series_equal(series1, series2)
# No exception.

That sent me in circles for a while because a test that should have failed didn't.

@jreback
Copy link
Contributor

jreback commented Nov 4, 2013

that should assert, looks like a bug in a particular part of the checking code (which was recently changed)

@@ -1373,7 +1373,8 @@ def _format_regular_rows(self):
yield ExcelCell(self.rowcounter + idx, 0, idxval, header_style)

for colidx in range(len(self.columns)):
series = self.df.iloc[:, colidx]
# Get series for column name accounting for dupes.
series = self.df.loc[:, self.columns].iloc[:, colidx]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move the self.df.loc[:, self.columns] out of the inner loop, right? It's not different each iteration and no point in recalculating...

@jreback
Copy link
Contributor

jreback commented Nov 4, 2013

@jmcnamara rebase on current master and the assert_series_equal will now work correctly....that was a bug! thanks!

@jreback
Copy link
Contributor

jreback commented Nov 4, 2013

In [4]: series1 = pd.Series(['a', 'b', 'c'])

In [5]: series2 = pd.Series(['d', 'e', 'f'])

In [6]: tm.assert_series_equal(series1, series2)

AssertionError: 'a' != 'd'

@jmcnamara
Copy link
Contributor Author

Ok guys, thanks for that. Update includes:

  • Rebased to pick up test fix.
  • Fixed temp path in test cases.
  • Moving frame copy out of loop (d'oh).

@jreback
Copy link
Contributor

jreback commented Nov 6, 2013

@jtratner looks fine, ok to merge

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

@jmcnamara - can you clarify one thing? I thought you said there was a test case for dup columns specified by the user? Shouldn't that test case be failing without this PR? If not, can you add a test that covers specified dupe columns?

Maybe I'm just missing something here.

Aside from that, this looks good to me and I will merge after I hear back.

@jmcnamara
Copy link
Contributor Author

can you clarify one thing? I thought you said there was a test case for dup columns specified by the user? Shouldn't that test case be failing without this PR?

Sorry for the confusion.

Initially there was an issue reported on SO where the Excel output was failing with duplicate columns.

I wrote a test and fix for that in issue #5235 and PR #5237 but inadvertently introduced a bug for swapped columns.

This PR fixes the swapped columns bug without displacing the duplicate column fix. In fact it should fix any further issues of this kind (at least as robustly as to_csv()).

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

@jmcnamara could you rebase this on top of master to make sure it merges well with your other PR (that was just merged into master)?

Added fix to ensure that Series associated with column
names are handled correctly even if names are duplicated
or swapped. Issue #5427
@jmcnamara
Copy link
Contributor Author

Rebase complete. Good to merge.

jtratner added a commit that referenced this pull request Nov 6, 2013
BUG: Excel writer doesn't handle "cols" option correctly
@jtratner jtratner merged commit 80d8a7a into pandas-dev:master Nov 6, 2013
@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Excel writer doesn't handle "cols" option correctly
3 participants