BUG: Excel writer doesn't handle "cols" option correctly #5429

jmcnamara · 2013-11-04T01:38:29Z

Notes on this PR:

This fix addresses an issue introduced in Issue with Excel writers when column names are duplicated #5235 (Issue with Excel writers when column names are duplicated).
Basically it needs to handle 2 different use cases:
- The user uses duplicate column names
- The user specifies a different order via the cols option

# Case 1
df = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])

df.columns = ['A', 'B', 'B']

# Case 2
df = pd.DataFrame({'A': ['a', 'a', 'a'],
                   'B': ['b', 'b', 'b']})

df.to_excel('frame.xlsx', sheet_name='Sheet1', cols=['B', 'A'])

The proposed solution is to iterate through self.columns in the default or user supplied order. If a duplicate column name is encountered (i.e. if df['B'] returns more than one series) then we select the first series and keep track of that index. If the duplicate column name is encountered again then we select the next available series from df['B'] up to the last series available. If the duplicate name is encountered again then we return the last series again.

The patch is slightly kludgy. I didn't know how to check if df[col] contained more than one series so I used len(self.df[col_name].columns) in a try:catch. Hopefully there is something cleaner.

The change is repeated twice in the code.

There is a new test for this issue and an existing test for the previous issue.

I don't think this needs a release note since it is a fix for an issue that was never released.

jtratner · 2013-11-04T01:42:15Z

pandas/io/tests/test_excel.py

@@ -941,6 +941,22 @@ def test_duplicated_columns(self):

            tm.assert_frame_equal(write_frame, read_frame)

+    def test_swapped_columns(self):


Now that you know there's an issue with dup columns, probably good to add a test case for that?

@jtratner There is a test for duplicate columns right above this one from the previous "fix".

jreback · 2013-11-04T01:45:37Z

to_csv handles this exact issue

I'll put a refenenece up

jreback · 2013-11-04T01:50:46Z

You can do almost exactly the same things, essentially you iterate on the blocks (which are dtypes separated), and use the self.column_map to map the column (which handles all of the dup issues)

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L897

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1176

jmcnamara · 2013-11-04T01:51:22Z

to_csv handles this exact issue

@jreback I had a look at that already but it didn't look to be easily transferable since to_csv is handling the data in a row column order. But maybe it is. I don't mind additional input.

As an aside, row/column order handling was something that I was interested in adding. In general the to_csv code is much simpler than the to_excel. I don't see any substantive reason for the Excel complexity.

jtratner · 2013-11-04T01:53:25Z

maybe it was an issue of compatibility with one of the readers?

jmcnamara · 2013-11-04T01:53:42Z

You can do almost exactly the same things, essentially you iterate on the blocks (which are dtypes separated), and use the self.column_map to map the column (which handles all of the dup issues)

Ok, I'm wrong, that does look applicable. I'll look at it again.

jtratner · 2013-11-04T05:13:56Z

ping me again when you feel this is ready

jmcnamara · 2013-11-04T15:47:31Z

@jreback, @jtratner

Updated with simpler fix based on the to_csv() implementation.

jmcnamara · 2013-11-04T21:04:44Z

Also, is the tm.assert_series_equal() method supposed to act like this:

import pandas as pd
import pandas.util.testing as tm

series1 = pd.Series(['a', 'b', 'c'])
series2 = pd.Series(['d', 'e', 'f'])

tm.assert_series_equal(series1, series2)
# No exception.

That sent me in circles for a while because a test that should have failed didn't.

jreback · 2013-11-04T21:34:25Z

that should assert, looks like a bug in a particular part of the checking code (which was recently changed)

jtratner · 2013-11-04T22:19:27Z

pandas/core/format.py

@@ -1373,7 +1373,8 @@ def _format_regular_rows(self):
                yield ExcelCell(self.rowcounter + idx, 0, idxval, header_style)

        for colidx in range(len(self.columns)):
-            series = self.df.iloc[:, colidx]
+            # Get series for column name accounting for dupes.
+            series = self.df.loc[:, self.columns].iloc[:, colidx]


you can move the self.df.loc[:, self.columns] out of the inner loop, right? It's not different each iteration and no point in recalculating...

jreback · 2013-11-04T22:50:16Z

@jmcnamara rebase on current master and the assert_series_equal will now work correctly....that was a bug! thanks!

jreback · 2013-11-04T22:51:34Z

In [4]: series1 = pd.Series(['a', 'b', 'c'])

In [5]: series2 = pd.Series(['d', 'e', 'f'])

In [6]: tm.assert_series_equal(series1, series2)

AssertionError: 'a' != 'd'

jmcnamara · 2013-11-04T23:43:27Z

Ok guys, thanks for that. Update includes:

Rebased to pick up test fix.
Fixed temp path in test cases.
Moving frame copy out of loop (d'oh).

jreback · 2013-11-06T00:03:06Z

@jtratner looks fine, ok to merge

jtratner · 2013-11-06T00:21:15Z

@jmcnamara - can you clarify one thing? I thought you said there was a test case for dup columns specified by the user? Shouldn't that test case be failing without this PR? If not, can you add a test that covers specified dupe columns?

Maybe I'm just missing something here.

Aside from that, this looks good to me and I will merge after I hear back.

jmcnamara · 2013-11-06T00:29:56Z

can you clarify one thing? I thought you said there was a test case for dup columns specified by the user? Shouldn't that test case be failing without this PR?

Sorry for the confusion.

Initially there was an issue reported on SO where the Excel output was failing with duplicate columns.

I wrote a test and fix for that in issue #5235 and PR #5237 but inadvertently introduced a bug for swapped columns.

This PR fixes the swapped columns bug without displacing the duplicate column fix. In fact it should fix any further issues of this kind (at least as robustly as to_csv()).

jtratner · 2013-11-06T01:44:00Z

@jmcnamara could you rebase this on top of master to make sure it merges well with your other PR (that was just merged into master)?

Added fix to ensure that Series associated with column names are handled correctly even if names are duplicated or swapped. Issue #5427

jmcnamara · 2013-11-06T21:42:45Z

Rebase complete. Good to merge.

BUG: Excel writer doesn't handle "cols" option correctly

jtratner · 2013-11-06T22:16:35Z

Thanks!

jtratner reviewed Nov 4, 2013
View reviewed changes

jreback mentioned this pull request Nov 4, 2013

BUG/TST: bug in the test apparatus when dtypes are equal in a Series but the values are not equal #5434

Merged

jtratner reviewed Nov 4, 2013
View reviewed changes

BUG: Excel writer doesn't handle "cols" option correctly

96134ba

Added fix to ensure that Series associated with column names are handled correctly even if names are duplicated or swapped. Issue #5427

jtratner added a commit that referenced this pull request Nov 6, 2013

Merge pull request #5429 from jmcnamara/bug_excel_cols_option

80d8a7a

BUG: Excel writer doesn't handle "cols" option correctly

jtratner merged commit 80d8a7a into pandas-dev:master Nov 6, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Excel writer doesn't handle "cols" option correctly #5429

BUG: Excel writer doesn't handle "cols" option correctly #5429

jmcnamara commented Nov 4, 2013

jtratner Nov 4, 2013

jmcnamara Nov 4, 2013

jtratner Nov 4, 2013

jreback commented Nov 4, 2013

jreback commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jtratner commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jtratner commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jreback commented Nov 4, 2013

jtratner Nov 4, 2013

jreback commented Nov 4, 2013

jreback commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jreback commented Nov 6, 2013

jtratner commented Nov 6, 2013

jmcnamara commented Nov 6, 2013

jtratner commented Nov 6, 2013

jmcnamara commented Nov 6, 2013

jtratner commented Nov 6, 2013

		@@ -941,6 +941,22 @@ def test_duplicated_columns(self):

		tm.assert_frame_equal(write_frame, read_frame)

		def test_swapped_columns(self):

BUG: Excel writer doesn't handle "cols" option correctly #5429

BUG: Excel writer doesn't handle "cols" option correctly #5429

Conversation

jmcnamara commented Nov 4, 2013

jtratner Nov 4, 2013

Choose a reason for hiding this comment

jmcnamara Nov 4, 2013

Choose a reason for hiding this comment

jtratner Nov 4, 2013

Choose a reason for hiding this comment

jreback commented Nov 4, 2013

jreback commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jtratner commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jtratner commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jreback commented Nov 4, 2013

jtratner Nov 4, 2013

Choose a reason for hiding this comment

jreback commented Nov 4, 2013

jreback commented Nov 4, 2013

jmcnamara commented Nov 4, 2013

jreback commented Nov 6, 2013

jtratner commented Nov 6, 2013

jmcnamara commented Nov 6, 2013

jtratner commented Nov 6, 2013

jmcnamara commented Nov 6, 2013

jtratner commented Nov 6, 2013