BUG: to_excel swaps order of values of duplicate columns #11007

jorisvandenbossche · 2015-09-05T18:56:36Z

On master, as a small example:

In [1]: df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], columns=['A','B','A','B'])

In [2]: df
Out[2]:
   A  B  A  B
0  1  2  3  4
1  5  6  7  8

In [4]: df.to_excel('test_excel_duplicate_columns.xlsx')

gives:

So the values of columns 2 and 3 are swapped (not the column names)

BTW, this happens with both .xlsx and .xls (openpyxl / xlsxwriter / xlwt)

Possibly related: #10982, #10970
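
A round-trip check along these lines could serve as a regression test for the report above. It is a minimal sketch, not the test that pandas ships: the helper names `roundtrip_rows` and `same_values_by_position` are hypothetical, and it assumes an Excel engine such as openpyxl is installed. Reading back with `header=None` keeps the header as an ordinary row, so cells are compared purely by position and duplicate labels never enter the lookup.

```python
import io

import pandas as pd


def roundtrip_rows(df):
    """Write df to an in-memory .xlsx file and read the raw cells back.

    Hypothetical helper; assumes an engine such as openpyxl is installed.
    """
    buf = io.BytesIO()
    df.to_excel(buf, index=False)
    buf.seek(0)
    # header=None treats the header line as a plain first row of cells
    raw = pd.read_excel(buf, header=None)
    return raw.values.tolist()


def same_values_by_position(df, rows):
    """Compare header and body cell by cell, without any label lookup."""
    expected = [list(df.columns)] + df.values.tolist()
    return rows == expected
```

With the frame from the report, `same_values_by_position(df, roundtrip_rows(df))` would be expected to return `False` while the bug is present and `True` once it is fixed.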


terrytangyuan · 2015-09-05T18:58:44Z

@jorisvandenbossche I got this error when running your new example df.to_excel('test_excel_duplicate_columns.xlsx'):

/Library/Python/2.7/site-packages/openpyxl/styles/styleable.py:111: UserWarning: Use formatting objects such as font directly
  warn("Use formatting objects such as font directly")
/Library/Python/2.7/site-packages/openpyxl/styles/__init__.py:52: UserWarning: Call to deprecated function or class copy (Copy formatting objects like font directly).
  def copy(self):
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-66313a95798a> in <module>()
----> 1 df.to_excel('test_excel_duplicate_columns.xlsx')

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep)
   1272         formatted_cells = formatter.get_formatted_cells()
   1273         excel_writer.write_cells(formatted_cells, sheet_name,
-> 1274                                  startrow=startrow, startcol=startcol)
   1275         if need_save:
   1276             excel_writer.save()

/Library/Python/2.7/site-packages/pandas/io/excel.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
    776
    777             if style_kwargs:
--> 778                 xcell.style = xcell.style.copy(**style_kwargs)
    779
    780             if cell.mergestart is not None and cell.mergeend is not None:

/Library/Python/2.7/site-packages/openpyxl/compat/__init__.pyc in new_func(*args, **kwargs)
     65                 lineno=_code.co_firstlineno + 1
     66             )
---> 67             return obj(*args, **kwargs)
     68         return new_func
     69

TypeError: copy() got an unexpected keyword argument 'font'

Do you know what went wrong?

jorisvandenbossche · 2015-09-05T18:59:22Z

as @jreback said, you have to look at the openpyxl version

terrytangyuan · 2015-09-05T19:06:00Z

Still got this error after upgrade. (I was actually using the most updated one earlier)

pd.__version__
Out[4]: '0.16.2'

openpyxl.__version__
Out[6]: '2.2.6'

jreback · 2015-09-05T19:07:20Z

yeh, that version is borked :)

jreback · 2015-09-05T19:08:31Z

latest conda version is ok, it's the pip version that is a problem (which we are not testing with .... :<)

terrytangyuan · 2015-09-05T19:09:08Z

Okay. I'll try conda then.

jorisvandenbossche · 2015-09-05T19:09:17Z

There have been some duplicate-column fixes previously: #5237, but clearly they did not solve all cases.

cc @neirbowj
cc @jtratner
cc @jmcnamara

jorisvandenbossche · 2015-09-17T07:13:01Z

Another report of this on SO: http://stackoverflow.com/questions/32592526/potential-bug-in-pandas-xlsxwriter-pd-to-excel-not-working-well

jmcnamara · 2015-09-17T08:47:14Z

I think the code causing the issue is the following (and it was introduced by me):

        # Get a frame that will account for any duplicates in the column names.
        col_mapped_frame = self.df.loc[:, self.columns]

        # Write the body of the frame data series by series.
        for colidx in range(len(self.columns)):
            series = col_mapped_frame.iloc[:, colidx]
            for i, val in enumerate(series):
                yield ExcelCell(self.rowcounter + i, colidx + coloffset, val)

If so then the loc/iloc code isn't retrieving/writing the column data in the correct order. Also, perhaps len(self.columns) should be len(col_mapped_frame).

Any suggestions on a better way to fix this?
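
One detail may be worth checking before adopting the `len(col_mapped_frame)` suggestion: `len()` of a DataFrame counts rows, not columns, so it would match the column count only by coincidence. A small sketch of the difference, using a made-up frame; `df.shape[1]` or `len(df.columns)` gives the number of columns:

```python
import pandas as pd

# Made-up frame: 3 rows, 2 columns that share the same name
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "A"])

n_rows = len(df)            # length of the index (rows)
n_cols = df.shape[1]        # number of columns, duplicates included
n_labels = len(df.columns)  # same count, taken from the column Index

print(n_rows, n_cols, n_labels)
```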

jorisvandenbossche · 2015-09-17T09:14:33Z

So the problem is of course that self.df.loc[:, self.columns] does not work with duplicate column names: every duplicated label selects all columns carrying that label.
As a small illustration:

In [33]: df = pd.DataFrame([[1,2,3]], columns=['A','B','A'])

In [34]: df
Out[34]:
   A  B  A
0  1  2  3

In [35]: df.loc[:, df.columns]
Out[35]:
   A  A  B  A  A
0  1  3  2  1  3

So a possible solution, I think, is to use self.df.loc[:, self.columns] only if the user has specified columns, and otherwise to just use df. So maybe, already in the init function, set self.df = df.loc[:, cols] if cols is not None, and self.df = df otherwise?

@jmcnamara do you have time to try to push a fix? Would be nice if we could still put this in 0.17 as this is quite a serious bug
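
The idea proposed above can be sketched as follows. This is only an illustration under stated assumptions, not the change that was eventually merged: `frame_to_write` and `body_cells` are hypothetical names, and plain `(row, col, value)` tuples stand in for the `ExcelCell` objects of the real formatter. Label-based reselection happens only when the caller passes `columns`; the body is then walked by position.

```python
import pandas as pd


def frame_to_write(df, columns=None):
    """Hypothetical helper: reselect by label only when columns are given."""
    if columns is None:
        return df
    return df.loc[:, columns]


def body_cells(df, columns=None, rowoffset=0, coloffset=0):
    """Yield (row, col, value) tuples, walking the columns by position."""
    frame = frame_to_write(df, columns)
    for colidx in range(frame.shape[1]):
        series = frame.iloc[:, colidx]  # positional, so duplicates are safe
        for i, val in enumerate(series):
            yield (rowoffset + i, colidx + coloffset, val)


df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "A", "B"])
cells = list(body_cells(df))
first_row = [val for row, col, val in cells if row == 0]
print(first_row)
```

With no `columns` argument the duplicate-named frame is never reindexed by label, so the first row comes out in its original order (1, 2, 3, 4). Passing `columns` for a frame that itself has duplicate names would still go through `loc` and is not handled by this sketch.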

jmcnamara · 2015-09-17T09:29:06Z

@jorisvandenbossche I can work on a fix at the weekend.

jreback added the IO Excel read_excel, to_excel label Sep 5, 2015

jorisvandenbossche mentioned this issue Sep 5, 2015

.to_excel() cuts off columns #10982

Closed

jorisvandenbossche added the Bug label Sep 5, 2015

jreback added this to the Next Major Release milestone Sep 8, 2015

chris-b1 mentioned this issue Oct 4, 2015

BUG: to_excel duplicate columns #11237

Merged

jreback modified the milestones: 0.17.1, Next Major Release Oct 5, 2015

jreback closed this as completed in #11237 Oct 10, 2015
