Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: to_excel swaps order of values of duplicate columns #11007

Closed
jorisvandenbossche opened this issue Sep 5, 2015 · 11 comments · Fixed by #11237
Closed

BUG: to_excel swaps order of values of duplicate columns #11007

jorisvandenbossche opened this issue Sep 5, 2015 · 11 comments · Fixed by #11237
Labels
Bug IO Excel read_excel, to_excel
Milestone

Comments

@jorisvandenbossche
Copy link
Member

On master, as a small example:

In [1]: df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], columns=['A','B','A','B'])

In [2]: df
Out[2]:
   A  B  A  B
0  1  2  3  4
1  5  6  7  8

In [4]: df.to_excel('test_excel_duplicate_columns.xlsx')

gives:

capture

So the values of columns 2 and 3 are swapped (not the column names)

BTW, this happens both with .xlsx as .xls (openpyxl / xlsxwriter / xlwt)

Possibly related: #10982, #10970

@terrytangyuan
Copy link
Contributor

@jorisvandenbossche I got this bug when running your new example df.to_excel('test_excel_duplicate_columns.xlsx'):

/Library/Python/2.7/site-packages/openpyxl/styles/styleable.py:111: UserWarning: Use formatting objects such as font directly
  warn("Use formatting objects such as font directly")
/Library/Python/2.7/site-packages/openpyxl/styles/__init__.py:52: UserWarning: Call to deprecated function or class copy (Copy formatting objects like font directly).
  def copy(self):
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-66313a95798a> in <module>()
----> 1 df.to_excel('test_excel_duplicate_columns.xlsx')

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep)
   1272         formatted_cells = formatter.get_formatted_cells()
   1273         excel_writer.write_cells(formatted_cells, sheet_name,
-> 1274                                  startrow=startrow, startcol=startcol)
   1275         if need_save:
   1276             excel_writer.save()

/Library/Python/2.7/site-packages/pandas/io/excel.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
    776
    777             if style_kwargs:
--> 778                 xcell.style = xcell.style.copy(**style_kwargs)
    779
    780             if cell.mergestart is not None and cell.mergeend is not None:

/Library/Python/2.7/site-packages/openpyxl/compat/__init__.pyc in new_func(*args, **kwargs)
     65                 lineno=_code.co_firstlineno + 1
     66             )
---> 67             return obj(*args, **kwargs)
     68         return new_func
     69

TypeError: copy() got an unexpected keyword argument 'font'

Do you know what went wrong?

@jorisvandenbossche
Copy link
Member Author

as @jreback said, you have to look at the openpyxl version

@terrytangyuan
Copy link
Contributor

Still got this error after upgrade. (I was actually using the most updated one earlier)

pd.__version__
Out[4]: '0.16.2'

openpyxl.__version__
Out[6]: '2.2.6'

@jreback
Copy link
Contributor

jreback commented Sep 5, 2015

yeh, that version is borked :)

@jreback
Copy link
Contributor

jreback commented Sep 5, 2015

latest conda version is ok, its the pip version that is a problem (which we are not testing with .... :<)

@terrytangyuan
Copy link
Contributor

Okay. I'll try conda then.

@jorisvandenbossche
Copy link
Member Author

There have been some duplicate-column fixes previously: #5237, but clearly not solved all.

cc @neirbowj
cc @jtratner
cc @jmcnamara

@jreback jreback added this to the Next Major Release milestone Sep 8, 2015
@jorisvandenbossche
Copy link
Member Author

@jmcnamara
Copy link
Contributor

I think the code causing the issue the following (and was introduced by me):

        # Get a frame that will account for any duplicates in the column names.
        col_mapped_frame = self.df.loc[:, self.columns]

        # Write the body of the frame data series by series.
        for colidx in range(len(self.columns)):
            series = col_mapped_frame.iloc[:, colidx]
            for i, val in enumerate(series):
                yield ExcelCell(self.rowcounter + i, colidx + coloffset, val)

If so then the loc/iloc code isn't retrieving/writing the column data in the correct order. Also, perhaps len(self.columns) should be len(col_mapped_frame).

Any suggestions on a better way to fix this?

@jorisvandenbossche
Copy link
Member Author

So the problem is of course that the self.df.loc[:, self.columns] is not working with duplicate column names.
As a small illustration:

In [33]: df = pd.DataFrame([[1,2,3]], columns=['A','B','A'])

In [34]: df
Out[34]:
   A  B  A
0  1  2  3

In [35]: df.loc[:, df.columns]
Out[35]:
   A  A  B  A  A
0  1  3  2  1  3

So a possible solution I think, is to only use self.df.loc[:, self.columns] if the user has specified columns, and otherwise just to use df. So maybe already in the init function to set self.df = df.loc[:, cols] if cols is not None and otherwise as self.df = df ?

@jmcnamara do you have time to try to push a fix? Would be nice if we could still put this in 0.17 as this is quite a serious bug

@jmcnamara
Copy link
Contributor

@jorisvandenbossche I can work on a fix at the weekend.

@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 5, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants