Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dealing with invalid xml tag ("<br>" without closing) in vml comments #1181

Closed
wants to merge 9 commits into from

Conversation

rdoering
Copy link

@rdoering rdoering commented Oct 1, 2019

This is:

- [x] a bugfix
- [ ] a new feature

Checklist:

Why this change is needed?

  • Excel writes nie-lines VML-comments as "
    "-tag without a closing "<\br>"-tag
  • This is no valid xml code
  • therefore simplexml_load_string() is failing and phpShreadsheet is failing to open the file

rdoering and others added 5 commits April 17, 2019 09:20
Excel writes a new lineas "<br>"-tag without closing tag. That causes simplexml_load_string() to fail.
� Conflicts:
�	src/PhpSpreadsheet/Reader/Xlsx.php
… workbook names like "workbook2.xml", which are fine for excel"
@rdoering
Copy link
Author

rdoering commented Oct 2, 2019

If the missing php 7.4 support is a problem, I could try to fix it.

@rdoering
Copy link
Author

Any questions?

@rdoering
Copy link
Author

Is here any progress?

@rdoering
Copy link
Author

rdoering commented Nov 4, 2019

Travis is only failing on PHP 7.4.

@rdoering
Copy link
Author

Any news here?

Copy link
Member

@PowerKiKi PowerKiKi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improvement on workbook names would be welcome but str_replace'ing things in XML is a very dangerous slope. A different approach should be used or be removed entirely.

@rdoering
Copy link
Author

May I should change the name of the second method.

@rdoering rdoering requested a review from PowerKiKi November 22, 2019 16:22
@rdoering
Copy link
Author

rdoering commented Dec 8, 2019

Please don't hesitate to contact me, if there still is a change requested.

@rdoering
Copy link
Author

Do you need more information?

@rdoering
Copy link
Author

Please let me know, if I could help to push this topic forward.

@rdoering
Copy link
Author

rdoering commented Mar 2, 2020

@PowerKiKi Is there an issue I can help you with?

@rdoering
Copy link
Author

rdoering commented Mar 9, 2020

I would like to support you with this issue.

@rdoering
Copy link
Author

I am already here :-)

@rdoering
Copy link
Author

Anything new here?

@rdoering
Copy link
Author

Please tell me, if I can help you.

Copy link
Member

@PowerKiKi PowerKiKi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long wait, but I don't think this should be merged. This add a second, very different, way of loading files, which may or may not open up a lot of new issues depending on which way the file was loaded.

IMHO this is going too far only to allow to load an invalid file. I'd suggest fixing invalid files before loading them with PhpSpreadsheet, or maintain your fork if you encounter many of those files.

I'll ask request @MarkBaker opinion on the matter though. Maybe he thinks differently...

@PowerKiKi PowerKiKi requested a review from MarkBaker April 26, 2020 02:46
@PowerKiKi PowerKiKi added the reader/xlsx Reader for MS OfficeOpenXML-format (xlsx) spreadsheet files label Apr 26, 2020
@rdoering
Copy link
Author

@PowerKiKi
Thanks for reviewing the code.

The difficulty was, that the function simplexml_load_string() wasn't throw an exception in default config. That is why we added the new wrapper method simplexml_load_string().

If the method simplexml_load_string() is fails by calling the same function simplexml_load_string() coded as before, then and only then we try to parse the code as HTML-Code instead of pure XML.

$dom = new DOMDocument; $rc = $dom->loadHTML($string, $options);

Unfortunately we rethrow the exception. Thats why you are right, the new code could effect the behaviour and we should remove the rethrow-statement.

IMHO this is going too far only to allow to load an invalid file. I'd suggest fixing invalid files before loading them with PhpSpreadsheet, or maintain your fork if you encounter many of those files.

Such Excel-files with new-lines in a VML commands are written by Excel 2013 and not really invalid. A VML command contains HTML-code instead pure XML-code. This is mostly the same but not at all. (1)

(1) - https://tools.ietf.org/html/rfc7992#section-9.12

@PowerKiKi
Copy link
Member

then and only then we try to parse the code as HTML

That's my problem, one time we parse as XML, but sometimes we parse as HTML. This might create new issues that might be hard to resolve.

Also Excel 2013 is EOL since 2018-10-04 for most people. I don't think it is worth adding a risk to our codebase for something that will naturally disappear.

But again, I'll let @MarkBaker have the final word on this.

@rdoering
Copy link
Author

rdoering commented Jun 9, 2020

Any news here?

@rdoering
Copy link
Author

Don't hasited to ask.

@rdoering
Copy link
Author

Any progress on this front?

@PowerKiKi
Copy link
Member

I know this might be frustrating after all the effort you put into this, but I'll close this PR, to avoid giving hope that it might get merged.

The technical solution suggested in this PR brings too much risk and potential maintenance nightmare.

Again, if @MarkBaker has a different opinion, I'll let him re-open/merge it. But for now I'll reject it.

@PowerKiKi PowerKiKi closed this Jun 28, 2020
@rdoering
Copy link
Author

Hi @PowerKiKi @MarkBaker

I totally agree, it is a big step to switch form parsing comments as HTML instead of XML but, if Microsoft is writing HTML-comments, IMHO it seems to be the right approach :-).

The only risk I see is to not fail to load a file with a malformed HTML-comment, if PHP-method DOMDocument::loadHTML has is buggy. The reason for this is:

  • The new approach is limited to VML-handling, it only comes to play if simplexml_load_string() is failing while loading comments.
  • We currently processed tons of file with applied changes, without any error

If you see a potential maintenance nightmare, I would like to support you with this.

Maybe it would be better create an issue-ticket first to get a better view about what how problem looks like?

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this pull request Oct 19, 2022
Fix PHPOffice#3125 (rejected previous PR PHPOffice#1181 as too risky, issue was also reported as PHPOffice#170). Vml file should be valid Xml, but Excel (possibly just Excel 2013) can generate unclosed `<br>` tags, preventing Xlsx reader from reading file correctly. I believe a very narrowly targeted fix, changing `<br>` to `<br/>`, and only when reading vml files, probably mitigates the risk. I am going to leave this PR as a draft for a while to see if others disagree.

Another reason to leave it as a draft is the absence of a decent test file. I am able to copy an existing test spreadsheet which contains a vml file, and add some text and the problematic tag, and I can confirm that it fails to load correctly without the fix but loads correctly with the fix. However, it would be better if I had a real file. I cannot figure out how to generate a file like this "naturally"; when I add a textbox to a spreadsheet, it is stored as a regular xml file, not as vml, in both Excel 365 and 2007. Also, the text that I added in the vml file doesn't show anywhere when I open the file in Excel, so I don't know whether my file is a normal case for this condition. And, incidentally, when I load the file in PhpSpreadsheet and save it, the textbox data is not in the resulting output file; this is clearly not a problem in the particular case I constructed, but I don't know if that's true in general. If not, that would be a different problem than the one I'm fixing, but I would prefer to resolve them both at the same time.
oleibman added a commit to oleibman/PhpSpreadsheet that referenced this pull request Oct 20, 2022
This is a replacement for draft PR PHPOffice#2455 and draft PR PHPOffice#3127. There is some useful commentary in those PRs which I have mostly, but not entirely, duplicated below. Fix PHPOffice#2396. Fix PHPOffice#1770. Fix PHPOffice#2388.

A related problem is that the vml files used for the form controls sometimes contain invalid xml. Fix PHPOffice#3125 (rejected previous PR PHPOffice#1181 as too risky, issue was also reported as PHPOffice#170). Vml file should be valid Xml, but Excel can generate unclosed `<br>` tags, preventing Xlsx reader from reading file correctly. I believe a very narrowly targeted fix, changing `<br>` to `<br/>`, and only when reading vml files, probably mitigates the risk. The sample file formscomments.xlsx which is part of this change shows this problem with `Button 1` on sheet `Forms`; the spreadsheet was created with Excel 365, so the problem is not restricted to Excel 2013 as originally reported. A comment on PR 3127 indicates that other tags might be involved, but, without a file demonstrating that, I will restrict this change to br tags for now.

I am starting this out in draft status, and will probably leave it that way for some time. I'm not sure where we want to go with this. It fixes some problems, but in a limited manner, and creates some others. I'm not sure the pain of the others is balanced considering the limitations of the fix. If enough interest is generated as a result of this ticket being out there, we can proceed; if not, it probably isn't worth it.

This fix allows form control elements to be read in and written out. It does not allow you to add such elements, nor even to locate them or determine their properties (so you can't modify or delete them). Although it handles reading and writing of sheets containing both form controls and comments, it will probably create a corrupt spreadsheet if you try adding a new comment to a sheet with form controls - probably quite difficult to solve. Cloning the sheet probably won't work either - probably easier than the other. It is conceivable that we want to add a new property to the Xlsx Reader which turns the reading of form elements on or off (default=off), so that negative effects will be limited to those who have explictly opted in. The change in its current form does not implement such a property.

Because of its limitations, the change isn't really testable. As in some other recent installs, I have added a sample to demonstrate that it works correctly.

As it turns out, if we have a worksheet which contains both form controls and comments (see formscomments.xlsx which is part of this PR), PhpSpreadsheet already creates a corrupt file when it tries to load and save the spreadsheet with such a worksheet. With this change, the file is saved without corruption. This tilts things in favor of proceeding. I'm still not ready, but this will be an important consideration.

A sample file for issue PHPOffice#2621 illustrated a problem with shape files. Since they are involved here, I took a look at how the sample worked with this code. In master, and with this change, a corrupt file results. Fixing that is probably easier than the general problem of handling shape files, but it's an argument against moving this forward until the corruption problem can be addressed.

Fix PHPOffice#2661. A template including checkboxes was leading to file corruption solved by this PR. Another argument for moving forward.
oleibman added a commit that referenced this pull request Dec 28, 2022
* WIP Limited Support for Form Controls V2 (ListBox, Buttons, etc.)

This is a replacement for draft PR #2455 and draft PR #3127. There is some useful commentary in those PRs which I have mostly, but not entirely, duplicated below. Fix #2396. Fix #1770. Fix #2388.

A related problem is that the vml files used for the form controls sometimes contain invalid xml. Fix #3125 (rejected previous PR #1181 as too risky, issue was also reported as #170). Vml file should be valid Xml, but Excel can generate unclosed `<br>` tags, preventing Xlsx reader from reading file correctly. I believe a very narrowly targeted fix, changing `<br>` to `<br/>`, and only when reading vml files, probably mitigates the risk. The sample file formscomments.xlsx which is part of this change shows this problem with `Button 1` on sheet `Forms`; the spreadsheet was created with Excel 365, so the problem is not restricted to Excel 2013 as originally reported. A comment on PR 3127 indicates that other tags might be involved, but, without a file demonstrating that, I will restrict this change to br tags for now.

I am starting this out in draft status, and will probably leave it that way for some time. I'm not sure where we want to go with this. It fixes some problems, but in a limited manner, and creates some others. I'm not sure the pain of the others is balanced considering the limitations of the fix. If enough interest is generated as a result of this ticket being out there, we can proceed; if not, it probably isn't worth it.

This fix allows form control elements to be read in and written out. It does not allow you to add such elements, nor even to locate them or determine their properties (so you can't modify or delete them). Although it handles reading and writing of sheets containing both form controls and comments, it will probably create a corrupt spreadsheet if you try adding a new comment to a sheet with form controls - probably quite difficult to solve. Cloning the sheet probably won't work either - probably easier than the other. It is conceivable that we want to add a new property to the Xlsx Reader which turns the reading of form elements on or off (default=off), so that negative effects will be limited to those who have explictly opted in. The change in its current form does not implement such a property.

Because of its limitations, the change isn't really testable. As in some other recent installs, I have added a sample to demonstrate that it works correctly.

As it turns out, if we have a worksheet which contains both form controls and comments (see formscomments.xlsx which is part of this PR), PhpSpreadsheet already creates a corrupt file when it tries to load and save the spreadsheet with such a worksheet. With this change, the file is saved without corruption. This tilts things in favor of proceeding. I'm still not ready, but this will be an important consideration.

A sample file for issue #2621 illustrated a problem with shape files. Since they are involved here, I took a look at how the sample worked with this code. In master, and with this change, a corrupt file results. Fixing that is probably easier than the general problem of handling shape files, but it's an argument against moving this forward until the corruption problem can be addressed.

Fix #2661. A template including checkboxes was leading to file corruption solved by this PR. Another argument for moving forward.

* Improved Sample File, and Documentation

Add more realistic worksheet to spreadsheet. Document new feature, adding caveats to how it can be used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reader/xlsx Reader for MS OfficeOpenXML-format (xlsx) spreadsheet files
Development

Successfully merging this pull request may close these issues.

3 participants