Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data validation issues #975

Merged
merged 7 commits into from
Jul 30, 2021
Merged

Fix data validation issues #975

merged 7 commits into from
Jul 30, 2021

Conversation

Arnie97
Copy link
Contributor

@Arnie97 Arnie97 commented Jul 28, 2021

PR Details

Description

This patch escapes the drop-down list items, and changes the way to calculate the formula length.

Related Issue

Fixes #971 and #972.

Motivation and Context

Changes for #971

This patch included a XML entity mapping table, instead of simply reusing xml.EscapeText() in the standard library. This was on purpose. Disassembly of an OOXML file produced by Microsoft Excel indicated that, it implemented a different escaping rule which was more or less incompatible with xml.EscapeText(). Consider the following example:

A<,B>,C",D	,E',F

Microsoft Excel escaped it into:

<formula1>"A&lt;,B&gt;,C"",D	,E',F"</formula1>

And my initial attempt to fix #971 in commit b1faede, which employed xml.EscapeText(), produced:

<formula1>"A&lt;,B&gt;,C&#34;,D&#x9;,E&#39;,F"</formula1>

The latter was unfortunately still incompatible with Microsoft Excel, not to mention it was longer:

image

Changes for #972

This patch calculated the string length in UTF-16 code units instead of UTF-8 bytes.
Tests indicated that Microsoft Excel's limitation of data validation formula always worked in this way, even on UTF-8 dominated platforms like macOS. The following snippet was developed to help identification of the upper bound:

package main

import (
	"fmt"
	"strings"
	"unicode/utf16"

	"github.com/360EntSecGroup-Skylar/excelize/v2"
)

func main() {
	var choices = []string{
		strings.Repeat("😅", 100), // U+1F605, "SMILING FACE WITH OPEN MOUTH AND COLD SWEAT"
		strings.Repeat("漢", 55),  // U+06F22, "CJK UNIFIED IDEOGRAPH-6F22"
	}

	joined := strings.Join(choices, ",")
	runes := []rune(joined)
	codeUnits := utf16.Encode(runes)

	fmt.Printf("%4d bytes in utf-8 \n", len(joined))
	fmt.Printf("%4d units in utf-16\n", len(codeUnits))
	fmt.Printf("%4d runes in utf-32\n", len(runes))

	newWorkbookWithDropList(choices)
}

func newWorkbookWithDropList(choices []string) {
	f := excelize.NewFile()
	dv := excelize.NewDataValidation(false)
	dv.Sqref = "A:A"
	if err := dv.SetDropList(choices); err != nil {
		panic(err)
	}
	f.AddDataValidation(f.GetSheetName(0), dv)
	f.SaveAs("TestDropList.xlsx")
}

The snippet would print ...

 566 bytes in utf-8
 256 units in utf-16
 156 runes in utf-32

... and produce a corrupt file (or panic if the original length validation was not removed from excelize codebase).

image

Change the constant 55 to 54, and then the snippet would print ...

 563 bytes in utf-8
 255 units in utf-16
 155 runes in utf-32

... and produce a valid Excel workbook:

image

How Has This Been Tested

The patches was tested against:

  • Microsoft Office 365 for Mac, 16.49 (21050901)
  • Microsoft Office 365 for Windows, 2102 (Build 13801.20808)
  • Microsoft Office 2019 Professional Plus for Windows, 2105 (Build 14026.20302)
  • Microsoft Office 2013 Professional Plus for Windows, 15.0.4919.1002

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Arnie97 added 2 commits July 27, 2021 23:17
This patch included a XML entity mapping table instead of
xml.EscapeText() to be fully compatible with Microsoft Excel.
@Arnie97
Copy link
Contributor Author

Arnie97 commented Jul 28, 2021

Other thoughts about these issues:

  • I've noticed existing usages of xml.EscapeText() in stream.go, but I did not further investigate into the code to check whether their use case were compatible with Microsoft Excel, so they were left untouched.

  • I've noticed and removed some hard code integer literals 21 from the length validation code, but haven't tested this change yet. I guess it stands for:

len("<formula1></formula1>")
  • AFAIK there's no way to escape the separator commas , in the validation list. However I did not add an assertion for them either, in order to avoid breaking any existing projects, which might invoke the excelize library in the following way:
const choices = "Alpha,Beta,Gamma"
...
dv.SetDropList([]string{choices})
...

Arnie97 added 3 commits July 28, 2021 17:00
This patch changed the string length calculation unit of data
validation formulas from UTF-8 bytes to UTF-16 code units.
17 decimal significant digits should be more than enough to represent
every IEEE-754 double-precision float number without losing precision,
and numbers in this form will never reach the Excel limitation of 255
UTF-16 code units.
@Arnie97 Arnie97 marked this pull request as ready for review July 28, 2021 09:59
@xuri xuri added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 28, 2021
@xuri
Copy link
Member

xuri commented Jul 28, 2021

Thanks for your PR. I have fixed it by commit 7dbf88f
. Make formula1 and formula2 as string data type instead of the formulaEscaper, we can let the XML encoder escape characters.

@xuri
Copy link
Member

xuri commented Jul 28, 2021

Note that the value of formula in the drop list should have a quote, overwise the rule will be invalid. I closed the PR.

@xuri xuri closed this Jul 28, 2021
@Arnie97
Copy link
Contributor Author

Arnie97 commented Jul 29, 2021

@xuri While your patch looks much cleaner, the XML encoder in the standard library is NOT compatible with Microsoft Excel.
May you please read the "Motivation and Context" section above? With the current master branch 7dbf88f, the result generated for my example test case A<,B>,C",D ,E',F is still reported as a corrupt file.

@xuri xuri reopened this Jul 30, 2021
@codecov-commenter
Copy link

codecov-commenter commented Jul 30, 2021

Codecov Report

Merging #975 (c285f4e) into master (7dbf88f) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #975      +/-   ##
==========================================
- Coverage   97.44%   97.44%   -0.01%     
==========================================
  Files          31       31              
  Lines       11023    11021       -2     
==========================================
- Hits        10741    10739       -2     
  Misses        157      157              
  Partials      125      125              
Flag Coverage Δ
unittests 97.44% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
errors.go 100.00% <ø> (ø)
xmlWorksheet.go 100.00% <ø> (ø)
datavalidation.go 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dbf88f...c285f4e. Read the comment docs.

@xuri xuri merged commit 7ac37ed into qax-os:master Jul 30, 2021
@xuri
Copy link
Member

xuri commented Jul 30, 2021

@xuri While your patch looks much cleaner, the XML encoder in the standard library is NOT compatible with Microsoft Excel.
May you please read the "Motivation and Context" section above? With the current master branch 7dbf88f, the result generated for my example test case A<,B>,C",D ,E',F is still reported as a corrupt file.

Yeah, thanks for your PR, I have added error messages and minimum range checking base on this branch.

jenbonzhang pushed a commit to jenbonzhang/excelize that referenced this pull request Oct 22, 2023
* Fix `SetDropList` to allow XML special characters

* This closes qax-os#971, allow quotation marks in SetDropList()

This patch included a XML entity mapping table instead of
xml.EscapeText() to be fully compatible with Microsoft Excel.

* This closes qax-os#972, allow more than 255 bytes of validation formulas

This patch changed the string length calculation unit of data
validation formulas from UTF-8 bytes to UTF-16 code units.

* Add unit tests for SetDropList()

* Fix: allow MaxFloat64 to be used in validation range

17 decimal significant digits should be more than enough to represent
every IEEE-754 double-precision float number without losing precision,
and numbers in this form will never reach the Excel limitation of 255
UTF-16 code units.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drop-down list items are not XML escaped
3 participants