XML syntax error on line 213: illegal character code U+000B #11

Open
SHU-red opened this issue Jun 11, 2024 · 2 comments

Comments


SHU-red commented Jun 11, 2024

Hello,
I have been using xml2map for a while now, and it has been very robust and stable!

From the very beginning, I have prepared the string containing the XML with the following function, strings.ToValidUTF8:

Function ToValidUTF8
// ToValidUTF8 returns a copy of the string s with each run of invalid UTF-8 byte sequences
// replaced by the replacement string, which may be empty.
func ToValidUTF8(s, replacement string) string {
	var b Builder

	for i, c := range s {
		if c != utf8.RuneError {
			continue
		}

		_, wid := utf8.DecodeRuneInString(s[i:])
		if wid == 1 {
			b.Grow(len(s) + len(replacement))
			b.WriteString(s[:i])
			s = s[i:]
			break
		}
	}

	// Fast path for unchanged input
	if b.Cap() == 0 { // didn't call b.Grow above
		return s
	}

	invalid := false // previous byte was from an invalid UTF-8 sequence
	for i := 0; i < len(s); {
		c := s[i]
		if c < utf8.RuneSelf {
			i++
			invalid = false
			b.WriteByte(c)
			continue
		}
		_, wid := utf8.DecodeRuneInString(s[i:])
		if wid == 1 {
			i++
			if !invalid {
				invalid = true
				b.WriteString(replacement)
			}
			continue
		}
		invalid = false
		b.WriteString(s[i : i+wid])
		i += wid
	}

	return b.String()
}

I am executing xml2map like this:

// Convert the bytes to a string
str := string(*b)

// Strip invalid UTF-8
str = strings.ToValidUTF8(str, "")

decoder := xml2map.NewDecoder(strings.NewReader(str))
result, err := decoder.Decode()
if err != nil {
	zap.L().Error("Could not unmarshal XML", zap.Error(err), zap.String("XML", str))
	return err
}

For a few days now there has been an uncaught case, which I am having a hard time tracking down, that fails with "illegal character code U+000B".

Do you have a robust way to strip out every character xml2map has problems with?

Thanks a lot in advance
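
A note on why the existing sanitizing step does not catch this: strings.ToValidUTF8 only replaces invalid byte sequences, and U+000B (vertical tab) is perfectly valid UTF-8, so it passes through untouched. XML 1.0, however, forbids most control characters. The following is a minimal sketch of a filter for literal characters, assuming xml2map parses with the standard encoding/xml package (the format of the error message suggests this, but it is an assumption). The names isXMLChar and stripIllegalXMLChars are hypothetical helpers, not part of xml2map.

```go
package main

import (
	"fmt"
	"strings"
)

// isXMLChar reports whether r is allowed by the XML 1.0 Char production:
// tab, line feed, carriage return, and the ranges listed below.
func isXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// stripIllegalXMLChars is a hypothetical helper that drops every literal
// rune XML 1.0 forbids. It does not look at references such as &#xB;.
func stripIllegalXMLChars(s string) string {
	return strings.Map(func(r rune) rune {
		if isXMLChar(r) {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

func main() {
	in := "a\vb\x00c\n"
	// ToValidUTF8 leaves the input alone: \v and \x00 are valid UTF-8.
	fmt.Printf("%q\n", strings.ToValidUTF8(in, ""))
	// The XML filter removes them but keeps the line feed.
	fmt.Printf("%q\n", stripIllegalXMLChars(in))
}
```

This would run after strings.ToValidUTF8, not instead of it. It only covers characters that appear literally in the input; a forbidden character written as a numeric reference needs separate handling.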

Author

SHU-red commented Jun 11, 2024

I seem to have problems with the following line in my data (some content replaced):

<Value dataType="string">-0200:&#xA;kSi for max. Blabla text obfuscated.&#xBMoreText&#xA;&#xA;-0300:&#xA;kSi &gt; 5/6&#xA;</Value>
So it seems that these (HTML-encoded?) values cause the problem?

Author

SHU-red commented Jun 11, 2024

OK, as you may have guessed, this can be worked around by replacing the HTML-encoded characters before attempting to decode the XML:

// Replace HTML-encoded characters like &#xA; or &#xB;
// These cause xml2map to fail decoding
re_html := regexp.MustCompile(`&#x[A-Fa-f0-9]{0,2};`)
str = re_html.ReplaceAllString(str, "_")

I am not sure whether this is a bug or expected behavior, so I am leaving it to you to close this issue.
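
The regular expression above also replaces legal references such as &#xA; (a line feed), which changes the data. A narrower variant is sketched below: it touches only numeric references whose code point XML 1.0 forbids. The names charRef, isXMLChar and dropIllegalCharRefs are hypothetical, and the sketch assumes the references appear in ordinary text or attribute values, not inside CDATA sections, where they would be literal text.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// charRef matches numeric character references, either hexadecimal (&#xB;)
// or decimal (&#11;). Group 1 holds hex digits, group 2 decimal digits.
var charRef = regexp.MustCompile(`&#(?:x([0-9A-Fa-f]+)|([0-9]+));`)

// isXMLChar reports whether r is allowed by the XML 1.0 Char production.
func isXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// dropIllegalCharRefs is a hypothetical helper that substitutes replacement
// for every reference resolving to a forbidden code point and keeps the rest.
func dropIllegalCharRefs(s, replacement string) string {
	return charRef.ReplaceAllStringFunc(s, func(ref string) string {
		m := charRef.FindStringSubmatch(ref)
		digits, base := m[1], 16
		if digits == "" {
			digits, base = m[2], 10
		}
		n, err := strconv.ParseUint(digits, base, 32)
		if err != nil || !isXMLChar(rune(n)) {
			return replacement
		}
		return ref
	})
}

func main() {
	in := `-0200:&#xA;kSi text.&#xB;MoreText&#xA;kSi &gt; 5/6&#xA;`
	fmt.Println(dropIllegalCharRefs(in, ""))
}
```

Named entities such as &gt; are left alone because the pattern only matches references that start with &#.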
