Using String(data:encoding:) to read UTF-8 that begins with a BOM includes it in the characters. #1164

Closed
logancollins opened this issue Feb 11, 2025 · 1 comment · Fixed by #1165
@logancollins

When creating a string from UTF-8 data that begins with the UTF-8 byte order mark ([0xEF, 0xBB, 0xBF]), the resulting string contains a leading ZWNBSP character (U+FEFF). While it could be debated what exactly String should do here, it should be noted that its current behavior differs from NSString.

NSString in Objective-C Foundation has always silently stripped / ignored a leading BOM when creating a string from data, and otherwise treated the sequence as its other purpose (the ZWNBSP character) when it appears elsewhere. Stripping it loses exact round-tripping, but keeping it means that any tools not expecting this character at the start of the string might break (as was the case for us haha).
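To make the NSString-style behavior described above concrete, here is a minimal sketch that drops a single leading U+FEFF from an already-decoded string and leaves later occurrences alone. The name `droppingLeadingBOM` is hypothetical and is not an existing Foundation or standard library API.

```swift
// Hypothetical helper (not a Foundation or stdlib API): drops one leading
// U+FEFF scalar and leaves any later occurrence untouched.
func droppingLeadingBOM(_ string: String) -> String {
    let bom: Unicode.Scalar = "\u{FEFF}"
    var scalars = string.unicodeScalars
    guard scalars.first == bom else { return string }
    scalars.removeFirst()
    return String(scalars)
}

// The interior U+FEFF is kept as ZWNBSP; only the leading one is removed.
let cleaned = droppingLeadingBOM("\u{FEFF}a\u{FEFF}b")
let cleanedScalars = cleaned.unicodeScalars.map { $0.value }
```

Working on `unicodeScalars` rather than `Character` avoids depending on how U+FEFF is grouped into grapheme clusters.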

This also presents potential problems for code that is being migrated from NSString to String incrementally, as the two can then behave differently within the same codebase.

For reference / posterity (apologies in advance to those reading this who already know): AFAIK, the UTF-8 BOM is pretty much unnecessary in practice, since UTF-8 doesn't have a byte order in the way UTF-16 and UTF-32 do, but it is still present in files from a lot of cross-platform tools, if only to mark the data stream as explicitly UTF-8. Most everything I can remember encountering silently excludes it from the actual characters when displaying the text in, say, a text editor.

Minimal code required for reproduction:

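import Foundation

let bytes = Data([0xEF, 0xBB, 0xBF, 0x20]) // UTF-8 BOM followed by a space
let s1 = String(data: bytes, encoding: .utf8)!
let s2 = NSString(data: bytes, encoding: String.Encoding.utf8.rawValue)! as String

s1.count // => 2 (the leading U+FEFF is kept)
s2.count // => 1 (the BOM is stripped)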
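As a possible workaround on versions that show the difference above, the BOM bytes can be removed from the data before decoding. This is only a sketch: `decodeUTF8SkippingBOM` is a hypothetical name, not a Foundation API, and it assumes the input is meant to be UTF-8.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): decodes UTF-8 data and
// skips a leading byte order mark (EF BB BF) if one is present.
func decodeUTF8SkippingBOM(_ data: Data) -> String? {
    let bom: [UInt8] = [0xEF, 0xBB, 0xBF]
    let payload = data.starts(with: bom) ? data.dropFirst(bom.count) : data
    return String(data: payload, encoding: .utf8)
}

let withBOM = decodeUTF8SkippingBOM(Data([0xEF, 0xBB, 0xBF, 0x20]))
let withoutBOM = decodeUTF8SkippingBOM(Data([0x20]))
```

Because the BOM is removed before decoding, both calls should produce the same single-space string whether or not `String(data:encoding:)` itself skips the BOM.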

Also apologies in advance if this has been brought up before but I failed to find the discussion!

@jmschonfeld jmschonfeld self-assigned this Feb 11, 2025
@jmschonfeld
Contributor

Yeah, it looks like the new Swift implementation doesn't skip the UTF-8 BOM, but NSString historically has. It seems reasonable that in a higher-level API like Foundation (as opposed to the stdlib-provided String(decoding:as:), which also doesn't handle UTF-16 BOMs) we can skip a UTF-8 BOM if present.
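For comparison, a small sketch of the stdlib behavior mentioned above: `String(decoding:as:)` decodes the BOM like any other code point, so U+FEFF stays in the result for both UTF-8 and UTF-16 input. The variable names are illustrative only.

```swift
// The stdlib initializer decodes the BOM as an ordinary scalar.
let utf8Bytes: [UInt8] = [0xEF, 0xBB, 0xBF, 0x20]
let fromUTF8 = String(decoding: utf8Bytes, as: UTF8.self)
let utf8Scalars = fromUTF8.unicodeScalars.map { $0.value }

// Same for UTF-16 code units: the leading 0xFEFF is kept, not consumed.
let utf16Units: [UInt16] = [0xFEFF, 0x0020]
let fromUTF16 = String(decoding: utf16Units, as: UTF16.self)
let utf16Scalars = fromUTF16.unicodeScalars.map { $0.value }
```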
