Using `String(data:encoding:)` to read UTF-8 that begins with a BOM includes it in the characters. #1164

logancollins · 2025-02-11T01:11:05Z

When creating a string from UTF-8 data that begins with the UTF-8 byte order mark ([0xEF, 0xBB, 0xBF]), the resulting string contains a leading ZWNBSP character (U+FEFF). While it could be debated what exactly String should do here, it should be noted that its current behavior differs from NSString.

NSString in Objective-C Foundation has always silently stripped / ignored a leading BOM when creating a string from data, and otherwise treated the sequence as its other purpose (the ZWNBSP character) when it appears elsewhere. Stripping it loses exact round-tripping, but keeping it means that any tools not expecting this character at the start of the string might break (as was the case for us haha).

This also presents potential problems for code that is moving from NSString to String over time, especially in parts, as it might present an unexpected change in behavior.

For reference / posterity (apologies in advance to those reading this who already know): AFAIK, the UTF-8 BOM is pretty much unnecessary in practice, since UTF-8 doesn't have byte order in the way UTF-16 and UTF-32 do, but it is still present in files from a lot of cross-platform tools, if only to mark that the data stream is explicitly UTF-8. Most everything I can remember encountering will silently exclude it from the actual characters when displaying the text in, say, a text editor.
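To make the byte sequence concrete, here is a small stdlib-only sketch showing that the three BOM bytes are just the UTF-8 encoding of U+FEFF. The variable names are mine, for illustration only.

```swift
// U+FEFF is ZERO WIDTH NO-BREAK SPACE; at the start of a stream it doubles as the BOM.
let bomScalar: Unicode.Scalar = "\u{FEFF}"

// Encoding that single scalar as UTF-8 yields the three-byte signature.
let utf8Bytes = Array(String(bomScalar).utf8)
// utf8Bytes == [0xEF, 0xBB, 0xBF]

// In UTF-16 the same scalar is a single code unit; its byte order
// on disk is what the BOM signals there.
let utf16Units = Array(String(bomScalar).utf16)
// utf16Units == [0xFEFF]
```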

Minimal code required for reproduction:

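import Foundation

let bytes = Data([0xEF, 0xBB, 0xBF, 0x20]) // UTF-8 BOM followed by a space
let s1 = String(data: bytes, encoding: .utf8)!
let s2 = NSString(data: bytes, encoding: String.Encoding.utf8.rawValue)! as String

s1.count // => 2 (leading U+FEFF plus the space)
s2.count // => 1 (BOM stripped)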
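To see what the extra character actually is, one can list the decoded scalars. This sketch swaps in the stdlib's `String(decoding:as:)` instead of `String(data:encoding:)`, on the assumption that it does no BOM handling, so the result should not depend on which Foundation version is installed. The names `reproBytes`, `decoded`, and `codePoints` are just for illustration.

```swift
// Same bytes as the repro: UTF-8 BOM followed by a space.
let reproBytes: [UInt8] = [0xEF, 0xBB, 0xBF, 0x20]

// Stdlib decoding (not the Foundation initializer from the repro).
let decoded = String(decoding: reproBytes, as: UTF8.self)

// List each Unicode scalar as an uppercase hex code point.
let codePoints = decoded.unicodeScalars.map {
    String($0.value, radix: 16, uppercase: true)
}
// codePoints == ["FEFF", "20"]
```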

Also apologies in advance if this has been brought up before but I failed to find the discussion!

jmschonfeld · 2025-02-11T18:33:46Z

Yeah, it looks like the new Swift implementation doesn't skip the UTF-8 BOM, but NSString historically has. It seems reasonable that in a higher level API like Foundation here (as opposed to the stdlib-provided String(decoding:as:), which also doesn't handle UTF-16 BOMs) we can skip a UTF-8 BOM if present.
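For anyone stuck on a Foundation version without that change, a caller-side workaround might look roughly like the sketch below. This is not the implementation from the linked PR; `stringSkippingUTF8BOM` is a hypothetical helper name, and it only assumes that `String(data:encoding:)` may or may not strip the BOM itself.

```swift
import Foundation

/// Hypothetical helper (not a Foundation API): decodes UTF-8 data,
/// dropping a leading byte order mark if one is present.
func stringSkippingUTF8BOM(_ data: Data) -> String? {
    let bom: [UInt8] = [0xEF, 0xBB, 0xBF]
    // Only a BOM at the very start counts as a signature;
    // U+FEFF anywhere else is left alone as ZWNBSP.
    let payload = data.starts(with: bom) ? data.dropFirst(bom.count) : data
    // Returns nil if the remaining bytes are not valid UTF-8.
    return String(data: payload, encoding: .utf8)
}
```

With the bytes from the repro, `stringSkippingUTF8BOM(Data([0xEF, 0xBB, 0xBF, 0x20]))?.count` comes out as 1, matching the NSString result.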

jmschonfeld self-assigned this Feb 11, 2025

jmschonfeld mentioned this issue Feb 11, 2025

Drop UTF-8 BOM when present while decoding UTF-8 bytes into String #1165

Merged

jmschonfeld closed this as completed in #1165 Feb 13, 2025
