When creating a string using UTF-8 data which begins with the UTF-8 byte order mark (`[0xEF, 0xBB, 0xBF]`), the resulting string contains a leading ZWNBSP character (U+FEFF). While it could be debated what exactly `String` should do here, it should be noted that its current behavior differs from `NSString`.
`NSString` in Objective-C Foundation has always silently stripped / ignored a leading BOM when creating a string from data, and otherwise treated the sequence as its other purpose (the ZWNBSP character) when it appears elsewhere. Stripping it loses exact round-tripping, but keeping it means that any tools not expecting this character to be at the start of the string might break (as was the case for us haha).
This also presents potential problems for code that is moving from NSString to String over time, especially in parts, as it might present an unexpected change in behavior.
For reference / posterity (apologies in advance to those reading this who already know): AFAIK, the UTF-8 BOM is pretty much unnecessary in practice, since UTF-8 doesn't have byte order in the way UTF-16 and UTF-32 do, but it is still present in files from a lot of cross-platform tools, if anything to mark that the data stream is explicitly UTF-8. Most everything I can remember encountering will silently exclude it from the actual characters if displayed in, say, a text editor.
Yeah, it looks like the new Swift implementation doesn't skip the UTF-8 BOM, but `NSString` historically has. It seems reasonable that in a higher-level API like Foundation here (as opposed to the stdlib-provided `String(decoding:as:)`, which also doesn't handle UTF-16 BOMs) we can skip a UTF-8 BOM if present.
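In the meantime, a caller-side workaround might look roughly like the sketch below. `stringSkippingUTF8BOM` is a hypothetical helper name, not a Foundation API; it only drops the three BOM bytes at the very start of the data and leaves any later U+FEFF untouched, which mirrors the `NSString` behavior described above.

```swift
import Foundation

/// Hypothetical helper (not part of Foundation): decodes UTF-8 data,
/// skipping a leading byte order mark if one is present.
func stringSkippingUTF8BOM(_ data: Data) -> String? {
    let bom: [UInt8] = [0xEF, 0xBB, 0xBF]
    // Only a BOM at the very start is dropped; the same bytes later in the
    // data still decode as U+FEFF (ZWNBSP).
    let payload = data.starts(with: bom) ? data.dropFirst(bom.count) : data
    return String(data: payload, encoding: .utf8)
}
```

The helper still returns `nil` for invalid UTF-8, since it defers to `String(data:encoding:)` for the actual decoding.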
Minimal code required for reproduction:
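A reproduction along these lines should show the difference. This is a sketch rather than the exact snippet from the report: the `"hi"` payload and the `scalars` helper are chosen purely for illustration, and the results will depend on which Foundation implementation is in use.

```swift
import Foundation

// UTF-8 BOM followed by the ASCII bytes for "hi" (illustrative payload).
let bytes: [UInt8] = [0xEF, 0xBB, 0xBF, 0x68, 0x69]
let data = Data(bytes)

// Swift-native path described in this issue.
let swiftString = String(data: data, encoding: .utf8)
// NSString path, for comparison.
let nsString = NSString(data: data, encoding: String.Encoding.utf8.rawValue)

// Hypothetical helper: lists code points so an invisible U+FEFF shows up.
func scalars(_ s: String) -> [String] {
    s.unicodeScalars.map { "U+" + String($0.value, radix: 16, uppercase: true) }
}

// If the BOM is kept, the first entry printed here is U+FEFF.
print(scalars(swiftString ?? ""))
// NSString.length counts UTF-16 code units: 3 if the BOM is kept, 2 if stripped.
print(nsString?.length ?? -1)
```

Comparing the scalar dump with the `NSString` length avoids relying on how the invisible character happens to render in a terminal or editor.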
Also apologies in advance if this has been brought up before but I failed to find the discussion!