Rethink strings #1
I need some advice here on how we wish to proceed.

HTML says "code unit" is defined by IDL as restricted to being a 16-bit integer. IDL uses "code unit" for DOMString (16-bit integer), USVString (21-bit integer), and ByteString (8-bit integer). We tend to use "code point", "code unit", and "character" in roughly the same way, even though that is not correct per Unicode. We have a special kind of string where we take a JavaScript string and combine surrogate pairs, but leave lone surrogates alone. This is what the platform displays on screen and what various JavaScript operations use.

My thinking is that we should have "JavaScript string" (DOMString), and indexing upon that goes through "code unit". A "JavaScript string" can be addressed as "string" as well (some magic casting underneath), at which point you address code points and (valid) surrogate pairs represent a single code point.

We also have a "scalar value string", and indexing upon that can go through "code unit", but that ends up meaning the same thing as "code point" (though lone surrogates cannot be found or added). We leave enforcing the validity of a "scalar value string" to its users; it's mostly for implementers and clarity. A "scalar value string" can also be addressed as "string" since it's compatible.

We keep the designation "ASCII string", which specifications can use as an optimization hint. (That's why URL uses it, for instance.) Again, addressable as "string".

We don't need "byte string", I think. The idea with ByteString is that it's an input and return value. The IDL algorithm actually operates on a byte sequence.
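A quick TypeScript sketch (mine, not part of the comment) of the two ways of addressing a JavaScript string described above, using a hypothetical string with one supplementary character: indexing goes through code units and splits the surrogate pair, while the code point view treats the pair as a single item.

```ts
// Hypothetical example string: "a" + U+1F600 (GRINNING FACE) + "b".
// The emoji is one code point but two UTF-16 code units.
const s: string = "a\u{1F600}b";

// Addressing by code unit (how a JavaScript string is indexed):
console.log(s.length);                        // 4 code units
console.log(s.charCodeAt(1).toString(16));    // "d83d" (high surrogate)

// Addressing as a "string" of code points (a valid pair is one item):
console.log([...s].length);                   // 3 code points
console.log(s.codePointAt(1)!.toString(16));  // "1f600"
```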
I don't feel terribly qualified in this area. I think it would be good if we matched Unicode as much as possible. Maybe we can continue abusing a generic term like "character", but we should use "code point" and "code unit" correctly. Your plan sounds pretty good for the different types of strings. I think we'll want to carefully spell things out in such a way that most of the time people can avoid knowing or talking about the difference. I haven't seen people index into or iterate over the units of a string in specs most of the time, but maybe some of the parsing algorithms that operate on post-decoding strings do.
I agree. The current definition of string is "a sequence of code points", but a code point is a non-negative integer. It can represent a character, but it is not a character. Quoting from the Character Model spec: [...]

However, the definition of string in TUS 9.0 Section 3.9 is: [...]

That is, strings are defined as code unit strings, instead of character strings. Personally, I like the 'character string' definition, that is, a string is viewed as a sequence of characters, each represented by a code point. Although matching Unicode as much as possible SHOULD be a goal, in this case I think the 'character string' definition is more suitable than the Unicode definition, since it has the highest level of abstraction (which ensures interoperability). Of course, we can define both 'code unit string' and 'character string', plus the other kinds of strings @annevk mentioned.
Agreed. The Character Model spec says "Specifications SHOULD NOT define a string as a 'byte string'", and their rationale seems fair enough.
@xfq That's not a correct reading of TUS or Charmod. The TUS definition you cite includes the phrase "... of a particular Unicode encoding form", which means that an encoding (UTF-8, UTF-16, UTF-32) must be defined. However... Charmod actually defines the term 'character string' itself, and that's the definition that should be inferred from the quoted requirement: [...]

What's missing from the current definition is that 'code point' means 'Unicode Scalar Value'.

@annevk The terms 'character', 'code point', and 'code unit', in my opinion, should follow Unicode and/or Charmod. In your original statement at the top of this issue, for 'JavaScript', 'byte', and 'ASCII' strings, I would use the term 'code unit' where you said 'code point', since 'code point' is essentially a synonym for 'Unicode Scalar Value'. In the case of JS, the encoding is UTF-16. In the case of 'byte string', it's usually UTF-8 (pace Encoding). ASCII string's encoding is pretty clear :-).

Generally speaking, it's usually best to refer to strings as 'character strings', as Charmod recommends, although a lot of the Web platform relies on DOM and that necessarily involves UTF-16. With the rise of emoji, there are lots of supplementary (surrogate pair in UTF-16) characters in the world, so a lot of care needs to be used in ensuring that [...]
Thanks so much for weighing in @aphillips. I hope we can end up with something you're happy with and spread it across as many web specs as possible :). One thing to note is that in the web specs space I've always seen "code unit", unqualified, as meaning UTF-16 code unit, i.e., the thing that JavaScript deals with. It sure is nice to be able to say that for brevity, and I'd kind of like to be able to keep that, but maybe it is too confusing and we should say "UTF-16 code unit" everywhere?
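To make the ambiguity concrete (an illustrative snippet of mine, with a hypothetical example string): the count you get for the "same" string depends entirely on whether you mean UTF-16 code units, code points, or bytes, which is why leaving "code unit" unqualified can mislead.

```ts
// Hypothetical example: U+1F496 (SPARKLING HEART) is one code point,
// two UTF-16 code units, and four UTF-8 bytes.
const heart = "\u{1F496}";

console.log(heart.length);                            // 2  (UTF-16 code units)
console.log([...heart].length);                       // 1  (code points)
console.log(new TextEncoder().encode(heart).length);  // 4  (UTF-8 bytes)
```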
Happy to help. I think this thread (and others like it) help illustrate the need for some precision. Like you, I generally read 'code unit' in a Web spec to mean "UTF-16 code unit", and the problem is deciding whether importing UTF-16 was intentional vs. using 'code point' (USV). As such, 'code unit' has to remain distinct from 'character'. I don't have a problem making a definition that allows us all to infer the UTF-16 part. Just need to make it a referenced definition. Otherwise folks have a way of getting sloppy and treating 'character', 'code point', and 'code unit' as the same---and getting into trouble when there's an emoji (or such) in the data.
I think using UTF-16 is a bit of a distraction, since the sequences we are dealing with do not have to match UTF-16. They are simply sequences of 16-bit integers. That is also why I would not want to interchange code point and scalar value, since we can and do have code points that are not scalar values (yay lone surrogates) within the web platform.

I still think we want JavaScript string and scalar value string. A JavaScript string is as defined in ECMAScript, including how ECMAScript defines extracting code points from it. You cannot use the term scalar value to identify items in JavaScript strings.

A scalar value string is IDL's USVString or @aphillips's "character string". It can be backed by any Unicode encoding, cannot be indexed by code unit (since we'll use that exclusively for JavaScript strings, as we have already been doing, rather than what I suggested earlier), and here code point and scalar value can be used interchangeably.

Casting a JavaScript string into a scalar value string should be easy and will cause lone surrogates to turn into U+FFFD. Basically what IDL already defines for USVString (but we'd define that in terms of this instead going forward).

The term string can be used to refer to either when it's already established what the type is.
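A minimal sketch (my own, assuming it matches the lone-surrogate-to-U+FFFD cast described above, i.e. the existing USVString conversion) of turning a JavaScript string into a scalar value string:

```ts
// Replace every lone surrogate code unit with U+FFFD, keeping valid
// surrogate pairs intact. Scalar values pass through unchanged.
function toScalarValueString(input: string): string {
  let result = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: only valid when followed by a low surrogate.
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        result += input[i] + input[i + 1];
        i++; // consume the pair
        continue;
      }
      result += "\ufffd";
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Lone low surrogate.
      result += "\ufffd";
    } else {
      result += input[i];
    }
  }
  return result;
}

// Usage: a lone high surrogate becomes U+FFFD; a valid pair is preserved.
console.log(toScalarValueString("\uD800a") === "\uFFFDa");        // true
console.log(toScalarValueString("\uD83D\uDE00") === "\u{1F600}");  // true
```

Where available, String.prototype.toWellFormed() in newer JavaScript engines performs essentially the same replacement.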
[...]
@aphillips Would you please elaborate why? What about high- and low-surrogate code points?
@xfq You're correct: I glossed over the difference and should not have.

@annevk I almost but don't quite agree. While in some ways "UTF-16" doesn't matter, the problem I have is that the term "code points" is confusing compared to the term "code units" when talking about JavaScript strings. Consider two strings: [...]

String X encodes 4 code points using 4 code units. Technically, String X encodes no (Unicode) characters, while String Y encodes 2. In certain contexts, one can say that String X encodes 4 isolated surrogates or 4 of U+FFFD. I just think that using the term code point here produces surprise where using the term code unit does not.

I'm good with the idea of JavaScript vs. scalar string. JavaScript strings (and their friends in other programming languages, such as Java) are just 16-bit integer arrays. Invalid values such as [...]
@aphillips the term code points for JavaScript is relevant when discussing a string like DC00 D800 DC00, which would have two code points and three code units. And as I said, the JavaScript standard already makes that distinction.
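Those counts can be checked directly (an illustrative snippet, not part of the thread):

```ts
// DC00 D800 DC00: a lone low surrogate followed by a valid surrogate
// pair (which decodes to U+10000).
const s = "\uDC00\uD800\uDC00";

console.log(s.length);        // 3 code units
console.log([...s].length);   // 2 code points
for (const cp of s) {
  console.log(cp.codePointAt(0)!.toString(16));  // "dc00", then "10000"
}
```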
And also surrogate code point, code unit, and cast (for strings). Fixes #1.
I need to study the various dependencies of strings and figure out what we want to do. It seems there are a couple of kinds of strings that probably need to be distinguished and named somehow: [...]