-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normative: Permit use of implementation-defined tailoring for case mapping #341
Conversation
@anba @srl295 @littledan -- can you all take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for picking up this change! The PR looks good, just a few nits. Note that another issue of permitting tailorings is #134 -- in my opinion, we should permit tailorings for currency as well.
</p> | ||
|
||
<emu-note> | ||
As of Unicode 10.0, the _availableLocales_ list contains the elements `"az"`, `"lt"`, and `"tr"`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be valid to leave in this line, modifying it to say that availableLocales must include those three (but may include others). What would you think of that change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are more locales in unicode that have case mapping than those three -- it seems misleading to only mention them? Do you think I should include a list of all of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know. I am fine with removing this list.
spec/locale-sensitive-functions.html
Outdated
@@ -65,27 +65,23 @@ <h1>String.prototype.toLocaleLowerCase ( [ _locales_ ] )</h1> | |||
1. Else, | |||
1. Let _requestedLocale_ be DefaultLocale(). | |||
1. Let _noExtensionsLocale_ be the String value that is _requestedLocale_ with all Unicode locale extension sequences (<emu-xref href="#sec-unicode-locale-extension-sequences"></emu-xref>) removed. | |||
1. Let _availableLocales_ be a List with the language tags of the languages for which the Unicode character database contains language sensitive case mappings. | |||
1. Let _availableLocales_ be a List with language tags that includes the languages for which the Unicode Character Database contains language sensitive case mappings. Implementations may add additional language tags if they support custom tailoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd delete " if they support custom tailoring". The phrasing I'd suggest, and which much of the spec tends towards, implementations maintain their own locale database, rather than talking about "tailoring".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leobalter also thought I should use language more like what you just suggested, @littledan -- but I only see the word "database" used in the spec when referring to "Unicode Character Database" or "IANA Time Zone Database". I feel like I would be confused if I saw a reference suddenly to a "implementation maintained locale database". But then again, I am new to reading this spec :)
What about not using the phrase "custom tailoring":
Implementations may add additional language tags if they support case mapping for additional locales.
spec/locale-sensitive-functions.html
Outdated
1. Let _locale_ be BestAvailableLocale(_availableLocales_, _noExtensionsLocale_). | ||
1. If _locale_ is *undefined*, let _locale_ be `"und"`. | ||
1. Let _cpList_ be a List containing in order the code points of _S_ as defined in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, starting at the first element of _S_. | ||
1. For each code point _c_ in _cpList_, if the Unicode Character Database provides a lower case equivalent of _c_ that is either language insensitive or for the language _locale_, replace _c_ in _cpList_ with that/those equivalent code point(s). | ||
1. For each code point _c_ in _cpList_, replace _c_ with a lower case equivalent of _c_ that is either language insensitive or defined according to the _locale_ specific code point mapping. Implementations must always be consistent in the case mapping they use for a specified locale. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "consistent" wording is interesting, but it could be a little confusing. Could you use phrasing that indicates, the output of this method is a function of the locale data and its inputs? E.g., "In a particular implementation, the transformation of cpList always yields the same result, for any fixed cpList and locale." (Maybe you have ideas for making that clearer.)
Note, the "for each code point" wording here is misleading; I fixed it in ECMA-262 in https://github.com/tc39/ecma262/pull/932/files , and we should probably apply the same fix in ECMA-402. (Could be a separate PR, or the same one, as you're already touching this line.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your replacement for "consistent" - I'll add it.
Do you think code points is used clearly in many places in this spec? If so can you file and issue? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not aware of other instances of this error outside of case mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I didn't understand what was misleading about the "for each code point" until now, but I see -- the mapping isn't always directly from on code point to another.
spec/locale-sensitive-functions.html
Outdated
1. Let _cuList_ be a new empty List. | ||
1. For each code point _c_ in _cpList_, in order, append to _cuList_ the elements of the UTF-16 Encoding (defined in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>) of _c_. | ||
1. Let _L_ be a String whose elements are, in order, the elements of _cuList_. | ||
1. Return _L_. | ||
</emu-alg> | ||
|
||
<p> | ||
The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it). | ||
Lower case code point mappings may be derived according to a tailored version of the Default Case Conversion Algorithms of the Unicode Standard. Implementaitons may use locale specific tailoring defined in SpecialCasings.txt and/or CLDR and/or any other custom tailoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Implementations
(I feel like it's OK to use the word "tailoring" in this section, but I'm not sure if SpecialCasings.txt is considered a tailoring or not.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Default Case Conversion Algorithm (unicode 3.13) calls it a "tailoring" so I think we are ok there.
@leobalter can you also give one more look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I really like the new wording.
spec/locale-sensitive-functions.html
Outdated
@@ -102,7 +96,7 @@ <h1>String.prototype.toLocaleUpperCase ( [ _locales_ ] )</h1> | |||
</p> | |||
|
|||
<p> | |||
This function interprets a string value as a sequence of code points, as described in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>. This function behaves in exactly the same way as `String.prototype.toLocaleLowerCase`, except that characters are mapped to their _uppercase_ equivalents as specified in the Unicode character database. | |||
This function interprets a string value as a sequence of code points, as described in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>. This function behaves in exactly the same way as `String.prototype.toLocaleLowerCase`, except that characters are mapped to their _uppercase_ equivalents. Implementations must always be consistent in the case mapping they use for a specified locale. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the consistency wording you adopted above. Would it make sense to use something similar here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
I'm not sure we can author tests for this so I'm happy to merge it after a rebase to organize the commits. |
Apparently there was never a deliberate decision to omit these tailorings, but the specification was written back when all tailorings were in SpecialCasings.txt. The new spec text permits looking at CLDR as well, matching the Unicode Standard. Closes tc39#287
Ecma402 will allow any case-mapping, as long as implementations are consistent.
This is a new PR after feedback from @littledan's original fix #291
The feedback in #299 and discussed in the 2018-11-26 meeting was to: