Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normative: Permit use of implementation-defined tailoring for case mapping #341

Merged
merged 2 commits into from
May 2, 2019

Conversation

spectranaut
Copy link

This is a new PR after feedback from @littledan's original fix #291

The feedback in #299 and discussed in the 2018-11-26 meeting was to:

  1. Allow any custom tailoring, not only the Unicode Database.
  2. Mandate the implementations consistently use the same case mapping for the same locale.

@spectranaut
Copy link
Author

@anba @srl295 @littledan -- can you all take a look?

Copy link
Member

@littledan littledan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for picking up this change! The PR looks good, just a few nits. Note that another issue of permitting tailorings is #134 -- in my opinion, we should permit tailorings for currency as well.

</p>

<emu-note>
As of Unicode 10.0, the _availableLocales_ list contains the elements `"az"`, `"lt"`, and `"tr"`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be valid to leave in this line, modifying it to say that availableLocales must include those three (but may include others). What would you think of that change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are more locales in unicode that have case mapping than those three -- it seems misleading to only mention them? Do you think I should include a list of all of them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. I am fine with removing this list.

@@ -65,27 +65,23 @@ <h1>String.prototype.toLocaleLowerCase ( [ _locales_ ] )</h1>
1. Else,
1. Let _requestedLocale_ be DefaultLocale().
1. Let _noExtensionsLocale_ be the String value that is _requestedLocale_ with all Unicode locale extension sequences (<emu-xref href="#sec-unicode-locale-extension-sequences"></emu-xref>) removed.
1. Let _availableLocales_ be a List with the language tags of the languages for which the Unicode character database contains language sensitive case mappings.
1. Let _availableLocales_ be a List with language tags that includes the languages for which the Unicode Character Database contains language sensitive case mappings. Implementations may add additional language tags if they support custom tailoring.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd delete " if they support custom tailoring". The phrasing I'd suggest, and which much of the spec tends towards, implementations maintain their own locale database, rather than talking about "tailoring".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leobalter also thought I should use language more like what you just suggested, @littledan -- but I only see the word "database" used in the spec when referring to "Unicode Character Database" or "IANA Time Zone Database". I feel like I would be confused if I saw a reference suddenly to a "implementation maintained locale database". But then again, I am new to reading this spec :)

What about not using the phrase "custom tailoring":
Implementations may add additional language tags if they support case mapping for additional locales.

1. Let _locale_ be BestAvailableLocale(_availableLocales_, _noExtensionsLocale_).
1. If _locale_ is *undefined*, let _locale_ be `"und"`.
1. Let _cpList_ be a List containing in order the code points of _S_ as defined in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, starting at the first element of _S_.
1. For each code point _c_ in _cpList_, if the Unicode Character Database provides a lower case equivalent of _c_ that is either language insensitive or for the language _locale_, replace _c_ in _cpList_ with that/those equivalent code point(s).
1. For each code point _c_ in _cpList_, replace _c_ with a lower case equivalent of _c_ that is either language insensitive or defined according to the _locale_ specific code point mapping. Implementations must always be consistent in the case mapping they use for a specified locale.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "consistent" wording is interesting, but it could be a little confusing. Could you use phrasing that indicates, the output of this method is a function of the locale data and its inputs? E.g., "In a particular implementation, the transformation of cpList always yields the same result, for any fixed cpList and locale." (Maybe you have ideas for making that clearer.)

Note, the "for each code point" wording here is misleading; I fixed it in ECMA-262 in https://github.com/tc39/ecma262/pull/932/files , and we should probably apply the same fix in ECMA-402. (Could be a separate PR, or the same one, as you're already touching this line.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your replacement for "consistent" - I'll add it.

Do you think code points is used clearly in many places in this spec? If so can you file and issue? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of other instances of this error outside of case mapping.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I didn't understand what was misleading about the "for each code point" until now, but I see -- the mapping isn't always directly from on code point to another.

1. Let _cuList_ be a new empty List.
1. For each code point _c_ in _cpList_, in order, append to _cuList_ the elements of the UTF-16 Encoding (defined in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>) of _c_.
1. Let _L_ be a String whose elements are, in order, the elements of _cuList_.
1. Return _L_.
</emu-alg>

<p>
The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it).
Lower case code point mappings may be derived according to a tailored version of the Default Case Conversion Algorithms of the Unicode Standard. Implementaitons may use locale specific tailoring defined in SpecialCasings.txt and/or CLDR and/or any other custom tailoring.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Implementations

(I feel like it's OK to use the word "tailoring" in this section, but I'm not sure if SpecialCasings.txt is considered a tailoring or not.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Default Case Conversion Algorithm (unicode 3.13) calls it a "tailoring" so I think we are ok there.

@leobalter leobalter added the has consensus Has consensus from TC39-TG2 label Apr 19, 2019
@spectranaut spectranaut requested a review from littledan April 25, 2019 16:23
@spectranaut
Copy link
Author

@leobalter can you also give one more look?

Copy link
Member

@littledan littledan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I really like the new wording.

@@ -102,7 +96,7 @@ <h1>String.prototype.toLocaleUpperCase ( [ _locales_ ] )</h1>
</p>

<p>
This function interprets a string value as a sequence of code points, as described in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>. This function behaves in exactly the same way as `String.prototype.toLocaleLowerCase`, except that characters are mapped to their _uppercase_ equivalents as specified in the Unicode character database.
This function interprets a string value as a sequence of code points, as described in ES2020, <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>. This function behaves in exactly the same way as `String.prototype.toLocaleLowerCase`, except that characters are mapped to their _uppercase_ equivalents. Implementations must always be consistent in the case mapping they use for a specified locale.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the consistency wording you adopted above. Would it make sense to use something similar here as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@leobalter
Copy link
Member

I'm not sure we can author tests for this so I'm happy to merge it after a rebase to organize the commits.

littledan and others added 2 commits May 2, 2019 11:56
Apparently there was never a deliberate decision to omit these
tailorings, but the specification was written back when all tailorings
were in SpecialCasings.txt. The new spec text permits looking at CLDR
as well, matching the Unicode Standard.

Closes tc39#287
Ecma402 will allow any case-mapping, as long as implementations
are consistent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has consensus Has consensus from TC39-TG2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants