This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

localeCompare regression #25762

Closed
yaronn opened this issue Jul 23, 2015 · 10 comments

@yaronn

yaronn commented Jul 23, 2015

Hi

It seems node 0.12.x has a different behavior for localeCompare than 0.10.40.

"A".localeCompare("a")

0.12.x: 1
0.10.40: -32

the above can be fixed by using:

"A".localeCompare("a", 'en', {caseFirst: 'upper'})

but this one yields a different result and no configuration can fix it:

"a".localeCompare("Ab")

0.10.40: 32
0.12.x: -1

The magnitudes differ too, but that is less significant; the real issue is the different sign.

@srl295 srl295 added the i18n label Jul 23, 2015
@srl295
Member

srl295 commented Jul 23, 2015

You can use the ICU collation explorer to view what's happening with collation.

  • if you click here you can see a few characters collated using the English rules. Note the order: a, A, … Ab…
  • if you click here we are setting the equivalent of {caseFirst: 'upper'}. Note A before a (uppercase first), but Ab is still AFTER both A and a.

Actually, 0.10 seems to be using code point sort order, so the letters are basically in ASCII / ISO-8859-1 order. Consider this on 0.10 and 0.12:

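var n = [];
// Collect the characters with code points 1 through 254.
for (var i = 1; i < 255; i++) {
    n.push(String.fromCharCode(i));
}
// Sort with localeCompare, then print the resulting order.
n.sort(function (a, b) { return a.localeCompare(b); });
console.dir(n);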

So, don't expect localeCompare to really be locale-sensitive in 0.10; 0.12 has the correct behavior.

You should be able to get the ASCII behavior with:

 "a".localeCompare("Ab", 'en-US-u-va-posix');

… which doesn't work for me. There may be a bug there. But the original report is definitely working as designed.
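For anyone who wants to check the collation behaviour described above without the collation explorer, here is a minimal sketch using Intl.Collator. It assumes a Node build with ICU support and a modern JavaScript syntax level (neither applies to 0.10). Only the sign of each result is printed, since the magnitude is implementation specific. The variable names are just for illustration.

```javascript
// Two collators for English: ICU's default case ordering, and uppercase first.
const defaultOrder = new Intl.Collator('en');
const upperFirst = new Intl.Collator('en', { caseFirst: 'upper' });

// Print only the sign; the magnitude of a comparison result is not specified.
console.log('default,    A vs a: ', Math.sign(defaultOrder.compare('A', 'a')));
console.log('upperFirst, A vs a: ', Math.sign(upperFirst.compare('A', 'a')));

// caseFirst only breaks ties between strings that differ in case alone.
// "a" vs "Ab" differs in its letters, so "a" still sorts first.
console.log('upperFirst, a vs Ab:', Math.sign(upperFirst.compare('a', 'Ab')));
```

The last line is the point of the original report: no caseFirst setting moves "Ab" ahead of "a", because the extra letter decides the comparison before case is considered.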

@srl295 srl295 self-assigned this Jul 23, 2015
@srl295
Member

srl295 commented Jul 23, 2015

https://github.com/joyent/node/blob/v0.10.40-release/deps/v8/src/string.js#L166 says this function is "implementation specific".

@yaronn
Author

yaronn commented Jul 23, 2015

thanks @srl295 this is eye opening information!

I came to realize that I only expect ASCII characters, so the plain comparison operators (<, >, ==) seem the most adequate (and version-consistent) option.
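A minimal sketch of what that could look like as a comparator. The function name compareCodeUnits is hypothetical and is not taken from xml-crypto. The relational operators compare strings by UTF-16 code unit, so the result does not depend on the locale or on whether ICU is present.

```javascript
// Locale-independent comparator built only on the < and > operators.
// Returns -1, 0 or 1, so it can be passed straight to Array.prototype.sort.
function compareCodeUnits(a, b) {
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

console.log(compareCodeUnits('A', 'a'));  // -1, same sign as 0.10.40
console.log(compareCodeUnits('a', 'Ab')); // 1, same sign as 0.10.40
```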

@yaronn yaronn closed this as completed Jul 23, 2015
@srl295
Member

srl295 commented Jul 23, 2015

@yaronn welcome. Who needs anything but ASCII? But can you explain the use case a bit more? Perhaps there's another way to do it.

@yaronn
Author

yaronn commented Jul 24, 2015

I am writing xml-crypto (https://github.com/yaronn/xml-crypto). As part of the exclusive canonicalization algorithm I need to sort all attributes of each xml element by alphabetical order where caps are first.

@srl295
Member

srl295 commented Jul 24, 2015

@yaronn 'Canonical' and 'Locale' aren't compatible concepts… locales change, by user and over time. Consider the following example, which is working 100% correctly: (same JS, different user)

$ env LC_ALL=en node -e 'console.log("cs".localeCompare("ct"))'
-1
$ env LC_ALL=hu node -e 'console.log("cs".localeCompare("ct"))'
1
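The same effect can be sketched inside one script by passing the locale explicitly instead of relying on LC_ALL. This assumes a runtime whose ICU data includes Hungarian. Without that data the 'hu' request falls back to another locale and the two signs may match.

```javascript
// Same two strings, two locales. In Hungarian "cs" is a letter of its own
// that sorts after "c", so the result can flip relative to English.
const en = 'cs'.localeCompare('ct', 'en');
const hu = 'cs'.localeCompare('ct', 'hu');

console.log('en:', Math.sign(en));
console.log('hu:', Math.sign(hu));
```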

Re "where caps are first":

Specifically, in ASCII order "ZZZ" sorts before "aaa", with [ \ ] ^ _ ` in between Z and a.

If you mean "a" < "Ab", yes, that could be a good idea: it will be consistent no matter what.

@srl295
Member

srl295 commented Jul 24, 2015

@yaronn I followed your document (thanks for the link) and found section 2.2 of REC-xml-c14n-20010315. (emphasis mine)

Lexicographic comparison, which orders strings from least to greatest alphabetically, is based on the UCS code point values, which is equivalent to lexicographic ordering based on UTF-8.

So, using "a"<"Ab" — or even plain .sort() — should order based on UCS code point values.
Other notes:

  • Normalization to NFC is discussed in the preceding paragraph, not sure if it applies to your case or not.
  • UTF-8 is not in binary code point order. So saying "based on UTF-8" may not be helpful to implementations. EDIT: Wrong! Thanks @duerst, see below.
  • To me "lexicographic" implies ordering based on an alphabet, such as the DUCET (the root Unicode order) or a specific language's tailoring.

In all, I would not use the word lexicographic here, but simply Unicode (or UCS) code point order. Which is what "a"<"Ab" gives you.
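As a rough illustration of that ordering, here is a sketch with a made-up list of attribute names. Real canonicalization also has to take namespaces into account, which this ignores. Calling sort() with no comparator orders strings by UTF-16 code unit, which matches code point order for everything in the BMP.

```javascript
// Hypothetical attribute names, for illustration only.
const attributeNames = ['b', 'Ab', 'a', 'ZZZ', 'A', 'aaa'];

// Copy first so the original array is left untouched, then sort with the
// default ordering (no comparator, no locale involved).
const sorted = attributeNames.slice().sort();

console.log(sorted); // [ 'A', 'Ab', 'ZZZ', 'a', 'aaa', 'b' ]
```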

@yaronn
Author

yaronn commented Jul 24, 2015

Thanks @srl295 for some eye opening information :) a lot of these specs are ambiguous which is why various implementations do not always interop. Thanks!

@duerst

duerst commented Jul 25, 2015

@srl295: Re the use of "lexicographic", it's used in the general mathematical sense; see https://en.wikipedia.org/wiki/Lexicographical_order.
Also, I don't understand what you mean by "UTF-8 is not in binary code point order.". For any string of Unicode characters, encoding them as UTF-8 and using e.g. the C strcmp function will give exactly the same result as comparing UCS code point values. It's only in UTF-16 where the surrogates make things more complicated.
[Even for UTF-16, in practice, the chance that an attribute name contains a non-BMP character is pretty slim, and at the point the spec was created, it was actually 0, because up to and including XML Fourth Edition (http://www.w3.org/TR/2006/REC-xml-20060816/#NT-Letter), non-BMP characters were strictly not allowed. That only changed in the Fifth Edition (http://www.w3.org/TR/2008/REC-xml-20081126/#NT-NameStartChar).]
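To make the UTF-16 caveat concrete, here is a small sketch that assumes a modern Node.js (for Buffer and the \u{...} escape). The helper compareCodePoints is hypothetical. It compares U+FFFF with U+10000 in three ways: by UTF-16 code unit, by code point, and by UTF-8 bytes.

```javascript
// Compare two strings by Unicode code point. The string iterator yields one
// full code point at a time, so surrogate pairs are not split.
function compareCodePoints(a, b) {
  const ia = a[Symbol.iterator]();
  const ib = b[Symbol.iterator]();
  for (;;) {
    const ra = ia.next();
    const rb = ib.next();
    if (ra.done || rb.done) {
      // A string that ends first sorts first; both ending means equal.
      return ra.done === rb.done ? 0 : (ra.done ? -1 : 1);
    }
    const ca = ra.value.codePointAt(0);
    const cb = rb.value.codePointAt(0);
    if (ca !== cb) return ca < cb ? -1 : 1;
  }
}

const bmp = '\uFFFF';       // last BMP code point, one UTF-16 code unit
const astral = '\u{10000}'; // first non-BMP code point, stored as D800 DC00

// UTF-16 code unit order puts the surrogate pair first (0xD800 < 0xFFFF).
console.log(bmp < astral);                   // false
// Code point order and UTF-8 byte order agree with each other.
console.log(compareCodePoints(bmp, astral)); // -1
console.log(Buffer.compare(Buffer.from(bmp, 'utf8'), Buffer.from(astral, 'utf8'))); // -1
```

For names restricted to the BMP, all three orderings give the same answer, which is why the plain operators are enough for the case discussed here.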

@srl295
Member

srl295 commented Aug 4, 2015

@duerst Thank you for the response and clarification. You are correct of course, UTF-8 is in binary order, it is UTF-16 which is not.
