localeCompare regression #25762

yaronn · 2015-07-23T21:53:58Z

Hi

It seems node 0.12.x has different behavior for localeCompare than 0.10.40.

"A".localeCompare("a")

0.12.x: 1
0.10.40: -32

the above can be fixed by using:

"A".localeCompare("a", 'en', {caseFirst: 'upper'})

but this one yields a different result and no configuration can fix it:

"a".localeCompare("Ab")

0.10.40: 32
0.12.x: -1

The exact numbers are less significant; the real issue is the different sign.


srl295 · 2015-07-23T22:32:30Z

You can use the ICU collation explorer to view what's happening with collation.

if you click here you can see a few characters collated using the English rules. Note the order: a, A, … Ab…
if you click here we are setting the equivalent of {caseFirst: 'upper'}. Note A before a (uppercase first), but Ab is still AFTER both A and a.

Actually, 0.10 seems to be using code point sort order. So the letters are basically in ASCII / iso-8859-1 order. Consider this on .10 and .12:

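// Build an array of single-character strings for code units 1..254.
var n = [];
for (var i = 1; i < 255; i++) {
    n.push(String.fromCharCode(i));
}
// Sort with localeCompare and print the resulting order for comparison across versions.
n.sort(function (a, b) { return a.localeCompare(b); });
console.dir(n);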

So, don't expect localeCompare to really be locale-sensitive in 0.10; 0.12 has the correct behavior.

You should be able to get the ASCII behavior with:

 "a".localeCompare("Ab", 'en-US-u-va-posix');

… which doesn't work for me. There may be a bug there. But the original report is definitely working as designed.

srl295 · 2015-07-23T22:35:34Z

https://github.com/joyent/node/blob/v0.10.40-release/deps/v8/src/string.js#L166 says this function is "implementation specific".

yaronn · 2015-07-23T23:21:21Z

thanks @srl295 this is eye opening information!

I came to realize that I only expect ASCII characters, so the plain comparison operators (<, >, ===) seem the most adequate (and version-consistent) option.
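
For reference, a minimal sketch of what such a comparator could look like. The name compareCodeUnits is hypothetical and not taken from xml-crypto; it relies only on the < and > operators, which compare strings by UTF-16 code unit and ignore the locale entirely.

```javascript
// Hypothetical helper: order two strings using only < and >,
// which compare UTF-16 code units and never consult the locale.
function compareCodeUnits(a, b) {
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

console.log(compareCodeUnits("A", "a"));  // -1: uppercase sorts first
console.log(compareCodeUnits("a", "Ab")); // 1: same sign as 0.10.40 gave
```

It can be passed straight to Array.prototype.sort as the comparison function.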

srl295 · 2015-07-23T23:52:26Z

@yaronn welcome- Who needs anything but ASCII? But can you explain the use case a bit more? Perhaps there's another way to do it.

yaronn · 2015-07-24T00:05:53Z

I am writing xml-crypto (https://github.com/yaronn/xml-crypto). As part of the exclusive canonicalization algorithm I need to sort all attributes of each xml element by alphabetical order where caps are first.

srl295 · 2015-07-24T15:23:03Z

@yaronn 'Canonical' and 'Locale' aren't compatible concepts… locales change, by user and over time. Consider the following example, which is working 100% correctly: (same JS, different user)

$ env LC_ALL=en node -e 'console.log("cs".localeCompare("ct"))'
-1
$ env LC_ALL=hu node -e 'console.log("cs".localeCompare("ct"))'
1
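
The same effect can be sketched inside a single script by passing the locale explicitly. This assumes a runtime whose localeCompare honors the locales argument and ships Hungarian collation data; on a build without that data the second call may simply fall back to the default rules, so no output is claimed for it here.

```javascript
// Same strings, two explicit locales. The variable names are illustrative only.
var enResult = "cs".localeCompare("ct", "en");
// Assumption: Hungarian collation data is available in this runtime.
var huResult = "cs".localeCompare("ct", "hu");

console.log("en:", enResult);
console.log("hu:", huResult);
```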

> where caps are first

Specifically, "ZZZ" sorts before "aaa" , with [ \ ] ^ in between Z and a

If you mean using "a" < "Ab", yes, that could be a good idea; it will be consistent no matter what.

srl295 · 2015-07-24T15:40:20Z

@yaronn I followed your document (thanks for the link) and found section 2.2 of REC-xml-c14n-20010315. (emphasis mine)

Lexicographic comparison, which orders strings from least to greatest alphabetically, is based on the UCS code point values, which is equivalent to lexicographic ordering based on UTF-8.

So, using "a"<"Ab" — or even plain .sort() — should order based on UCS code point values.
Other notes:

Normalization to NFC is discussed in the preceding paragraph, not sure if it applies to your case or not.
~~UTF-8 is not in binary code point order. So saying "based on UTF-8" may not be helpful to implementations.~~ EDIT: Wrong! Thanks @duerst , below
~~To me "lexicographic" implies ordering based on an alphabet, such as the DUCET (the root Unicode order) or a specific language's tailoring.~~

~~In all, I would not use the word lexicographic here, but simply Unicode (or UCS) code point order. Which is what "a"<"Ab" gives you.~~
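
A small sketch of the plain .sort() approach, using made-up attribute names rather than anything from xml-crypto. With no comparison function, sort() orders strings by UTF-16 code unit, which matches code point order as long as all characters are in the BMP.

```javascript
// Hypothetical attribute names, chosen to show that capitals sort first.
var attrNames = ["xmlns", "id", "Id", "ZZZ", "aaa"];

// Default sort: no comparator, no locale involved.
var sortedNames = attrNames.slice().sort();

console.log(sortedNames); // [ 'Id', 'ZZZ', 'aaa', 'id', 'xmlns' ]
```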

yaronn · 2015-07-24T19:57:37Z

Thanks @srl295 for some eye opening information :) a lot of these specs are ambiguous which is why various implementations do not always interop. Thanks!

duerst · 2015-07-25T07:06:01Z

@srl295: Re the use of "lexicographic", it's used in the general mathematical sense; see https://en.wikipedia.org/wiki/Lexicographical_order.
Also, I don't understand what you mean by "UTF-8 is not in binary code point order.". For any string of Unicode characters, encoding them as UTF-8 and using e.g. the C strcmp function will give exactly the same result as comparing UCS code point values. It's only in UTF-16 where the surrogates make things more complicated.
[Even for UTF-16, in practice, the chance that an attribute name contains a non-BMP character is pretty slim, and at the point the spec was created, it was actually 0, because up to and including XML Fourth Edition (http://www.w3.org/TR/2006/REC-xml-20060816/#NT-Letter), non-BMP characters strictly were not allowed. That only changed in the Fifth Edition (http://www.w3.org/TR/2008/REC-xml-20081126/#NT-NameStartChar).]
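
To make the UTF-8 versus UTF-16 point concrete, here is a rough sketch. It assumes a Node version that provides String.prototype.codePointAt and Buffer.from (neither is available in 0.10), and compareCodePoints is a hypothetical helper, not an existing API.

```javascript
// Hypothetical helper: compare two strings by Unicode code point.
function compareCodePoints(a, b) {
  var ia = 0, ib = 0;
  while (ia < a.length && ib < b.length) {
    var ca = a.codePointAt(ia);
    var cb = b.codePointAt(ib);
    if (ca !== cb) return ca < cb ? -1 : 1;
    // Code points above U+FFFF occupy two UTF-16 code units.
    ia += ca > 0xFFFF ? 2 : 1;
    ib += cb > 0xFFFF ? 2 : 1;
  }
  // One string is a prefix of the other, or they are equal.
  if (ia < a.length) return 1;
  if (ib < b.length) return -1;
  return 0;
}

var bmpChar = "\uFF5E";          // U+FF5E, inside the BMP
var astralChar = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

// UTF-16 code unit order puts the surrogate pair first (0xD83D < 0xFF5E).
console.log(astralChar < bmpChar);                                    // true
// Code point order puts U+FF5E first.
console.log(compareCodePoints(bmpChar, astralChar));                  // -1
// Byte-wise comparison of the UTF-8 encodings agrees with code point order.
console.log(Buffer.compare(Buffer.from(bmpChar, "utf8"),
                           Buffer.from(astralChar, "utf8")));         // -1
```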

srl295 · 2015-08-04T21:23:57Z

@duerst Thank you for the response and clarification. You are correct, of course: UTF-8 is in binary code point order; it is UTF-16 which is not.

srl295 added the i18n label Jul 23, 2015

srl295 self-assigned this Jul 23, 2015

yaronn closed this as completed Jul 23, 2015

srl295 mentioned this issue Aug 4, 2015

doc: update what’s off without Intl nodejs/node#2300

Closed
