Ordinal Ignore Case Optimization #40910

tarekgh · 2020-08-16T22:27:18Z

This change optimizes ordinal ignore-case operations in all scenarios (e.g. String/Span Compare, StartsWith, EndsWith, IndexOf, LastIndexOf, etc.) when using ICU. NLS is treated as the baseline when comparing against ICU.

I am pasting here some perf numbers from before and after the change. The numbers were collected on a Windows machine, since the main regression shows up when switching from NLS to ICU. The change mainly targets ordinal operations, so no optimization has been done yet for linguistic operations.
Please note that in some ASCII scenarios the numbers are very close before and after the optimization. That is because we already have code that handles ASCII cases without calling the underlying NLS/ICU. You'll still notice some minor improvements there too.
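To illustrate the kind of ASCII handling mentioned above, here is a minimal sketch. It is not the code in this PR: the class and method names are hypothetical, and it only assumes that the comparison can be decided without NLS/ICU as long as every char seen so far is ASCII.

```csharp
using System;

static class AsciiIgnoreCaseSketch
{
    // Hypothetical helper. Returns true when the comparison could be decided with
    // ASCII-only logic ('equal' then holds the answer). Returns false as soon as a
    // non-ASCII char is seen, meaning the caller would fall back to NLS/ICU.
    public static bool TryEqualsAscii(ReadOnlySpan<char> a, ReadOnlySpan<char> b, out bool equal)
    {
        equal = false;
        if (a.Length != b.Length)
        {
            return true; // different lengths: decided, not equal
        }

        for (int i = 0; i < a.Length; i++)
        {
            char ca = a[i];
            char cb = b[i];

            if ((ca | cb) > 0x7F)
            {
                return false; // non-ASCII: cannot decide here
            }

            if (ca == cb)
            {
                continue;
            }

            // Fold 'a'..'z' to 'A'..'Z' and compare again.
            if ((uint)(ca - 'a') <= 'z' - 'a') ca = (char)(ca - 0x20);
            if ((uint)(cb - 'a') <= 'z' - 'a') cb = (char)(cb - 0x20);

            if (ca != cb)
            {
                return true; // decided, not equal
            }
        }

        equal = true;
        return true;
    }
}
```

A caller would use the result only when the method returns true and take the slower path otherwise.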

Also, this change includes the initial refactoring of the ordinal operations. It introduces Ordinal classes that contain the ordinal operations, but I didn't do a full refactoring, to avoid bigger code churn. Note that some scattered ordinal code has moved to the new Ordinal classes.

NLS (Baseline) 5.0.100-preview.8.20362.3

Method	Mean	Error	StdDev	Median
IndexOf_OrdinalIgnoreCase_ShortAscii	82.681 ns	1.6594 ns	2.1577 ns	82.887 ns
IndexOf_OrdinalIgnoreCase_LongAscii	833.006 ns	16.1442 ns	15.8557 ns	835.015 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	39.629 ns	0.8296 ns	1.9060 ns	39.328 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	170.314 ns	3.4508 ns	7.2789 ns	168.718 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	48.099 ns	1.0119 ns	2.8870 ns	47.298 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	47.085 ns	0.9703 ns	1.5390 ns	46.931 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	39.789 ns	0.8308 ns	2.1887 ns	39.719 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	171.161 ns	3.4428 ns	6.1195 ns	171.122 ns
Compare_OrdinalIgnoreCase_ShortAscii	11.133 ns	0.2412 ns	0.4704 ns	11.056 ns
Compare_OrdinalIgnoreCase_LongAscii	213.874 ns	4.2155 ns	6.9262 ns	212.655 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	7.497 ns	0.1839 ns	0.4112 ns	7.465 ns
StartsWith_OrdinalIgnoreCase_LongAscii	63.026 ns	1.2622 ns	3.2127 ns	62.343 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	12.651 ns	0.2852 ns	0.6496 ns	12.476 ns
EndsWith_OrdinalIgnoreCase_LongAscii	8.543 ns	0.2021 ns	0.3846 ns	8.561 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	12.190 ns	0.2766 ns	0.6889 ns	12.060 ns
Compare_OrdinalIgnoreCase_LongNonAscii	11.879 ns	0.2707 ns	0.4214 ns	11.896 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	32.850 ns	0.6560 ns	1.5843 ns	32.798 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	32.282 ns	0.6862 ns	1.5063 ns	32.089 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	12.612 ns	0.2854 ns	0.4923 ns	12.556 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	12.788 ns	0.2788 ns	0.6516 ns	12.677 ns

ICU (Baseline) 5.0.100-preview.8.20362.3

Method	Mean	Error	StdDev	Median
IndexOf_OrdinalIgnoreCase_ShortAscii	281.915 ns	5.5935 ns	9.0324 ns	282.806 ns
IndexOf_OrdinalIgnoreCase_LongAscii	4,654.658 ns	91.2758 ns	157.4462 ns	4,656.749 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	81.924 ns	1.5851 ns	3.3779 ns	82.129 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	553.243 ns	10.9822 ns	23.4040 ns	549.160 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	102.842 ns	2.0976 ns	3.7823 ns	102.124 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	103.048 ns	2.0764 ns	4.1468 ns	102.458 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	81.418 ns	1.5798 ns	2.3157 ns	81.939 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	536.305 ns	10.4910 ns	11.2252 ns	537.333 ns
Compare_OrdinalIgnoreCase_ShortAscii	12.388 ns	0.4576 ns	1.3131 ns	11.950 ns
Compare_OrdinalIgnoreCase_LongAscii	224.877 ns	4.4548 ns	9.2000 ns	224.065 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	6.940 ns	0.1724 ns	0.3785 ns	6.923 ns
StartsWith_OrdinalIgnoreCase_LongAscii	61.033 ns	1.2384 ns	2.5015 ns	60.804 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	13.519 ns	0.3038 ns	0.6205 ns	13.434 ns
EndsWith_OrdinalIgnoreCase_LongAscii	9.008 ns	0.2101 ns	0.5074 ns	8.961 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	12.963 ns	0.2912 ns	0.6392 ns	12.884 ns
Compare_OrdinalIgnoreCase_LongNonAscii	12.220 ns	0.2748 ns	0.6637 ns	12.181 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	39.288 ns	0.8146 ns	2.0586 ns	39.227 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	39.845 ns	0.8206 ns	1.7489 ns	40.118 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	13.222 ns	0.2878 ns	0.6376 ns	13.305 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	13.152 ns	0.2982 ns	0.6158 ns	12.996 ns

(Baseline) 3.1

Method	Mean	Error	StdDev	Median
IndexOf_OrdinalIgnoreCase_ShortAscii	77.048 ns	1.5508 ns	2.5910 ns	76.470 ns
IndexOf_OrdinalIgnoreCase_LongAscii	826.192 ns	16.0535 ns	15.0164 ns	824.904 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	34.301 ns	0.7206 ns	1.7265 ns	34.156 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	175.133 ns	3.4291 ns	9.4447 ns	174.532 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	41.660 ns	0.8008 ns	2.0384 ns	41.199 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	40.870 ns	0.8510 ns	1.7191 ns	40.812 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	36.558 ns	1.0939 ns	3.1563 ns	35.709 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	164.163 ns	3.2957 ns	4.1680 ns	164.756 ns
Compare_OrdinalIgnoreCase_ShortAscii	13.685 ns	0.3012 ns	0.4864 ns	13.692 ns
Compare_OrdinalIgnoreCase_LongAscii	229.758 ns	4.5842 ns	6.7195 ns	228.661 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	7.783 ns	0.1894 ns	0.5088 ns	7.689 ns
StartsWith_OrdinalIgnoreCase_LongAscii	64.775 ns	1.3240 ns	2.9613 ns	64.815 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	14.431 ns	0.3251 ns	0.7271 ns	14.368 ns
EndsWith_OrdinalIgnoreCase_LongAscii	9.639 ns	0.2264 ns	0.3247 ns	9.618 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	13.184 ns	0.2971 ns	0.6266 ns	13.174 ns
Compare_OrdinalIgnoreCase_LongNonAscii	13.214 ns	0.2667 ns	0.5265 ns	13.197 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	25.277 ns	0.5374 ns	1.0607 ns	25.334 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	24.705 ns	0.5292 ns	1.2780 ns	24.511 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	13.586 ns	0.3067 ns	0.7465 ns	13.592 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	14.096 ns	0.3219 ns	0.9441 ns	14.062 ns

ICU (After optimization)

Method	Mean	Error	StdDev
IndexOf_OrdinalIgnoreCase_ShortAscii	56.725 ns	1.1710 ns	3.3597 ns
IndexOf_OrdinalIgnoreCase_LongAscii	696.613 ns	13.6316 ns	19.5501 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	25.295 ns	0.5328 ns	1.0392 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	130.638 ns	2.5330 ns	5.1742 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	28.075 ns	0.5972 ns	1.6147 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	28.237 ns	0.5948 ns	1.2675 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	35.266 ns	0.7289 ns	1.1349 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	132.163 ns	2.6766 ns	6.6657 ns
Compare_OrdinalIgnoreCase_ShortAscii	11.097 ns	0.2520 ns	0.4413 ns
Compare_OrdinalIgnoreCase_LongAscii	225.093 ns	3.6706 ns	3.2539 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	7.167 ns	0.1742 ns	0.3096 ns
StartsWith_OrdinalIgnoreCase_LongAscii	62.781 ns	1.2765 ns	2.5494 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	11.466 ns	0.2631 ns	0.5374 ns
EndsWith_OrdinalIgnoreCase_LongAscii	7.460 ns	0.1795 ns	0.2999 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	11.297 ns	0.2583 ns	0.4387 ns
Compare_OrdinalIgnoreCase_LongNonAscii	11.295 ns	0.2573 ns	0.4369 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	12.219 ns	0.2735 ns	0.4790 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	11.957 ns	0.2694 ns	0.3688 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	11.770 ns	0.2638 ns	0.4260 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	11.856 ns	0.2644 ns	0.4116 ns

NLS (After optimization)

Method	Mean	Error	StdDev	Median
IndexOf_OrdinalIgnoreCase_ShortAscii	76.700 ns	1.5640 ns	2.6979 ns	75.863 ns
IndexOf_OrdinalIgnoreCase_LongAscii	822.702 ns	16.0500 ns	15.7633 ns	819.457 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	33.952 ns	0.7083 ns	1.6557 ns	33.791 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	163.543 ns	3.3181 ns	5.1659 ns	163.107 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	40.847 ns	0.8527 ns	1.7225 ns	40.646 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	40.291 ns	0.8005 ns	0.7862 ns	40.375 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	33.811 ns	0.7107 ns	1.8722 ns	33.311 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	162.103 ns	3.2151 ns	5.3718 ns	162.527 ns
Compare_OrdinalIgnoreCase_ShortAscii	10.816 ns	0.2495 ns	0.5783 ns	10.613 ns
Compare_OrdinalIgnoreCase_LongAscii	223.861 ns	4.4112 ns	5.8888 ns	223.382 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	7.169 ns	0.1800 ns	0.4064 ns	7.045 ns
StartsWith_OrdinalIgnoreCase_LongAscii	63.064 ns	1.3015 ns	3.2889 ns	62.302 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	11.395 ns	0.2573 ns	0.5079 ns	11.389 ns
EndsWith_OrdinalIgnoreCase_LongAscii	7.660 ns	0.1859 ns	0.4698 ns	7.545 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	11.388 ns	0.2591 ns	0.4470 ns	11.411 ns
Compare_OrdinalIgnoreCase_LongNonAscii	11.352 ns	0.2586 ns	0.5398 ns	11.254 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	11.650 ns	0.2673 ns	0.3917 ns	11.549 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	11.688 ns	0.2675 ns	0.5816 ns	11.670 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	11.763 ns	0.2623 ns	0.4084 ns	11.814 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	11.716 ns	0.2637 ns	0.2338 ns	11.678 ns

ghost · 2020-08-16T22:27:24Z

Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.

tarekgh · 2020-08-16T22:29:55Z

@safern @GrabYourPitchforks could you please help review the change here? I hope I can merge it before the deadline tomorrow. So, if you have any comments, please say whether they block this change or can be addressed later in another PR.

@GrabYourPitchforks I have changed the coding style in some methods that used goto, so I hope this is OK with you.

src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.cs

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs

src/libraries/System.Private.CoreLib/src/System/Globalization/OrdinalCasing.Icu.cs

safern · 2020-08-17T17:56:05Z

@tarekgh FYI, the runtime test failure is: #40885

tarekgh · 2020-08-17T22:03:36Z

For completeness, here are the perf numbers on my WSL Ubuntu 18.04:

Linux Baseline (3.1)

Method	Mean	Error	StdDev	Median
IndexOf_OrdinalIgnoreCase_ShortAscii	153.989 ns	3.1440 ns	9.2701 ns	154.114 ns
IndexOf_OrdinalIgnoreCase_LongAscii	2,575.618 ns	64.7196 ns	185.6927 ns	2,525.609 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	54.585 ns	1.1196 ns	2.2870 ns	54.411 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	321.643 ns	6.4673 ns	13.3560 ns	321.997 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	69.103 ns	1.6689 ns	4.8682 ns	69.025 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	67.707 ns	1.7037 ns	4.9155 ns	67.039 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	56.050 ns	1.4809 ns	4.2965 ns	56.146 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	335.489 ns	8.5171 ns	24.0227 ns	331.953 ns
Compare_OrdinalIgnoreCase_ShortAscii	32.471 ns	0.7306 ns	2.1542 ns	32.559 ns
Compare_OrdinalIgnoreCase_LongAscii	270.407 ns	6.5089 ns	19.0896 ns	269.830 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	8.564 ns	0.2140 ns	0.4874 ns	8.525 ns
StartsWith_OrdinalIgnoreCase_LongAscii	73.859 ns	1.6127 ns	4.6788 ns	72.517 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	34.015 ns	1.0407 ns	3.0192 ns	33.279 ns
EndsWith_OrdinalIgnoreCase_LongAscii	28.954 ns	0.6095 ns	1.2450 ns	28.817 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	34.650 ns	0.8253 ns	2.3547 ns	34.223 ns
Compare_OrdinalIgnoreCase_LongNonAscii	33.184 ns	0.7023 ns	1.7875 ns	33.034 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	42.865 ns	1.0150 ns	2.8793 ns	42.635 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	42.814 ns	0.9564 ns	2.7286 ns	42.802 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	34.069 ns	0.7571 ns	2.1965 ns	34.021 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	33.374 ns	0.6929 ns	1.5210 ns	33.148 ns

Linux Baseline (5.0.0-preview.8.20361.2)

Method	Mean	Error	StdDev
IndexOf_OrdinalIgnoreCase_ShortAscii	171.314 ns	3.4618 ns	9.2401 ns
IndexOf_OrdinalIgnoreCase_LongAscii	2,530.398 ns	50.1847 ns	117.3051 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	65.324 ns	1.7259 ns	4.9796 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	321.646 ns	6.4492 ns	17.3253 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	79.391 ns	2.1797 ns	6.1120 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	79.923 ns	1.7343 ns	5.0316 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	65.837 ns	1.5754 ns	4.6204 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	331.291 ns	6.7892 ns	19.8045 ns
Compare_OrdinalIgnoreCase_ShortAscii	28.217 ns	0.6025 ns	1.3475 ns
Compare_OrdinalIgnoreCase_LongAscii	250.868 ns	5.0631 ns	13.2494 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	8.355 ns	0.2060 ns	0.4608 ns
StartsWith_OrdinalIgnoreCase_LongAscii	70.660 ns	1.4495 ns	3.3305 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	27.416 ns	0.5876 ns	1.1037 ns
EndsWith_OrdinalIgnoreCase_LongAscii	25.726 ns	0.6210 ns	1.8212 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	29.755 ns	0.6429 ns	1.8549 ns
Compare_OrdinalIgnoreCase_LongNonAscii	29.212 ns	0.7063 ns	2.0038 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	35.549 ns	2.2755 ns	6.6738 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	29.127 ns	0.5807 ns	0.5431 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	22.088 ns	0.4197 ns	0.3926 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	21.528 ns	0.3856 ns	0.3220 ns

Linux with the Optimization

Method	Mean	Error	StdDev
IndexOf_OrdinalIgnoreCase_ShortAscii	55.132 ns	1.1119 ns	2.0609 ns
IndexOf_OrdinalIgnoreCase_LongAscii	673.226 ns	10.4223 ns	9.7490 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii	19.971 ns	0.4136 ns	0.3869 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii	126.101 ns	2.2821 ns	2.0230 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii	23.490 ns	0.3425 ns	0.4077 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii	22.956 ns	0.2715 ns	0.2407 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii	19.403 ns	0.1864 ns	0.1556 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii	127.550 ns	2.5765 ns	3.3502 ns
Compare_OrdinalIgnoreCase_ShortAscii	8.947 ns	0.1400 ns	0.1241 ns
Compare_OrdinalIgnoreCase_LongAscii	178.202 ns	2.3762 ns	2.1065 ns
StartsWith_OrdinalIgnoreCase_ShortAscii	6.476 ns	0.1605 ns	0.1501 ns
StartsWith_OrdinalIgnoreCase_LongAscii	56.053 ns	0.4483 ns	0.3743 ns
EndsWith_OrdinalIgnoreCase_ShortAscii	10.298 ns	0.0981 ns	0.0819 ns
EndsWith_OrdinalIgnoreCase_LongAscii	6.888 ns	0.1083 ns	0.0960 ns
Compare_OrdinalIgnoreCase_ShortNonAscii	11.199 ns	0.2417 ns	0.2482 ns
Compare_OrdinalIgnoreCase_LongNonAscii	11.133 ns	0.1970 ns	0.1645 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii	10.521 ns	0.1371 ns	0.1282 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii	10.754 ns	0.1534 ns	0.1360 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii	12.965 ns	0.2162 ns	0.1917 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii	13.210 ns	0.2505 ns	0.3429 ns

jefgen · 2020-08-17T22:35:07Z

src/libraries/Native/Unix/System.Globalization.Native/pal_casing.c

+    for (int i = 0; i < 256; i++)
+    {
+        // Unfortunately, to ensure one-to-one simple mapping we have to call u_toupper on every character.
+        // Using string casing ICU APIs cannot give such results even when using NULL locale to force root behavior.


FWIW, this is because Unicode itself doesn't have 1:1 case mapping.

Actually, if we limit the functionality to UnicodeData.txt, it will be 1:1. Yes, I understand that in general Unicode casing is not 1:1.
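A small sketch of what "1:1" means in practice, using only public .NET APIs (the class and method names are hypothetical, for illustration only). .NET's invariant uppercasing applies simple mappings, so the UTF-16 length of a string does not change, whereas a full Unicode mapping could expand 'ß' to "SS".

```csharp
using System;

static class SimpleMappingSketch
{
    // True when uppercasing keeps the UTF-16 length, as a 1:1 (simple) mapping must.
    public static bool UppercasePreservesLength(string s)
        => s.ToUpperInvariant().Length == s.Length;

    // Thin wrapper over the API discussed in this PR.
    public static bool OrdinalIgnoreCaseEquals(string a, string b)
        => string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
}
```

For example, `UppercasePreservesLength("straße")` holds, and `OrdinalIgnoreCaseEquals("straße", "STRASSE")` is false because the two strings differ in length.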

jefgen · 2020-08-17T22:37:04Z

This is really impressive @tarekgh! 👍

GrabYourPitchforks

Looks good to me! However, we're now carrying some of our own ICU data (especially with regard to surrogate handling), and with this comes the possibility that our own ICU data will become out-of-sync with the ICU data of the underlying operating system.

The first immediate consequence is that the two lines below might now return different values, depending on runtime version and underlying OS:

// assume 'foo' and 'bar' are strings or ROS<char>
bool areEqual1 = foo.ToUpperInvariant() == bar.ToUpperInvariant();
bool areEqual2 = string.Equals(foo, bar, StringComparison.OrdinalIgnoreCase);

Per MSDN, these two lines are guaranteed to produce the same result.

There was some discussion on this over at #30960, where we proposed making the string.Equals method use simple case folding semantics rather than "convert to uppercase" semantics. One of the reasons given for pushback was that it could break this contract.

We might choose to say that it's ok to break this contract and that the two lines above shouldn't be considered equal. But if we make this claim we should do so consciously and deliberately.

Second (and this can come later), we should introduce a unit test that validates the data carried by OrdinalHelper is always up-to-date with the other data like CharUnicodeData that we carry within the runtime. A unit test for this might look like the following (see here for more info).

using System.Text.Unicode;

[Fact]
public void OrdinalIgnoreCaseTestForAllChars()
{
    for (int i = 0; i < 0xD800; i++)
    {
        RunTest(i);
    }
    // skip unpaired surrogates
    for (int i = 0xE000; i <= 0x10FFFF; i++)
    {
        RunTest(i);
    }

    static void RunTest(int codePoint)
    {
        int upperCodePoint = UnicodeData.GetData(codePoint).SimpleUppercaseMapping;
        if (codePoint != upperCodePoint)
        {
            // 'codePoint' and 'upperCodePoint' should compare as case-insensitive equal

            string s1 = new Rune(codePoint).ToString();
            string s2 = new Rune(upperCodePoint).ToString();

            Assert.True(string.Equals(s1, s2, StringComparison.OrdinalIgnoreCase));
            Assert.Equal(0, string.Compare(s1, s2, StringComparison.OrdinalIgnoreCase));
        }
    }
}

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs

GrabYourPitchforks · 2020-08-17T23:00:18Z

src/libraries/System.Private.CoreLib/src/System/Globalization/OrdinalCasing.Icu.cs

+                    continue;
+                }
+
+                // we come here only if we have valid full surrogates


Nit: This comment isn't 100% correct. It's possible for the contents of the ROS<char> input to have changed between the first read (line 235) and the second read (line 261), which makes this no longer a valid surrogate pair. The ToUpperSurrogate method appears to be resilient to malformed surrogate pairs, but I wouldn't make such a solid "this is valid" statement in a comment.

(This applies to a few other places in this file, such as the pointer-based routine below.)

WOW. I like the security thinking here :-) If someone changes the buffer contents during the operation, they have already messed up anyway. I wouldn't care much about that. We may clarify the comment later, but I would avoid that, as it may suggest that changing the underlying buffer is allowed.
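For reference, one way to make a "valid pair" statement hold regardless of concurrent writes is to read both code units into locals once and validate the locals. This is only a sketch of that general pattern with hypothetical names, not what OrdinalCasing.Icu.cs does.

```csharp
using System;

static class SurrogateReadSketch
{
    // Hypothetical helper: copies both code units to locals before validating,
    // so the validation and the decoding operate on the same values.
    public static bool TryReadSurrogatePair(ReadOnlySpan<char> source, int index, out int codePoint)
    {
        codePoint = 0;
        if (index < 0 || index > source.Length - 2)
        {
            return false; // not enough room for two code units
        }

        char high = source[index];
        char low = source[index + 1];

        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
        {
            return false;
        }

        codePoint = char.ConvertToUtf32(high, low);
        return true;
    }
}
```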

GrabYourPitchforks · 2020-08-17T23:02:53Z

src/libraries/System.Private.CoreLib/src/System/Globalization/OrdinalCasing.Icu.cs

+
+        // s_casingTable is covering the Unicode BMP plane only. Surrogate casing is handled separately.
+        // Every cell in the table is covering the casing of 256 characters in the BMP.
+        // Every cell is array of 512 character for uppercasing mapping.


Worst case: this results in 26.5 KB of static cached data that never gets cleaned up during the lifetime of the application. Is this acceptable? Is it worth pointing out as a comment?

I see this as very reasonable for the worst case.

@tarekgh I do not think so. I have some IoT applications that can use only very little memory.

And do you expect these applications to perform casing on all Unicode ranges? We carry more data than that, and what I am allocating is in memory that can get swapped to the paging file. The ICU data file alone is much bigger than that anyway.
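To make the trade-off in this thread concrete, here is a minimal sketch of a lazily populated, paged casing table for the BMP. It rests on assumptions: the names are hypothetical, `char.ToUpperInvariant` stands in for the native per-character call, and each page holds 256 chars (512 bytes of payload). Pages are allocated only for ranges that are actually cased.

```csharp
using System;
using System.Threading;

static class PagedCasingTableSketch
{
    // 256 pages, each covering 256 BMP characters; pages start out null.
    private static readonly char[][] s_pages = new char[256][];

    public static int AllocatedPages
    {
        get
        {
            int count = 0;
            foreach (char[] page in s_pages)
            {
                if (page != null) count++;
            }
            return count;
        }
    }

    public static char ToUpper(char c)
    {
        int pageIndex = c >> 8;
        char[] page = Volatile.Read(ref s_pages[pageIndex]) ?? InitPage(pageIndex);
        return page[c & 0xFF];
    }

    private static char[] InitPage(int pageIndex)
    {
        var page = new char[256];
        for (int i = 0; i < 256; i++)
        {
            char c = (char)((pageIndex << 8) | i);
            // Stand-in for the native call (assumption); surrogates map to themselves.
            page[i] = char.IsSurrogate(c) ? c : char.ToUpperInvariant(c);
        }

        // Publish the page; if another thread won the race, use its page instead.
        return Interlocked.CompareExchange(ref s_pages[pageIndex], page, null) ?? page;
    }
}
```

With this layout, an application that only ever cases Latin-1 text allocates a single page, and the worst case is reached only if every BMP range is touched.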

tarekgh · 2020-08-17T23:52:55Z

@GrabYourPitchforks

Per MSDN, these two lines are guaranteed to produce the same result.

Even before this change we were not consistent. When upper/lower casing with invariant we were using u_toupper/u_tolower, but when comparing strings with invariant we were using the ICU collation APIs, which can produce different results.

We might choose to say that it's ok to break this contract and that the two lines above shouldn't be considered equal. But if we make this claim we should do so consciously and deliberately.

As I pointed out before, we were not really 100% conforming to the contract anyway. But it is a good call to update the docs to clarify the behavior. Thanks for pointing that out.

Second (and this can come later), we should introduce a unit test that validates the data carried by OrdinalHelper is always up-to-date with the other data like CharUnicodeData that we carry within the runtime. A unit test for this might look like the following (see here for more info).

Fully agree. I was already thinking of doing that but didn't have a chance to finish it. I'll track it to do later.

Last, thanks for your review and thoughts.

This is unused after #40910 and breaks building standalone System.Globalization.Native (`error G94D986AD: unused function 'AreEqualOrdinalIgnoreCase' [-Werror,-Wunused-function]`).

Ordinal Ignore Case Optimization

725bbc1

Dotnet-GitSync-Bot added the area-System.Globalization label Aug 16, 2020

tarekgh requested review from safern and GrabYourPitchforks August 16, 2020 22:30

lindexi reviewed Aug 17, 2020
