-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ordinal Ignore Case Optimization #40910
Conversation
@safern @GrabYourPitchforks could you please help reviewing the change here. I hope I can merge it before the deadline tomorrow. So, if you have any comment, please tell if it is blocking this change or something can be done later in other PR. @GrabYourPitchforks I have changed some of the coding style in some methods which used |
src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/OrdinalCasing.Icu.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/OrdinalCasing.Icu.cs
Outdated
Show resolved
Hide resolved
For completeness, here is the perf numbers on my WSL Ubuntu 18.04: Linux Baseline (3.1)
Linux Baseline (5.0.0-preview.8.20361.2)
Linux with the Optimization
|
for (int i = 0; i < 256; i++) | ||
{ | ||
// Unfortunately, to ensure one-to-one simple mapping we have to call u_toupper on every character. | ||
// Using string casing ICU APIs cannot give such results even when using NULL locale to force root behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this is because Unicode itself doesn't have 1:1 case mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually if limit the functionality to UnicodeData.txt, it will be 1:1. Yes, I understand in general Unicode casing is not 1:1.
This is really impressive @tarekgh! 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! However, we're now carrying some of our own ICU data (especially with regard to surrogate handling), and with this comes the possibility that our own ICU data will become out-of-sync with the ICU data of the underlying operating system.
The first immediate consequence is that the two lines below might now return different values, depending on runtime version and underlying OS:
// assume 'foo' and 'bar' are strings or ROS<char>
bool areEqual1 = foo.ToUpperInvariant() == bar.ToUpperInvariant();
bool areEqual2 = string.Equals(foo, bar, StringComparison.OrdinalIgnoreCase);
Per MSDN, these two lines are guaranteed to produce the same result.
There was some discussion on this over at #30960, where we proposed making the string.Equals
method use simple case folding semantics rather than "convert to uppercase" semantics. One of the reasons given for pushback was that it could break this contract.
We might choose to say that it's ok to break this contract and that the two lines above shouldn't be considered equal. But if we make this claim we should do so consciously and deliberately.
Second (and this can come later), we should introduce a unit test that validates the data carried by OrdinalHelper
is always up-to-date with the other data like CharUnicodeData
that we carry within the runtime. A unit test for this might look like the following (see here for more info).
using System.Text.Unicode;
[Fact]
public void OrdinalIgnoreCaseTestForAllChars()
{
for (int i = 0; i < 0xD800; i++)
{
RunTest(i);
}
// skip unpaired surrogates
for (int i = 0xE000; i <= 0x10FFFF; i++)
{
RunTest(i);
}
static void RunTest(int codePoint)
{
int upperCodePoint = UnicodeData.GetData(codePoint).SimpleUppercaseMapping;
if (codePoint != upperCodePoint)
{
// 'codePoint' and 'upperCodePoint' should compare as case-insensitive equal
string s1 = new Rune(codePoint).ToString();
string s2 = new Rune(upperCodePoint).ToString();
Assert.True(string.Equals(s1, s2, StringComparison.OrdinalIgnoreCase));
Assert.Equal(0, string.Compare(s1, s2, StringComparison.OrdinalIgnoreCase));
}
}
}
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs
Show resolved
Hide resolved
continue; | ||
} | ||
|
||
// we come here only if we have valid full surrogates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This comment isn't 100% correct. It's possible for the contents of the ROS<char>
input to have changed between the first read (line 235) and the second read (line 261), which makes this no longer a valid surrogate pair. The ToUpperSurrogate
appears is resilient to malformed surrogate pairs, but I wouldn't make such a solid "this is valid" statement in a comment.
(This applies to a few other places in this file, such as the pointer-based routine below.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WOW. I like the security thinking here :-) if someone change the buffer contents during the operation they already missed up anyway. I wouldn't care much about that. We may clarify the comment more later but I would avoid that as it may suggest allow changing the underlying buffer.
|
||
// s_casingTable is covering the Unicode BMP plane only. Surrogate casing is handled separately. | ||
// Every cell in the table is covering the casing of 256 characters in the BMP. | ||
// Every cell is array of 512 character for uppercasing mapping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worst case: this results in 26.5 KB of static cached data that never gets cleaned up during the lifetime of the application. Is this acceptable? Is it worth pointing out as a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seeing this very reasonable for the worst case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarekgh I do not think so. Because I have some IOT applications, these applications can use very little memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you expect this application will perform the casing on all Unicode ranges? We carry data more than that and what I am allocating is in the memory which can get swapped to the paging files. ICU data file only is much bigger than that anyway.
Even before the change we were not consistent either. when upper/lower casing invariant we were using u_upper/u_lower. but when comparing the strings with invariant we were using ICU collations APIs which can have different results.
As I pointed before we were not really 100% conforming to the contract anyway. But it is a good call to update the docs to clarify the behavior. Thanks for pointing at that.
Fully agree. I was already thinking to do that but didn't have chance to fully do it. I'll track that for doing it later. Last, thanks for your review and thoughts. |
This is unused after #40910 and breaks building standalone System.Globalization.Native (`error G94D986AD: unused function 'AreEqualOrdinalIgnoreCase' [-Werror,-Wunused-function]`).
This is unused after #40910 and breaks building standalone System.Globalization.Native (`error G94D986AD: unused function 'AreEqualOrdinalIgnoreCase' [-Werror,-Wunused-function]`).
The changes here is to optimize the ordinal ignore case for all scenarios (e.g. String/Span compare, StartsWith, EndsWith, IndexOf, LastIndexOf...etc.) when using ICU. We consider NLS is the baseline comparing with ICU.
I am pasting here some perf numbers before and after the change. The perf numbers collected on Windows machine as this is main regression when switch from using NLS to ICU. The changes also is mainly for ordinal operations so there is no optimization done yet for linguistic operations.
Please note in some ASCII scenarios will find the numbers very close before and after the optimization, that is because we have some code to handle ASCII cases without calling the underlying NLS/ICU. But still you'll notice some minor improvements there too.
Also, this change include the initial refactoring for the ordinal operations. Introduced Ordinal classes that contains the ordinal operations but I didn't do full refactoring to avoid bigger code churn. Notice some ordinal scattered code moved to the new Ordinal classes.
NLS (Baseline) 5.0.100-preview.8.20362.3
ICU (Baseline) 5.0.100-preview.8.20362.3
(Baseline) 3.1
ICU (After optimization)
NLS (After optimization)