UCRTbase.dll toupper() is 133x slower wall time than perl/msvcrt.dll #23037
Comments
https://bugs.python.org/issue35195 Python identified this problem in 2018. The Python ticket remains open as of Feb 2025. I don't know enough about the architecture/API/design to follow all the comments in the CPython tickets, or to tell whether those 2 tickets contain a proposed fix, a rejected fix, or an unfairly rejected fix.
UCRT works; many bugs went away when we converted to use it. |
I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:
Fixing all this would eliminate the toupper()/isupper() calls. I don't know off-hand what the appropriate Win32 API would be.
Forgot to add in the OP. Since 5.37.10 and commit 8a548d1 The P5P repo's .t files , and less so CPAN, will call Line 857 in 16196ae
make test . Copy pasted from a GH runner, blead perl has 1.2 million tests.
100K * 4 ms = 6.6 minutes faster core

55 milliseconds is 1.7 frames at 30 frames per second. Blead perl currently has a 33,000 OP*s executed timer before the first time it polls the Win32 GUI loop. It's crazy: "link av.obj hv.obj perl.obj /delayload:user32.dll -o perl541.dll" really helps with blead perl core self

So the question now is, does WinPerl selectively replace cherry-picked, problematic, slow libc calls in

Because perl.exe has the choice of which one to call at runtime, they both are available at all times inside a perl process. The call stacks, profiler reports, and my benchmarks show a multiple-orders-of-magnitude performance difference between 2 different implementations of the exact same C standard lib function.

Next question, why is WinPerl even C linking against MS's

Would slurping/looping U8 values 0x00-0xFF, 1x on process start, through MS UCRT's

Nobody can justify enumerating all 250 country codes on earth in a SQL DB/for loop+

You can't upper case an ASCII string by, for each 8 bit character, posting a new job ad on LinkedIn, interviewing and hiring a new developer, and agreeing on a consulting contract and fee schedule. He reads the ASCII char and writes with a pen, 01000001, and hands you the paper with 01000001 written on it, and you hand him a check for $500, and his employment at your company terminates. He was paid $500 for 15-25 seconds of work. Great company to work for. 5 stars employer. That's what UCRT is doing internally.

3rd possible fix, the most difficult fix, which is beyond my expertise, figure out why

The API docs for

So did perl.exe/perl5xx.dll/perl5porters do something wrong and explicitly disable the cache logic inside ucrtbase.dll? Or is this a bug inside ucrtbase.dll, which only Microsoft can fix, so that a member of the public must file a public bug ticket with MS, and MS devs must recompile and publish a new higher build number of ucrtbase.dll? Beyond scope for me to diag this. IDK enough.
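The idea floated above of running every U8 value 0x00-0xFF through the CRT once at process start could be sketched roughly like this. It is only an illustration, not what perl does today: `upper_table`, `upper_table_init()` and `fast_toupper()` are hypothetical names, the lazy init is not thread-safe, and the table goes stale if LC_CTYPE changes afterwards, so it would have to be rebuilt after any `setlocale()` call.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical cache: one slot per possible U8 value. */
static unsigned char upper_table[256];
static int upper_table_ready = 0;

/* Ask the C runtime about each of the 256 byte values exactly once.
 * Must be re-run if the LC_CTYPE locale changes (assumption: the caller
 * is responsible for that). */
void upper_table_init(void)
{
    for (int c = 0; c < 256; c++)
        upper_table[c] = (unsigned char)toupper(c);
    upper_table_ready = 1;
}

/* Per-character path: a single array load, no CRT call. */
unsigned char fast_toupper(unsigned char c)
{
    if (!upper_table_ready)
        upper_table_init();   /* lazy init; not thread-safe as written */
    return upper_table[c];
}

/* Upper-case a byte buffer in place using the cached table. */
void fast_upper_buf(unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = fast_toupper(s[i]);
}
```

After the first call, each character costs one table lookup, so whatever the CRT does internally is paid 256 times per table build instead of once per character.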
Maybe my PS I've spend 3 days searching ReactOS for what is the limit for U8's per "char" for a "MBCS" code page on a technical MS NLS C API level. I believe
BTW I believe
IDK enough. Maybe this toupper()/isupper() bug has something to do with Perl's newish many-reader/single-writer locking code that serializes access to the process-global locale across OS threads to prevent races.
What are Perl in C's mandatory requirement for vendor C std lib toupper()/isupper() ? https://en.cppreference.com/w/cpp/string/byte/toupper says no
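If a given call site only needs 7-bit ASCII case mapping (an assumption on my part, nothing in this thread establishes which call sites qualify), a locale-free helper avoids the CRT completely. A minimal sketch, with `ascii_toupper()` and `ascii_upper_buf()` as made-up names:

```c
#include <stddef.h>

/* Locale-independent: only 'a'..'z' are changed, every other byte
 * value (including 0x80-0xFF) is returned untouched. */
unsigned char ascii_toupper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - ('a' - 'A')) : c;
}

/* Upper-case a buffer in place. The cast to unsigned char avoids
 * passing negative values when plain char is signed. */
void ascii_upper_buf(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = (char)ascii_toupper((unsigned char)s[i]);
}
```

This deliberately ignores the current locale, so it is not a drop-in replacement wherever locale-aware behaviour is required.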
As you and I both agreed on IRC, there is some really poor quality Win32-only code inside https://github.com/Perl/perl5/blob/blead/win32/perlhost.h that turns the But I'm less concerned about the performance of creating ithread # 2 in a WinOS proc, vs the perl interp executing this broken slow
Another idea, on WinPerl, is a codebase wide grep 9 stack That branch in If libperl.dll always passes a locale_t as arg 2, that Perl process-wide thread-wide locale settling race bug with WinPerl serializing multi-OS thread access, using a very poor DIY-ed by Perl re-implementation of MS's Slim reader/writer (SRW) API https://learn.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks that whole API thing, basically will disappear through macros/etc from WinPerl/libperl.dll, maybe the exported lock variables stay for less than perfect CPAN XS code, but nothing in libperl.dll will ever obtain that serialize lock ever again, And MS UCRT Devs probably can't even see the It doesn't matter in 2025, but IIRC |
We (probably @khwilliamson) could change perl to use _create_locale() and the

Unfortunately _create_locale() doesn't match the behaviour of POSIX newlocale(), since you can't modify an existing locale object to mix locales the way you can with newlocale(). To behave the same we'd need to keep separate locale objects for each category, and that has problems for functions that work with more than one locale category (strftime at least), though I believe such mixing is usually a bad idea.

But, even if isupper() is 133x slower, how much of an effect does that have on real code? It might be worth benchmarking and profiling related code (regexps?) to see.
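For reference, the POSIX side of that comparison looks roughly like the sketch below: `newlocale()` takes an existing locale object as its third argument and returns a modified one, which is the "mixing" that `_create_locale()` cannot do. This is POSIX-only code and does not build against UCRT. The helper names are hypothetical, and the demo uses "C" for both categories only because that locale is always available.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>

/* Build a locale object whose LC_CTYPE and LC_NUMERIC come from two
 * (possibly different) named locales. Returns (locale_t)0 on failure. */
locale_t make_mixed_locale(const char *ctype_name, const char *numeric_name)
{
    locale_t base = newlocale(LC_CTYPE_MASK, ctype_name, (locale_t)0);
    if (base == (locale_t)0)
        return (locale_t)0;

    /* Modify the existing object: on success `base` is consumed. */
    locale_t mixed = newlocale(LC_NUMERIC_MASK, numeric_name, base);
    if (mixed == (locale_t)0) {
        freelocale(base);     /* on failure `base` is still valid */
        return (locale_t)0;
    }
    return mixed;
}

/* Upper-case one character against an explicit locale object, without
 * touching the thread or process locale. */
int upper_with_mixed_c_locale(int c)
{
    locale_t loc = make_mixed_locale("C", "C");
    if (loc == (locale_t)0)
        return c;
    int r = toupper_l(c, loc);
    freelocale(loc);
    return r;
}
```

In real use the locale object would be created once and reused; it is created per call here only to keep the sketch short.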
I didn't really look into this at all, but if there's a performance regression, perhaps it would make sense to report it to Microsoft? Technically, UCRT is part of the OS and reporting a bug in Windows requires a support contract (well, there's also the Feedback Hub, but no one reads it...). However, there's a workaround: UCRT bugs reported on Visual Studio Developer Community are often forwarded to the Windows team. That said, I imagine performance regressions are low priority for them. |
Some links https://learn.microsoft.com/en-us/cpp/c-runtime-library/recommendations-for-choosing-between-functions-and-macros?view=msvc-170
some limited quotes from UCRT headers,
There is more questionable MS code in the UCRT .cpp files that is important to look at. MS devs occasionally put in comments saying why they did things the way they did, or what end user hazard they were coding around. But the UCRT/MSVC compiler has a source-available license, not a FOSS license, so I'd rather not copy-paste large method/function bodies into this archival GH ticket. Steve Hay, Tonyc, etc., I know all of you have the UCRT src code on your systems just like I do. Notice Also WinPerl use
Do you have www links to the posix or msdn or p5p GH git repo APIs ? Follow Sarathy's principles for WinPerl, "POSIX" can't and never will exist inside WinPerl. Either P5P fakes it, or MS CRT fakes it. All of "libc" features/vm state are fictional and only exist inside 1 winperl process address space. There is no interop/ipc/IO communication with other procs using POSIX [tokens] APIs. Either P5P or the CRT, always converted the posix-ese into Kernel32.dll-exe. I'm wondering if WinPerl do
So that looks to me like WinOS and the concept of Locales run very deep into Ring 0 in Windows. Basically MS designed it so the text debugging log of a kernel sound card driver will change the thousands separator character in a sys admin debug log file, within 1000 microseconds/1 millisecond/1 video frame, after OS wide

Maybe UCRT needs compliance with this, but Perl and PP state doesn't. Due to lack of knowledge IDK what Perl is trying to exactly fix in WinPerl, and who the consumer/end user is of the fix. And is the "fix" for actual observed defects in Perl end user production code, or is the 5.36/5.38/5.40 locale safety/serialization code trying to fix a

My generic instinct says, since APIs like

That's the reason I keep thinking
I caught this in a profiler attached to
I'd have to recomp perl.dll with a bunch of There might be a P5P bug, that somehow PP or CPAN XS or p5p C is passing locale "en-US; English; United States; NY; New York; 6 Empire Plaza; Suite 715;" which is "legal" but a denormal locale setting, one that needs to be parsed/normalized in a loop O(n) against the NLS/Registry master disk data each time to normalize it to "en-US; English; United States;". IDK enough.
other things I forgot, UCRT's 2nd, this is a comment from the CRT src code. I can't copy paste the whole thing here, but this comment is really important.
leaf function in locale usage terms My reading of that means its P5P's bug and P5P's defect for failing to
or in PerlXS API p5p is doing on a
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale?view=msvc-170 I'd have to benchmark it to prove it, but I have a suspicion all of UCRT's internal cache code regarding parsing locale wide or ascii string names and its If UCRT's internal TLS structs don't "match", the ultimate source of truth for POSIX locales/is_/to_ after all the bloat and very badly written C++ classes, UCRT will be getting the truth for is_/to_ from kernel32's Since the concepts of Latin-1/7 bit ASCII/mbcs/8 bit characters, don't exist on Windows OS, except through the opaque casting function called |
Similar src code problem in mingw 8.3.0/Strawberry 5.32's bundled gcc

Reading Mingw's "ASCII to Wide" code, I'm seeing the same O(n^2)-ness that UCRT is doing: implementing each byte of the for loop as a dozen calls into kernel32.dll, which is a dozen TLS calls best case, and worst case 100s of TLS calls and 100s of files that need to be decompressed and parsed. IsDBCSLeadByteEx() calls NlsValidateLocale() https://github.com/wine-mirror/wine/blob/8d40da7ffda5e8dde9200f733e7d2cebf0196bc3/dlls/kernelbase/locale.c#L723 and NlsValidateLocale() sets off a chain of decompressing and parsing a couple 100/1000 locales on disk/on a mmap 100s/1000s of permutations in there

Update: the IsDBCSLeadByteEx() byte-by-byte algorithm seems to be copy-pasted from SO https://stackoverflow.com/a/27196334 and is probably repeated over and over by college students or young developers on certain unreliable tech social media sites, known for anonymous users, trying to ChatGPT their way to the next account ribbon/flair.
maybe related to this ticket https://bugs.python.org/issue7442 https://bugs.python.org/issue31900 https://vstinner.github.io/python3-locales-encodings.html general tech prose, probably not related to whatever im benchmarking |
Module:
Description
A certain profiling call stack caught my eye, and the final report from my profiler said 8% of all CPU time of perl is spent inside isupper()/toupper() from ucrtbase.dll. These are floating between place 4 and place 8 as the highest CPU hogs on random core .t files. Seeing one of them reach #1 was jaw-dropping. Hence I investigated.

Some research: this is 1 call for 1 U8, BTW. ::LocaleUpdate has 6 FlsGetValue calls (wrapped with GetLastError preservation), toupper() fires ::LocaleUpdate() every time, errno in UCRT added another 4-5 FlsGetValue calls, and __acrt_LCMapStringA() fires ::LocaleUpdate again,
soon after
a few CPU instruction addresses later (remember, lines of code have loops)
kernelbase.dll tries building a tree of nodes or iterating all country codes on earth. The data being searched by KernelBase.dll!GetNamedLocaleHashNode looks like
but this is raw memory with unprintables regexed out. I think it's country codes, but I'm not going to reverse engineer it.
Benchmarks: it's horrible
With pseudo-threads on 3 cores. IDK enough to tell whether this is scaling or lock contention, or whether it is happening on the Perl side or the MS side.
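For anyone who wants to measure the CRT in isolation, outside of perl, a minimal C harness along these lines should be enough to compare the same loop linked against ucrtbase.dll vs msvcrt.dll. This is a sketch, not the benchmark I ran. `time_toupper_calls()` is a made-up name, and `clock()` is processor time per the C standard, so swap in a wall-clock timer if wall time is what you care about.

```c
#include <ctype.h>
#include <time.h>

/* Time `iterations` calls to toupper(), one U8 per call, cycling through
 * all byte values 0x00-0xFF. Returns elapsed clock() time in seconds. */
double time_toupper_calls(unsigned long iterations)
{
    volatile unsigned sink = 0;   /* stops the compiler deleting the loop */
    clock_t start = clock();

    for (unsigned long i = 0; i < iterations; i++)
        sink ^= (unsigned)toupper((int)(i & 0xFF));

    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call it with something like 10 million iterations, divide by the iteration count to get a per-call cost, and repeat the run a few times since the first pass includes any one-time locale setup.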
Steps to Reproduce
Expected behavior
Half joke, half serious: remove UCRT from the default WinPerl build config and link against msvcrt.dll.
Perl configuration