Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCRTbase.dll toupper() is 133x slower wall time than perl/msvcrt.dll #23037

Open
bulk88 opened this issue Feb 27, 2025 · 14 comments
Open

UCRTbase.dll toupper() is 133x slower wall time than perl/msvcrt.dll #23037

bulk88 opened this issue Feb 27, 2025 · 14 comments
Assignees

Comments

@bulk88
Copy link
Contributor

bulk88 commented Feb 27, 2025

Module:

Description
A certain profiling call stack caught my eye and the final report from my profiler said 8% of all cpu time of perl is spent inside. isupper()/toupper() from ucrtbase.dll, these are floating between place 4- place 8 as highest CPU hogs on random core .t'es. upper() Reaching # 1 was jaw dropping. Hence I investigated.

Image

Image

some research this is 1 call about 1 U8 BTW, ::LocaleUpdate has 6 FlsGetValue calls (wraped with glerr preserving), toupper() fires::LocaleUpdate() every time, errorno in ucrt added another 4-5 FLSGV calls __acrt_LCMapStringA�() fires ::LocaleUpdate again ,

	KernelBase.dll!LCMapStringEx�()	Unknown
 kernel32.dll!LCMapStringExStub�()	Unknow
ucrtbase.dll!__acrt_LCMapStringA_stat()	Unknown
 ucrtbase.dll!__acrt_LCMapStringA�()	Unknown
 ucrtbase.dll!toupper�()	Unknown

soon after

KernelBase.dll!DefaultSortVersion()	Unknown
 KernelBase.dll!VersionValue�()	Unknown
 	KernelBase.dll!SortChangeCase�()	Unknown 
  KernelBase.dll!LCMapStringEx�()	Unknown

a few cpu ins addrs later (remember lines of code have loops)

MinSortChangeCase�()
KernelBase.dll!GetNamedLocaleHashNode()	Unknown
KernelBase.dll!SortChangeCase�()	Unknown
KernelBase.dll!LCMapStringEx�()	Unknown
kernel32.dll!LCMapStringExStub�()	Unknown
ucrtbase.dll!__acrt_LCMapStringA_stat()	Unknown
ucrtbase.dll!__acrt_LCMapStringA�()	Unknown
ucrtbase.dll!toupper�()	Unknown

kernelbase.dll tries building a tree of nodes or iterating all country codes on earth, data being searched by KernelBase.dll!GetNamedLocaleHashNode looks like

but this is raw memory with unprintables regexped out, i think its country codes but im not going rev eng it

benchmarks its horrible

C:\sources\crtslow\CRTSlow>perl -Mblib -MCRTSlow -e"CRT::Be();";
cache wake                       78063695 us Ln 337
tolower                        2440267443 us Ln 341
_tolower                         18608920 us Ln 345
toLOWER_A                        17415609 us Ln 349
toLOWER_L1                       17737359 us Ln 353
isgraph                        2562596668 us Ln 357
isGRAPH_A                        18712061 us Ln 361
isGRAPH_L1                       17790779 us Ln 365
isalnum                        2520004815 us Ln 369
isALPHANUMERIC_A                 18165949 us Ln 373
isALPHANUMERIC_L1                18033632 us Ln 377
isalnum msvcrt                  111663798 us Ln 392
isalnum msvcr100                 98945415 us Ln 402
isalnum msvcr120                 97687178 us Ln 412
C:\sources\crtslow\CRTSlow>

with psudo threads 3 cores, idk enough if this is scaling or lock contention perl side or ms side is happening

C:\sources\crtslow\CRTSlow>perl -Mblib -MCRTSlow -e"$r = fork(); $t = fork(); ex
it if !$t && $r;  CRT::Be();";
cache wake                       32082575 us Ln 337
cache wake                       33263148 us Ln 337
cache wake                       40918586 us Ln 337
tolower                        3157700393 us Ln 341
tolower                        3165074351 us Ln 341
_tolower                         22321171 us Ln 345
_tolower                         20705845 us Ln 345
toLOWER_A                        22430065 us Ln 349
toLOWER_A                        20589965 us Ln 349
toLOWER_L1                       22076263 us Ln 353
toLOWER_L1                       23926635 us Ln 353
tolower                        3348870216 us Ln 341
_tolower                         23873216 us Ln 345
toLOWER_A                        24134972 us Ln 349
toLOWER_L1                       23457776 us Ln 353
isgraph                        3397327541 us Ln 357
isGRAPH_A                        20434638 us Ln 361
isGRAPH_L1                       24083196 us Ln 365
isgraph                        3679713786 us Ln 357
isgraph                        3694650315 us Ln 357
isGRAPH_A                        24393851 us Ln 361
isGRAPH_A                        25253907 us Ln 361
isGRAPH_L1                       18637274 us Ln 365
isGRAPH_L1                       23057951 us Ln 365
isalnum                        3123988931 us Ln 369
isALPHANUMERIC_A                 23420793 us Ln 373
isALPHANUMERIC_L1                20040976 us Ln 377
isalnum                        3220601143 us Ln 369
isalnum                        3333259367 us Ln 369
isALPHANUMERIC_A                 26456259 us Ln 373
isALPHANUMERIC_A                 25220212 us Ln 373
isALPHANUMERIC_L1                19804287 us Ln 377
isALPHANUMERIC_L1                26882794 us Ln 377
isalnum msvcrt                  144513972 us Ln 392
isalnum msvcrt                  151367705 us Ln 392
isalnum msvcrt                  151673430 us Ln 392
isalnum msvcr100                133158199 us Ln 402
isalnum msvcr100                124220690 us Ln 402
isalnum msvcr100                123605543 us Ln 402
isalnum msvcr120                129325137 us Ln 412
isalnum msvcr120                118722581 us Ln 412
isalnum msvcr120                121761334 us Ln 412
C:\sources\crtslow\CRTSlow>

Steps to Reproduce

#define PERL_NO_GET_CONTEXT
#define  WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"


/* Global Data */
LARGE_INTEGER Frequency = { 0 };
#define g_Frequency Frequency

START_MY_CXT


/* BTIME = BENCH TIME*/
#define BTIMESTART do { \
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds; NV nv1; NV nv2; \
    QueryPerformanceCounter(&StartingTime)

#define BTIMEEND(label) \
    QueryPerformanceCounter(&EndingTime); \
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart; \
    ElapsedMicroseconds.QuadPart *= 1000000000; \
    ElapsedMicroseconds.QuadPart = ElapsedMicroseconds.QuadPart           \
    /((LARGE_INTEGER*)(&Frequency))->QuadPart; \
    printf("%-30s %10I64u us Ln %u\n", label, ElapsedMicroseconds.QuadPart, __LINE__); \
} while(0)

#define VP(_vp) ((size_t)(_vp))
#define CP(_cp) ((char *)(_cp))
#define VPP(_p) ((void**)(_p))
#define LST for(n=0; n < 10; n++) { p = CP(low); while(p < CP(hi)){ c = *p;
#define LEND  c1 += r; p++;}   }
#  ifndef MIN
#    define MIN(a,b) ((a) < (b) ? (a) : (b))
#  endif
#  ifndef MAX
#    define MAX(a,b) ((a) > (b) ? (a) : (b))
#  endif

MODULE = CRT		PACKAGE = CRT

void
Be()
PPCODE:
  PUTBACK;
  HMODULE h_orig;
  GetModuleHandleExW(
  GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
  &PL_No[0],
    &h_orig
  );
      const IMAGE_DOS_HEADER* h = (const IMAGE_DOS_HEADER*)h_orig;
      PIMAGE_NT_HEADERS nt_header = (PIMAGE_NT_HEADERS const)(VP(h) + VP(h->e_lfanew));
	WORD NumberOfSections = nt_header->FileHeader.NumberOfSections;
  /* IMAGE_FIRST_SECTION() macro is universal, and corrects for OPTIONAL32 vs OPTIONAL64 */
	const IMAGE_SECTION_HEADER * sec = IMAGE_FIRST_SECTION(nt_header);
	const IMAGE_SECTION_HEADER * secEnd = sec + NumberOfSections;
    char * low =~0;
  char * hi = 0;
    char * lowi;
  char * hii;
  U32 r;
  U32 c1 = 0;
  U8 c;
  char * p;
  int n =0;
	for(; sec < secEnd ; sec++) {
    lowi = VP(h) + VP(sec->VirtualAddress);
    hii = lowi + VP(sec->SizeOfRawData);
   low = MIN(low,lowi);
   hi = MAX(hi,hii);
  }
 BTIMESTART;
LST r = c; LEND
BTIMEEND("cache wake");
BTIMESTART; LST
    tolower(c);
LEND
BTIMEEND("tolower");
BTIMESTART; LST
    _tolower(c);
    LEND
      BTIMEEND("_tolower");
    BTIMESTART; LST
    toLOWER_A(c);
    LEND
      BTIMEEND("toLOWER_A");
    BTIMESTART; LST
    toLOWER_L1(c);
    LEND
      BTIMEEND("toLOWER_L1");
    BTIMESTART; LST
    isgraph(c);
    LEND
      BTIMEEND("isgraph");
    BTIMESTART; LST
    isGRAPH_A(c);
    LEND
      BTIMEEND("isGRAPH_A");
    BTIMESTART; LST
    isGRAPH_L1(c);
    LEND
      BTIMEEND("isGRAPH_L1");
    BTIMESTART; LST
    isalnum(c);
    LEND
      BTIMEEND("isalnum");
    BTIMESTART; LST
    isALPHANUMERIC_A(c);
    LEND
      BTIMEEND("isALPHANUMERIC_A");
    BTIMESTART; LST
    isALPHANUMERIC_L1(c);
    LEND
      BTIMEEND("isALPHANUMERIC_L1");
  const char * dllname = 0;
  unsigned char flag;
  unsigned char f_type;
  unsigned char f_len;
  HANDLE h2;
  dllname = (char *)&ms_crt_dllnames;
  typedef int (__cdecl * isctypefn_t)(int);
    h2 = LoadLibrary("msvcrt");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcrt");
      }
    }
    h2 = LoadLibrary("msvcr100.dll");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcr100");
      }
    }
    h2 = LoadLibrary("msvcr120.dll");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcr120");
      }
    }
    iscntrl(c);
    isCNTRL_A(c);
    isCNTRL_L1(c);
    ispunct(c);
    isPUNCT_A(c);
    isPUNCT_L1(c);
    isspace(c);
    isSPACE_A(c) ;
    isSPACE_L1(c) ;
    isxdigit(c);
    isXDIGIT_A(c);
    isXDIGIT_L1(c);
    isdigit(c);
    isDIGIT_A(c);
    isDIGIT_L1(c);
    isalpha(c);
    isALPHA_A(c);
    isALPHA_L1(c);
    XSRETURN_IV(c1);
    return;
 

MODULE = CRTSlow		PACKAGE = CRTSlow		


BOOT:
{
    MY_CXT_INIT;
    /* If any of the fields in the my_cxt_t struct need
       to be initialised, do it here.
     */
    QueryPerformanceFrequency(&g_Frequency);
}

Expected behavior

Half joke half serious, but remove UCRT from default build config win perl and link against msvcrt.dll.

Perl configuration

Summary of my perl5 (revision 5 version 41 subversion 7) configuration:
  Derived from: 73172a67eaae5671dffc06b427f005810d151472
  Platform:
    osname=MSWin32
    osvers=6.1.7601
    archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cl'
    ccflags ='-nologo -GF -W3 -MD -TC -DWIN32 -D_CONSOLE -DNO_STRICT -DWIN64 -D
CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -D_WINSOCK_DEPRECATED_NO_WA
NINGS -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DWIN32_NO_REG
STRY -DUSE_PERLIO'
    optimize='-O1 -Zi -GL -fp:precise'
    cppflags='-DWIN32'
    ccversion='19.36.32535'
    gccversion=''
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=undef
    longlongsize=8
    d_longdbl=define
    longdblsize=8
    longdblkind=0
    ivtype='__int64'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='__int64'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='link'
    ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\pb64
lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'
    libpth="C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MS
C\14.36.32532\lib\x64"
    libs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.l
b advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.
ib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.li
 vcruntime.lib ucrt.lib
    perllibs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg
2.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2
32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcr
.lib vcruntime.lib ucrt.lib
    libc=ucrt.lib
    so=dll
    useshrplib=true
    libperl=perl541.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c
\pb64\lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'


Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_TIMES
    HAVE_INTERP_INTERN
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_NO_REGISTRY
    USE_PERLIO
    USE_PERL_ATOF
    USE_THREAD_SAFE_LOCALE
  Locally applied patches:
    uncommitted-changes
  Built under MSWin32
  Compiled at Dec 20 2024 10:03:46
  %ENV:
    PERL_DOBK="1"
    PERL_DOBP="1"
    PERL_DODB="1"
  @INC:
    C:/pb64/site/lib/MSWin32-x64-multi-thread
    C:/pb64/site/lib
    C:/pb64/lib

@bulk88
Copy link
Contributor Author

bulk88 commented Feb 27, 2025

python/cpython#79376

https://bugs.python.org/issue35195

In 2018 Python identified this problem. Py ticket remains open ATM Feb 2025. IDK enough arch/API/design/tech info to understand all the comments in the cPy tickets if there is a proposed fix or reject fix or unfairly rejected fix in those 2 tickets.

@khwilliamson
Copy link
Contributor

UCRT works; many bugs went away when we converted to use it.

@tonycoz
Copy link
Contributor

tonycoz commented Feb 27, 2025

I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:

  • it's trying to emulate the case-insensitivity of the Win32 UTF-16 environment with localized 8-bit (possibly MBCS) strings
  • it's using locale sensitive functions to do it, so if you change locale the order may become invalid
  • the ::Add function pushes new strings on the end and the does a sort, making it $O(N \log N)$ when it could be $O(N)$ (the inserting would be $O(N)$, finding the right place to put it is $O(\log N)$ )
  • the copy constructor calls ::Add for each entry in the table, making it $O(N^2 \log N)$ when simply allocating a new array and copying over the entries would be $O(N)$.

Fixing all this would eliminate the toupper/isupper() calls, I don't know off-hand what the appropriate Win32 API would be.

@tonycoz tonycoz self-assigned this Feb 28, 2025
@bulk88
Copy link
Contributor Author

bulk88 commented Feb 28, 2025

Forgot to add in the OP.

Since 5.37.10 and commit 8a548d1 msvcrt.dll yes that file, C:\windows\System32\msvcrt.dll is static DLL linked into the perl process virtual addr space 100% of the time, through ws2_32.dll aka winsock static DLL linking against it. I am very dissatisfied at commit 8a548d1 since delay loading winsock was the best milliseconds saving optimization ever done to WinPerl. Especially how EU::MM makefiles have dozens and dozens of very short lived (15-100ms lifetime) perl processes to build a module.

The P5P repo's .t files , and less so CPAN, will call sub runperl() at

perl5/t/test.pl

Line 857 in 16196ae

sub runperl {
10Ks IDK 100Ks of times during a core blead make test. Copy pasted from a GH runner, blead perl has 1.2 million tests.

Files=2910, Tests=1193052, 418 wallclock secs (94.06 usr 10.78 sys + 493.74 cusr 65.30 csys = 663.88 CPU)

100K*4ms= 6.6 minutes faster core make test. 4ms is a wild guess, it will vary machine to machine user to user. Its probably 10 or 20 ms for me and my older equip. I have plans to return ws2_32.dll back to delay loading. My current returning delay loading winsock patch is failing a couple tests and needs some refactor to remove some irrelevant code edits. Its also stalled by another patch I have making user32.dll a delay load, which is not simple, b/c WinPerl's runloop always has monitored the Windows GUI event loop since day 1, and CPAN GUI libs can't be broken by WinPerl's runloop suddenly not dispatching SIGNALS Windows OS GUI messages from user32.dll. But I already protoyped and ran 3 implementations of delay-loaded user32.dll and have 4 more implementations brainstormed in my comments. Did you know 50% of perl.exe'es, in blead perl gmake test exit()/process terminate, before reaching 35/50 milliseconds point of their process run life?

55 millisecond is 1.7 frames at 30 frames per second. Blead perl currently has a 33,000 OP*s executed timer, before the first time it polls the Win32 GUI loop. Its crazy "link av.obj hv.obj perl.obj /delayload:user32.dll -o perl541.dll" really helps with blead perl core self gmake test. No other changes. A PR with just /delayload:user32.dll in it, is inappropriate IMO, b/c technical details I won't go into.

So UCRTbase.dll vs isupper() vs toupper() vs msvcrt.dll. Because msvcrt.dll is static linked into perl since 5.37.10. And my benchmarks in OP, were done with UCRTbase.dll and msvcrt.dll inside the same process, and the perl process already had msvcrt.dll in VA space, I didn't explicitly load it.

The question now is, does WinPerl selectivly replace cherry picked, problematic, slow, libc calls in ucrtbase.dll, with their same name libc calls in msvcrt.dll?

Because perl.exe has the choice of which one to call at runtime, they both are available at all times inside a perl process. The call stacks, profiler reports, and my benchmarks show an ex^^^^ponential multiple orders of magnitude performance difference, between 2 difference implementations, of the same exact C standard lib function.

Next question, why is WinPerl even C linking against MS's isupper()/toupper(), when Perl/P5P implements their own portable isupper()/toupper()? Is that itself a bug? My benchmarks included the Perl portable replacements for the platform vendor/OS/CC provided versions. They exist, they are fast/perfect.

Would slurping/looping U8 values 0x00-0xFF, 1x on process start, through MS UCRT's isupper()/toupper(), and caching the output 0/1 bits or U8s, work?

Nobody can justify enumerating all 250 country codes on earth in a SQL DB/for loop+strcmp(), loopmalloc()ing a red black tree, loop balancing it, loop walking the nodes until match, then loop free(), for each execution of if(isupper(char var_char)) {0;} else {0;} on a 8-bit integer.

You can't upper case an ASCII string, for each 8 bit character, you posting a new job ad on LinkedIn, interview and hiring a new developer and agree on a consulting contract and fee schedule, he reads the ASCII char and writes with a pen, 01000001, and hands you the paper with 01000001 written on it, and you hand him a check for $500, and his employment at you company terminates. He was paid $500 for 15-25 seconds of work. Great company to work for. 5 stars employer. Thats what UCRT is doing internally.

3 rd possible fix, the most difficult fix, which is beyond my expertise, figure out why ucrtbase.dll's cache logic are disabled at runtime inside a perl.exe process. I physically can see my ucrtbase.dll has caching mechanisms/conditional branches and shortcut mechanisms/conditional branches, but for unknown reasons, at runtime, every single time, where an if(){}else{} conditional jmp opportunity exists, the "heavy" no-cache branch of machine code was picked to execute, and not the "fast cache" array/C struct bitfield execution path. I'm not going to single step it again right now, but I counted atleast 3 if/else branch "fast cache, quick return" opportunities inside ucrtbase.dll, that didn't execute, before ucrtbase.dll handed over control to kernel32.dll, and kernel32.dll/kernelbase.dll LCMapStringEx() started searching a SQL DB of country codes, or digging through a memory mapped file of locales and country codes on my disk and inflating that "binary disk file" with variable length intergers, or gzip/huffman encoded vals, 0-indexed offsets, to CPU friendly C structs.

The API docs for LCMapStringEx()@kernel32.dll https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex#remarks actually spell out, the user should NOT call LCMapStringEx() 1x per char, but pass LCMAP_SORTKEY, and obtain a reference counted (???) sort object from LCMapStringEx() and feed that "sort cache object" into some other API call. ucrtbase.dll is calling LCMapStringEx() 1x per char, which the API docs of LCMapStringEx() specifically say is a bad idea.

So did perl.exe/perl5xx.dll/perl5porters do something wrong and explicitly disable the cache logic inside ucrtbase.dll?

Or this is a bug inside ucrtbase.dll, which only Microsoft can fix, and a member of the public must file a public bug ticket with MS, and MS devs must recompiling and publishing a new higher build number of ucrtbase.dll? Beyond scope for me to diag this. IDK enough.

@bulk88
Copy link
Contributor Author

bulk88 commented Feb 28, 2025

I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:

  • it's trying to emulate the case-insensitivity of the Win32 UTF-16 environment with localized 8-bit (possibly MBCS) strings

Maybe my ucrtbase.dll is going "crazy" because it thinks I'm passing asian MBCS upto 2^16 fake-ascii and it needs to be compliant with MBCS rules, even tho I live in USA, and if I have a 2nd locale or keyboard installed for personal reasons, it will be a locale/keyboard layout somewhere in Continental Europe, but quite far from London or Paris (not Latin-1 but still a normal 0x80-0xFF code page). I have no idea why a MBCS code path would ever trigger on my system, but cPython has the exact same bug I have but its reported and reproduced by other people. Its not unique to me.

PS I've spend 3 days searching ReactOS for what is the limit for U8's per "char" for a "MBCS" code page on a technical MS NLS C API level. I believe MultiByteToWideChar allows 1 MBCS character to be of length I32_MAX or input buffer length in U8, or U8 '\0'. You can do it by using a MBCS continuation mask U8 byte and repeating it forever to the end of the your input byte buffer. A 1 MB long character will become 2 bytes of output, aka UTF16 replacement char L'?'. The other direction, UTF16 to legacy MBCS can't generate such nonsense since UTF16's maximum output to legacy is, 4 bytes of UTF16 generates 4 bytes of Legacy CP or 4 bytes of UTF8 CP. Perhaps with surrogate abuse 4 bytes Wide becomes 6 bytes legacy output.

use Encode;
use Devel::Peek;
my $source = Encode::decode("UTF-16LE","\xF0\xDB\xF0\xDB");
Dump($source);
SV = PV(0x48e1f8) at 0x50a0a0
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2208aa8 "\xEF\xBF\xBD\xEF\xBF\xBD"\0 [UTF8 "\x{fffd}\x{fffd}"]
  CUR = 6
  LEN = 16

BTW I believe RtlAnsiStringToUnicodeString/RtlMultiByteToUnicodeN and NTFS will not put up with MBCS continuation character abuse. And they are NOT identical to MultiByteToWideChar for all inputs. Benchmarking RtlAnsiStringToUnicodeString vs MultiByteToWideChar is on my todo list, ive never seen someone do it before. I strongly RtlAnsiStringToUnicodeString and siblings will be faster since they take alot of shortcuts, such as being incapable of stopping conversion to wide, at formally unassigned by spec characters in legacy 1 byte per 1 char CPs. "incapable of stopping conversion to wide" means of L"?" wchar_ts in output.

char_low = *char_ptr;
if(char_low & MBCS_CONTINUE) {
    char_high = *(char_ptr+1);
    wide = (char_hi <<16) | char_low;
    char_ptr += 2; /* game over */
}
  • it's using locale sensitive functions to do it, so if you change locale the order may become invalid

IDK enough. Maybe this toupper()/isupper() bug has something to do with that newish in Perl many reader single writer locking process global locale inter-OS thread serializing/anti-race code.

Fixing all this would eliminate the toupper/isupper() calls, I don't know off-hand what the appropriate Win32 API would be.

What are Perl in C's mandatory requirement for vendor C std lib toupper()/isupper() ?

https://en.cppreference.com/w/cpp/string/byte/toupper says no 2^32 or 2^64 abuse

If the value of ch is not representable as unsigned char and does not equal [EOF] , the behavior is undefined.

As you and me both agreed on IRC, there is some really poor quality Win32 only code, inside https://github.com/Perl/perl5/blob/blead/win32/perlhost.h that turns the toupper() O(n^2) (?) bug into a O(n^n) (?) bug in 1 rare place in WinPerl not UnixPerl and only if you make a 2nd ithread/use threads.pm; fork(); in WinPerl.

But I'm less concerned about performance of creating ithread # 2 in a WinOS proc, vs perl interp executing this broken slow toupper() from the run loop in very generic, very common, very production, PP code executing through very vanilla cross-platform Perl C code in Perl_pp_some_op();. Image # 2 in the first post, shows a callstack, where toupper() was executed by the interpreter, from very vanilla cross platform posix-ish Perl C code. Not #ifdef WIN32\n #endif code. Vanilla PP code executing toupper() at a high frequency from generic typical PP code scares me alot more. This is report generated from a C symbol file of all callers of these problem UCRT functions in perl5.41.10.dll. 95% of callers are cross platform perl code, half of them are extreme frequency runloop Perl_pp_*() ops.

call    cs:__imp_isalnum S_find_byclass
call    cs:__imp_isalnum S_isFOO_lc
call    cs:__imp_isalnum S_new_ctype
call    cs:__imp_isalnum S_regmatch
call    cs:__imp_isalpha S_isFOO_lc
call    cs:__imp_isalpha S_new_ctype
call    cs:__imp_iscntrl Perl_mem_collxfrm_
call    cs:__imp_iscntrl Perl_pp_fttext
call    cs:__imp_iscntrl Perl_yyerror_pvn
call    cs:__imp_iscntrl S_find_byclass
call    cs:__imp_iscntrl S_isFOO_lc
call    cs:__imp_iscntrl S_new_ctype
call    cs:__imp_iscntrl S_regmatch
call    cs:__imp_iscntrl S_sv_display
call    cs:__imp_isdigit S_isFOO_lc
call    cs:__imp_isdigit S_new_ctype
call    cs:__imp_isgraph S_isFOO_lc
call    cs:__imp_isgraph S_new_ctype
call    cs:__imp_islower S_isFOO_lc
call    cs:__imp_islower S_new_ctype
call    cs:__imp_isprint Perl_pp_fttext
call    cs:__imp_isprint Perl_yyerror_pvn
call    cs:__imp_isprint S_isFOO_lc
call    cs:__imp_isprint S_new_ctype
call    cs:__imp_isprint S_sv_display
call    cs:__imp_ispunct S_find_byclass
call    cs:__imp_ispunct S_isFOO_lc
call    cs:__imp_ispunct S_new_ctype
call    cs:__imp_ispunct S_regmatch
call    cs:__imp_isspace Perl_pp_fttext
call    cs:__imp_isspace Perl_pp_split
call    cs:__imp_isspace S_isFOO_lc
call    cs:__imp_isspace S_new_ctype
call    cs:__imp_isupper S_isFOO_lc
call    cs:__imp_isupper S_new_ctype
call    cs:__imp_isxdigit S_isFOO_lc
call    cs:__imp_isxdigit S_new_ctype
call    cs:__imp_tolower Perl__to_utf8_fold_flags
call    cs:__imp_tolower Perl__to_utf8_lower_flags
call    cs:__imp_tolower Perl_pp_fc
call    cs:__imp_tolower Perl_pp_lc
call    cs:__imp_tolower Perl_pp_ucfirst
call    cs:__imp_tolower S_new_ctype
call    cs:__imp_toupper Perl__to_utf8_title_flags
call    cs:__imp_toupper Perl__to_utf8_upper_flags
call    cs:__imp_toupper Perl_pp_uc
call    cs:__imp_toupper Perl_pp_ucfirst
call    cs:__imp_toupper S_new_ctype
call    cs:__imp_toupper compare
call    cs:__imp_toupper lookup

@bulk88
Copy link
Contributor Author

bulk88 commented Feb 28, 2025

Another idea, on WinPerl, is a codebase wide grep isupper() -> isupper_l() needed and 100% removal of isupper() forever? If you use isupper_l() and pass a locale ptr in arg 2, The horribly inefficient LocaleUpdate::_LocaleUpdate which calls FlsGetValue() atleast 5x or more. Will take a 2 CPU instructions big shortcut and instant return.

9 stack push non-vol reg+test+cond_jmp+sse_mov_ptr_to_reg+sse_mov_reg_to_ptr+abs_jmp+9 stack pop non-vol reg+x86 ret

That branch in LocaleUpdate::_LocaleUpdate is a NOOP compared to what LocaleUpdate::_LocaleUpdate workload/cpu op countwise/runtime overhead wise, does right now inside the perl.exe proc.

If libperl.dll always passes a locale_t as arg 2, that Perl process-wide thread-wide locale settling race bug with WinPerl serializing multi-OS thread access, using a very poor DIY-ed by Perl re-implementation of MS's Slim reader/writer (SRW) API https://learn.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks that whole API thing, basically will disappear through macros/etc from WinPerl/libperl.dll, maybe the exported lock variables stay for less than perfect CPAN XS code, but nothing in libperl.dll will ever obtain that serialize lock ever again,

And MS UCRT Devs probably can't even see the LocaleUpdate::_LocaleUpdate method call since its a "POD" void * class member field in their source code and they forgot its a getter setter op overload method and not an machine type, and LocaleUpdate::_LocaleUpdate can't constant fold anything b/c rules of C/C++ (FlsGetValue() is a C symbol, you multi evaled it, inline and constant folding it is illegal in C).

It doesn't matter in 2025, but IIRC ucrtbase.dll on WinXP with VC2013, calls GetProcAddress("FlsGetValue"), GPA returns NULL, then ucrtbase.dll executes TlsGetValue() every single time (5x-6x).

@tonycoz
Copy link
Contributor

tonycoz commented Mar 11, 2025

We (probably @khwilliamson ) could change perl to use _create_locale() and the _is*_l() functions instead of the C standard functions, though this would be yet more Win32 specific code.

Unfortunately _create_locale() doesn't match the behaviour of POSIX newlocale, since you can't modify an existing locale object to mix locales the way you can with newlocale().

To behave the same we'd need to keep separate locale objects for each category, and that has problems for functions that work with more than one locale category (strftime at least), though I believe such mixing is usually a bad idea.

But, even if isupper() is 133x slower, how much of an effect does that have on real code?

It might be worth benchmarking and profiling related code (regexps?) to see.

@xenu
Copy link
Member

xenu commented Mar 11, 2025

I didn't really look into this at all, but if there's a performance regression, perhaps it would make sense to report it to Microsoft?

Technically, UCRT is part of the OS and reporting a bug in Windows requires a support contract (well, there's also the Feedback Hub, but no one reads it...).

However, there's a workaround: UCRT bugs reported on Visual Studio Developer Community are often forwarded to the Windows team.

That said, I imagine performance regressions are low priority for them.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 13, 2025

I didn't really look into this at all, but if there's a performance regression, perhaps it would make sense to report it to Microsoft?

Technically, UCRT is part of the OS and reporting a bug in Windows requires a support contract (well, there's also the Feedback Hub, but no one reads it...).

However, there's a workaround: UCRT bugs reported on Visual Studio Developer Community are often forwarded to the Windows team.

That said, I imagine performance regressions are low priority for them.

Some links

https://learn.microsoft.com/en-us/cpp/c-runtime-library/recommendations-for-choosing-between-functions-and-macros?view=msvc-170
https://learn.microsoft.com/en-us/cpp/c-runtime-library/global-state?view=msvc-170

setlocale is so "unsafe"/"unstable", notice MS sandboxed user mode code from MS user mode code in same addr space with the __o_() feature.

some limited quotes from UCRT headers,

_Check_return_ _CRT_JIT_INTRINSIC _ACRTIMP int __cdecl isupper(_In_ int _C);
_Check_return_ _ACRTIMP int __cdecl _isupper_l(_In_ int _C, _In_opt_ _locale_t _Locale);
#if defined _CRT_DISABLE_PERFCRIT_LOCKS && !defined _DLL && !defined __cplusplus
        #define isupper(c)  (MB_CUR_MAX > 1 ? _isctype(c, _UPPER) : __chvalidchk(c, _UPPER))
#endif
_When_(_Param_(1) == 0, _Post_equal_to_(0))
_Check_return_ _CRT_JIT_INTRINSIC _ACRTIMP int __cdecl toupper(_In_ int _C);
extern "C" extern __inline int (__cdecl _isupper_l)(int const c, _locale_t const locale){
    _LocaleUpdate locale_update(locale);
    return _isupper_l(c, locale_update.GetLocaleT());
}
extern "C" extern __inline int (__cdecl isupper) (int const c) {
    return __acrt_locale_changed() ? (_isupper_l)(c, nullptr) : fast_check(c, _UPPER);
}

There is more questionable MS code in the UCRT .cpp files that is important to look at. MS Devs occasionally put comments saying why they did things the way they did, or what end user hazard they were coding around were. But the UCRT/MSVC compiler has a source available license, not a FOSS license, so I'd rather not copy paste large methods/function bodies into this archival GH ticket. Steve Hay, Tonyc, etc I know all of you have the UCRT src code on your systems just like I do.

Notice extern __inline and _CRT_JIT_INTRINSIC tags, MS devs maybe thing, or in C++ mode on their projects, the perf bug in this ticket is inline/optimized away on C++ code or MS in-house or compliant UWP code, but WinPerl is doing something wrong/different, and WinPerl at CC time hit a extern "C" linker symbol which stop "constant folding".

Also WinPerl use -O1, not -O2 (TODO this really needs to be fixed since 1996, fix is NOT adding -O2 , but a carefully decided '-O1 -Gw -Gy -Oi -Os -Ob******. Carefully picking the -Ob***** is why I haven't written such a PR yet. The wrong '-Ob', or plain '-O2', will inline ALL staticdecl-ed functions in Perl 5 in every.o. Including all the Perl_debug_log` and ice cold panic/debugging/assert C code we don't care about perf wise.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 13, 2025

We (probably @khwilliamson ) could change perl to use _create_locale() and the _is*_l() functions instead of the C standard functions, though this would be yet more Win32 specific code.

Unfortunately _create_locale() doesn't match the behaviour of POSIX newlocale, since you can't modify an existing locale object to mix locales the way you can with newlocale().

Do you have www links to the posix or msdn or p5p GH git repo APIs ?

Follow Sarathy's principles for WinPerl, "POSIX" can't and never will exist inside WinPerl. Either P5P fakes it, or MS CRT fakes it. All of "libc" features/vm state are fictional and only exist inside 1 winperl process address space. There is no interop/ipc/IO communication with other procs using POSIX [tokens] APIs. Either P5P or the CRT, always converted the posix-ese into Kernel32.dll-exe.

I'm wondering if WinPerl do setlocale by calling K32/NLS APIs directly, since UCRT refuses to use its UCRT locale cache (P5P bug!), and keeps calling out to K32 all the time. So if WinPerl does a setlocale with raw MS NLS API/K32 API, it will automatically appear inside ucrtbase.dll state, and perhaps ucrtbase.dll's locale cache WILL WORK if it data from K32/NTDLL state and not end user libc UCRT public API state. K32/NLS/NT kernel/ring 0 support for locales is unusually high in amateur opinion. ntdll.dll has

NtQueryDefaultLocale
NtSetDefaultLocale
RtlLocaleNameToLcid
RtlGetLocaleFileMappingAddress
RtlpQueryDefaultUILanguage
NtQueryDefaultUILanguage
NtQueryInstallUILanguage
RtlGetProcessPreferredUILanguages
RtlGetSystemPreferredUILanguages
RtlGetThreadPreferredUILanguages
RtlGetUserPreferredUILanguages
RtlpVerifyAndCommitUILanguageSettings

So that looks to me like WinOS and the concept of Locales run very deep into Ring 0 in Windows. Basically MS designed it so the text debugging log of a kernel sound card driver, will chance the thousands separator character in a sys admin debug log file, within 1000 microseconds/1 millisecond/1 video frame, after OS wide RtlTheRealSetLocale() executes from user mode.

Maybe UCRT needs compliance with this, but Perl and PP state doesn't. Due to lack of knowledge IDK what Perl is trying to exactly fix in WinPerl, and who the consumer/end user is of the fix. And is the "fix" for actual observed defects in Perl end user production code, or is the 5.36/5.38/5.40 locale safety/serialization code, trying to fix a fuzz testing/spectre race condition khw discovered in Windows?

My generic instinct say, since APIs like toupper() no _l() and setlocale are inter-thread proc wide nasty static buffer APIs like localetime or asctime, MS devs don't perf test or pay src code attention to C functions they aren't allowed to use/commit code with in-house, and automation blocks those unsafe C functions (think SDL policy). I'll guess the only calls to toupper() no _l() and setlocale in MS in-house code are unit test files, that a manager sign off on to whitelist that mention to toupper() b/c its inside a unit test.

Thats the reason I keep thinking touppper_l(obj) is the way out, b/c thats how all MS devs have to write their in house code, so toupper_l(obj) gets benchmarked/regresssion-ed at MS, toupper() and toupper_l(NULL) dont.

To behave the same we'd need to keep separate locale objects for each category, and that has problems for functions that work with more than one locale category (strftime at least), though I believe such mixing is usually a bad idea.
But, even if isupper() is 133x slower, how much of an effect does that have on real code?

I caught this in a profiler to attached to perl.exe when doing various .t'es from perl core make test. And sometimes on personal/cpan PP code making the same slow backtraces and then identified by the profiler. This isn't trying to optimize a die() or croak() branch.

It might be worth benchmarking and profiling related code (regexps?) to see.

I'd have to recomp perl.dll with a bunch of __debugbreak()s and spend alot of time to figure out what PP or CPAN XS code caused crawling slow UCRT toupper() to be used instead of P5P fast native ToUPPER(). But locale APIs I barely know anything about so im least qualified to say what is right/wrong/bug/less than perfect/perfect in that topic.

There might be a P5P bug, that somehow PP or CPAN XS or p5p C, is passing locale "en-US; English; United States; NY; New York; 6 Empire Plaza; Suite 715;" which is "legal"' but a denormal local setting, that needs to be parsed/normalized in a loop O(n) against the NLS/Registry master disk data each time to normalize it to "en-US; English; United States;". IDK enough.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 13, 2025

other things I forgot,

UCRT's _strlwr() calls __acrt_locale_changed()/LocaleUpdate::_LocaleUpdate only ONCE. not O(n) or O(n*15). If perl used that and passed a proper longish char* instead of tolower, problem gone.

2nd, this is a comment from the CRT src code. I can't copy paste the whole thing here, but this comment is really important.

Note that we don't need _LocaleUpdate in this function. The main reason being, that this is a leaf function in locale usage terms.

leaf function in locale usage terms

My reading of that means its P5P's bug and P5P's defect for failing to

LOC_T* lobj = get_globale_OS_proc_locale_obj();
for(i=0;i<end;i++} {
    arr[i]=toupper_l(arr[i], lobj);
}
free_loc_obj(lobj);

or in PerlXS API p5p is doing on a tied() TIESCALAR

SV* sv = ST(0);
for(i=0;i<SvCUR(sv);i++} {
    SvGETMAGIC(sv);
    arr = SvPV_force_nolen(sv);
    arr[i]=  arr[i]  >= 'a' && arr[i]  <= 'z' ? arr[i] - ('a' - 'A') : arr[i] ;
    SvSETMAGIC(sv);
}

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 15, 2025

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale?view=msvc-170

I'd have to benchmark it to prove it, but I have a suspicion all of UCRT's internal cache code regarding parsing locale wide or ascii string names and its ctype table cache, needs configthreadlocale() to be called to manipulate UCRT's private TLS struct and a UCRT C global var called __globallocalestatus, that ultimately controls calling sync_legacy_variables_lk, but IDK what sync_legacy_variables_lk becomes at runtime, since it is inlined away (but not constant folded), and doesn't appear in MS's .pdb file as a string, so it can't be seen in a MSVC call stack. I think sync_legacy_variables_lk optimizes to 4 useless calls in a row to a symbol called __crt_state_management::get_current_state_index() which is a TlsGetValue/FlsGetValue getter. The retval of __crt_state_management::get_current_state_index() which is a pointer (reg RAX) is never checked or used by the 2 callers, which are setlocale and wsetlocale.

If UCRT's internal TLS structs don't "match", the ultimate source of truth for POSIX locales/is_/to_ after all the bloat and very badly written C++ classes, UCRT will be getting the truth for is_/to_ from kernel32's GetStringTypeW https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-getstringtypew#remarks and CT_CTYPE1/C1_UPPER/C1_LOWER, indirectly through GetStringTypeA https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getstringtypea and obviously GetStringTypeA isn't a real function since this is WinNT v6-v10, not Win95, and all the *A functions are autogenerated C++ templates inside kernel32/kernelbase, and rarely autogenerated inside the UCRT (UWP app store prohibits linking to *A symbols or something like that).

Since the concepts of Latin-1/7 bit ASCII/mbcs/8 bit characters, don't exist on Windows OS, except through the opaque casting function called MultiByteToWideChar, the whitespace/upper/lower/digit status of a IEEE 754 double 7 bit ASCII character can't be determined except by casting the IEEE 754 double 7 bit ASCII character to a Unicode character with MultiByteToWideChar.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 27, 2025

similar src code problem in mingw 8.3.0/Strawberry 5.32's bundled gcc

https://github.com/mingw-w64/mingw-w64/blob/90da6c6535c503159e952dc8b44f8778bed6f622/mingw-w64-crt/misc/mbrtowc.c#L4

Reading Mingw's "ASCII to Wide" code, Im seeing the same O(n^2)-ness that UCRT is doing.

Implementing each byte of the for loop as a dozen calls into kernel32.dll which is a dozen TLS calls best cast, worst case 100s of TLS calls and 100s of files that need to be decompressed and parsed.

IsDBCSLeadByteEx() calls NlsValidateLocale() https://github.com/wine-mirror/wine/blob/8d40da7ffda5e8dde9200f733e7d2cebf0196bc3/dlls/kernelbase/locale.c#L723 and NlsValidateLocale() sets off a chain of decompressing and parsing a couple 100/1000 locales on disk/on a mmap

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f

100s/1000s of permutations in there

update:

the IsDBCSLeadByteEx() byte by byte, algorithm seems to be copy paste from SO https://stackoverflow.com/a/27196334 and is probably repeated over and over by college students or young on certain unreliable tech social media sites, known for anonymous users, trying to ChatGPT their way to the next account ribbon/flair.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 27, 2025

python/cpython#79376

https://bugs.python.org/issue35195

In 2018 Python identified this problem. Py ticket remains open ATM Feb 2025. IDK enough arch/API/design/tech info to understand all the comments in the cPy tickets if there is a proposed fix or reject fix or unfairly rejected fix in those 2 tickets.

maybe related to this ticket

https://bugs.python.org/issue7442

https://bugs.python.org/issue31900

https://vstinner.github.io/python3-locales-encodings.html general tech prose, probably not related to whatever im benchmarking

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants