Tesseract hangs in std::regex when loading in UCRT with native UTF-8 locale on Windows #3830

Closed
jeroen opened this issue May 27, 2022 · 16 comments · Fixed by r-windows/rtools-packages#257

Comments

Contributor

jeroen commented May 27, 2022

Environment

  • Tesseract Version: 5.1.0
  • Platform: Windows 10 or 11, running in native UTF-8

Current Behavior:

I maintain tesseract bindings for the R programming language.

The R project recently switched to UCRT compilers, and defaults to running in a native UTF-8 locale on supported versions of Windows (Windows Server 2022, or Windows 10 version 1903 and up). See details.

Users have reported that the tesseract bindings hang and/or crash when initialized on Windows in a process running in a UTF-8 locale. I have indeed observed on one Windows 11 VM that the process goes to 100% CPU and maxes out RAM on load.

The problem seems similar to: #2420

Member

stweil commented May 27, 2022

Can you attach a debugger to the running process and get a stack trace which shows the code location or function(s) which run in a loop?

Contributor Author

jeroen commented May 27, 2022

At least in my bindings, it seems to happen in the Init() call below, but that doesn't tell us much. I'm going to try to build a version of tesseract with debug symbols.

  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  int err = api->Init(path, lang, tesseract::OEM_DEFAULT, configs, confpaths.length(), &params, &values, false);


Member

stweil commented May 27, 2022

maxes out RAM on load.

That sounds like an endless recursion which should be visible in a stack trace.

Member

stweil commented May 27, 2022

Does a tesseract.exe which was built with UCRT work fine?
How can I get a "native UTF-8 locale" for local tests on Windows 10? Is it possible and sufficient to set LANG=en_US.UTF-8 (or similar) in a shell like on Linux?

Contributor Author

jeroen commented May 27, 2022

I think you need to declare this in the "manifest": https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

I'm not sure if it is possible to set a single process to UTF-8 with an environment variable. @kalibera probably knows.

@kalibera

That depends on what you want to do. Via the manifest, you can change the ACP, as done by R front-ends:
https://svn.r-project.org/R/trunk/src/gnuwin32/front-ends/Rscript64.exe.manifest

Without that, the ACP would be the Windows default (e.g. Latin-1, but definitely not UTF-8, unless you've changed that system-wide in your Windows installation).

Then there is the encoding of the C runtime. You can change that alone with setlocale() without changing the ACP, but it is not a good idea in principle, because some calls will then use one encoding and some the other, and it is hard to keep track of which is which. It is essentially impossible if you have contributed code.

When you change the ACP via the manifest file, as R does, it will typically end up changing also the C runtime (depending on how your application works, how it calls setlocale).

Contributor Author

jeroen commented May 28, 2022

#0  0x00007ffc4137e2ae in ucrtbase!memcmp () from C:\Windows\System32\ucrtbase.dll
#1  0x00007ffbe0ac8f3e in std::char_traits<char>::copy (__n=2147483647, __s2=0x2b345eff040 "", __s1=0x2b3c5f16040 "")
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/char_traits.h:409
#2  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__n=2147483647, __s=0x2b345eff040 "", __d=0x2b3c5f16040 "")
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/basic_string.h:351
#3  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate (this=this@entry=0xbd03fc6090, __pos=0, __len1=__len1@entry=0, 
    __s=0x2b345eff040 "", __len2=2147483647) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/basic_string.tcc:322
#4  0x00007ffbe0ac8d2c in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append (this=0xbd03fc6090, __s=<optimized out>, 
    __n=<optimized out>) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/basic_string.tcc:370
#5  0x00007ffbe0a32e5a in std::__cxx11::collate<char>::do_transform(char const*, char const*) const ()
   from C:\Users\User\AppData\Local\R\win-library\4.2\tesseract\libs\x64\tesseract.dll
#6  0x00007ffbe0a31d79 in std::__cxx11::collate<char>::transform (__hi=<optimized out>, __lo=<optimized out>, this=0x7ffbe0b09f40 <(anonymous namespace)::collate_c>)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/locale_classes.h:722
#7  std::__cxx11::regex_traits<char>::transform<char*> (__last=0x2b344d0da71 '«' <repeats 16 times>, "þîþîþîþîþîþîþîþ", 
    __first=0x2b344d0da70 "„", '«' <repeats 16 times>, "þîþîþîþîþîþîþîþ", this=0x2b34470b258) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex.h:230
#8  std::__cxx11::regex_traits<char>::transform_primary<char const*> (this=0x2b34470b258, __first=__first@entry=0xbd03fc6108 "„&ZC³\002", 
    __last=__last@entry=0xbd03fc6109 "&ZC³\002") at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex.h:261
#9  0x00007ffbe0b0148f in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_apply(char, std::integral_constant<bool, false>) const::{lambda()#1}::operator()() const (this=0xbd03fc6260) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:628
#10 0x00007ffbe0a5fd8b in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_apply (this=this@entry=0xbd03fc6260, __ch=__ch@entry=-124 '„')
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:636
#11 0x00007ffbe0add55b in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_make_cache (this=0xbd03fc6260)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.h:533
#12 std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_ready (this=this@entry=0xbd03fc6260)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.h:504
#13 0x00007ffbe0ae5418 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_insert_bracket_matcher<false, false> (this=this@entry=0xbd03fc6800, 
    __neg=__neg@entry=true) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:446
#14 0x00007ffbe0ae4f2e in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_bracket_expression (this=this@entry=0xbd03fc6800)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:365
#15 0x00007ffbe0ae88bc in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom (this=this@entry=0xbd03fc6800)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:351
#16 0x00007ffbe0ae88f5 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_term (this=this@entry=0xbd03fc6800)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:141
#17 0x00007ffbe0ae2dbe in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative (this=this@entry=0xbd03fc6800)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:123
#18 0x00007ffbe0ae2e30 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative (this=this@entry=0xbd03fc6800)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/alloc_traits.h:527
#19 0x00007ffbe0ae2e30 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative (this=this@entry=0xbd03fc6800)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/alloc_traits.h:527
#20 0x00007ffbe0ae3037 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction (this=this@entry=0xbd03fc6800)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:99
#21 0x00007ffbe0ae8bb3 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler (this=this@entry=0xbd03fc6800, 
    __b=__b@entry=0x7ffbe0b1329f <tesseract::ASSERT_FAILED+2495> "(.*)/[^/]*", __e=__e@entry=0x7ffbe0b132a9 <tesseract::ASSERT_FAILED+2505> "", __loc=..., 
    __flags=(unknown: 0x10)) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.tcc:84
#22 0x00007ffbe0576259 in std::__detail::__compile_nfa<std::__cxx11::regex_traits<char>, char const*> (__flags=<optimized out>, __loc=..., 
    __last=0x7ffbe0b132a9 <tesseract::ASSERT_FAILED+2505> "", __first=0x7ffbe0b1329f <tesseract::ASSERT_FAILED+2495> "(.*)/[^/]*")
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex_compiler.h:183
#23 std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> >::basic_regex<char const*> (__f=(unknown: 0x10), __loc=..., 
    __last=0x7ffbe0b132a9 <tesseract::ASSERT_FAILED+2505> "", __first=0x7ffbe0b1329f <tesseract::ASSERT_FAILED+2495> "(.*)/[^/]*", this=0xbd03fc69d0)
    at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex.h:764
#24 std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> >::basic_regex<char const*> (__f=(unknown: 0x10), __last=0x7ffbe0b132a9 <tesseract::ASSERT_FAILED+2505> "", __first=0x7ffbe0b1329f <tesseract::ASSERT_FAILED+2495> "(.*)/[^/]*", this=0xbd03fc69d0) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex.h:507
#25 std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> >::basic_regex (__f=(unknown: 0x10), __p=0x7ffbe0b1329f <tesseract::ASSERT_FAILED+2495> "(.*)/[^/]*", this=0xbd03fc69d0) at D:/a/_temp/msys64/ucrt64/include/c++/10.3.0/bits/regex.h:440
#26 tesseract::Tesseract::ParseLanguageString (this=this@entry=0x2b3446787e0, lang_str=..., to_load=to_load@entry=0xbd03fc6b20, not_to_load=not_to_load@entry=0xbd03fc6b00) at ../tesseract-5.1.0/src/ccmain/tessedit.cpp:251
#27 0x00007ffbe0576b2a in tesseract::Tesseract::init_tesseract (this=this@entry=0x2b3446787e0, arg0=..., textbase=..., language=..., oem=oem@entry=tesseract::OEM_DEFAULT, configs=configs@entry=0x0, configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, set_only_non_debug_params=set_only_non_debug_params@entry=false, mgr=mgr@entry=0xbd03fc6c50) at ../tesseract-5.1.0/src/ccmain/tessedit.cpp:298
#28 0x00007ffbe053f474 in tesseract::TessBaseAPI::Init (this=this@entry=0x2b343372310, data=<optimized out>, data@entry=0x0, data_size=data_size@entry=0, language=language@entry=0x2b344374088 "eng", oem=oem@entry=tesseract::OEM_DEFAULT, configs=configs@entry=0x0, configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, set_only_non_debug_params=set_only_non_debug_params@entry=false, reader=reader@entry=0x0) at ../tesseract-5.1.0/src/api/baseapi.cpp:415
#29 0x00007ffbe053f718 in tesseract::TessBaseAPI::Init (this=this@entry=0x2b343372310, datapath=datapath@entry=0x0, language=language@entry=0x2b344374088 "eng", oem=oem@entry=tesseract::OEM_DEFAULT, configs=configs@entry=0x0, configs_size=configs_size@entry=0, vars_vec=vars_vec@entry=0x0, vars_values=vars_values@entry=0x0, set_only_non_debug_params=set_only_non_debug_params@entry=false) at ../tesseract-5.1.0/src/api/baseapi.cpp:371
#30 0x00007ffbe0539d3a in tesseract::TessBaseAPI::Init (oem=tesseract::OEM_DEFAULT, language=0x2b344374088 "eng", datapath=0x0, this=0x2b343372310) at C:/RBuildTools/4.0/ucrt64/include/tesseract/baseapi.h:212
#31 tesseract_engine_internal (datapath=..., language=..., confpaths=..., opt_names=..., opt_values=...) at tesseract.cpp:74
#32 0x00007ffbe0532228 in _tesseract_tesseract_engine_internal (datapathSEXP=0x2b34492f770, languageSEXP=0xbd03fc7160, confpathsSEXP=0xbd03fc7180, opt_namesSEXP=0xbd03fc71a0, opt_valuesSEXP=0x2b344c92260) at RcppExports.cpp:35
#33 0x00007ffbf6056cf3 in Rf_NewFrameConfirm () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#34 0x00007ffbf605744d in Rf_NewFrameConfirm () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#35 0x00007ffbf609ba84 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#36 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#37 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#38 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#39 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#40 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#41 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#42 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#43 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#44 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#45 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#46 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#47 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#48 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#49 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#50 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#51 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#52 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#53 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#54 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#55 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#56 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#57 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#58 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#59 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#60 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#61 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#62 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#63 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#64 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#65 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#66 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#67 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#68 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#69 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#70 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#71 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#72 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#73 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#74 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#75 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#76 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#77 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#78 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#79 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#80 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#81 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#82 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#83 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#84 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#85 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#86 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#87 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#88 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#89 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#90 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#91 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#92 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#93 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#94 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#95 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#96 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#97 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#98 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#99 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#100 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#101 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#102 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#103 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#104 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#105 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#106 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#107 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#108 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#109 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#110 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#111 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#112 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#113 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#114 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#115 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#116 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#117 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#118 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#119 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#120 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#121 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#122 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#123 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#124 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#125 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#126 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#127 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#128 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#129 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#130 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#131 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#132 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#133 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#134 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#135 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#136 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#137 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#138 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#139 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#140 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#141 0x00007ffbf60b6f5b in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#142 0x00007ffbf60b7470 in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#143 0x00007ffbf60a3cc4 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#144 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#145 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#146 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#147 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#148 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#149 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#150 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#151 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#152 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#153 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#154 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#155 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#156 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#157 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#158 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#159 0x00007ffbf60a9144 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#160 0x00007ffbf60b4e71 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#161 0x00007ffbf60bb6ec in R_cmpfun1 () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#162 0x00007ffbf60bcafa in Rf_applyClosure () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#163 0x00007ffbf60b4fb3 in R_initAssignSymbols () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#164 0x00007ffbf60e357d in Rf_ReplIteration () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#165 0x00007ffbf60e3938 in Rf_ReplIteration () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#166 0x00007ffbf60e39d2 in run_Rmainloop () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#167 0x00007ffbf60e3a6e in Rf_mainloop () from C:\Program Files\R\R-4.2.0\bin\x64\R.dll
#168 0x00007ff6673f171c in ?? ()
#169 0x00007ff6673f1567 in ?? ()
#170 0x00007ff6673f13c1 in ?? ()
#171 0x00007ff6673f14f6 in ?? ()
#172 0x00007ffc435054e0 in KERNEL32!BaseThreadInitThunk () from C:\Windows\System32\kernel32.dll
#173 0x00007ffc439a485b in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#174 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Member

stweil commented May 28, 2022

So it fails in this line:

std::regex e("(.*)/[^/]*");

stweil changed the title from "Tesseract hangs on Windows when loading in UCRT with native UTF-8 locale" to "Tesseract hangs in std::regex when loading in UCRT with native UTF-8 locale on Windows" on May 28, 2022
Member

stweil commented May 28, 2022

I'd like to reproduce the issue. Do you build natively on Windows, or is there also a cross compiler for Linux which can target UCRT? I usually build on Linux, but Debian only has i686-w64-mingw32-gcc and x86_64-w64-mingw32-gcc.

Member

stweil commented May 28, 2022

I tried to reproduce the issue with this small test code:

#include <iostream>
#include <regex>
#include <string>

int main(int argc, char *argv[]) {
  std::string lang = argc < 2 ? "test" : argv[1];
  std::regex e("(.*)/[^/]*");
  std::cmatch cm;
  std::string prefix;
  if (std::regex_match(lang.c_str(), cm, e, std::regex_constants::match_default)) {
    // A prefix was found.
    prefix = cm[1].str() + "/";
    std::cout << "prefix = " << prefix << std::endl;
  }
  size_t found = lang.find_last_of('/');
  if (found != std::string::npos) {
    prefix = lang.substr(0, found) + "/";
    std::cout << "prefix = " << prefix << std::endl;
  }
  return 0;
}

My manifest file looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="a" version="6.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

That test program works fine, but I had a newer compiler (x86_64-w64-mingw32-g++ 12.1.0 from package mingw-w64-ucrt-x86_64-gcc).

@kalibera

I think you should initialize the C runtime locale using setlocale(LC_ALL, "") to match the ACP. You should also print the current locale and GetACP() to verify that they are set as intended. You need a recent Windows 10 for this to work; otherwise the manifest part will be ignored. The R locale initialization is in https://svn.r-project.org/R/trunk/src/main/main.c.

Yes, there is also a cross-compiler which you can run on Linux, in addition to a native compiler, for use with R. See https://cran.r-project.org/bin/windows/base/howto-R-devel.html. Normally external libraries are cross-compiled, but R and R packages are compiled natively. Details are in that document.

Member

stweil commented May 29, 2022

GetACP() returns 1252, so it still does not use UTF-8.

But after adding setlocale(LC_ALL, ".UTF8"), I now also get a "hanging" program: it takes about 2 minutes to finish successfully. During those 2 minutes the memory usage increases and decreases 9 times by about 3 or 4 GB (like a sawtooth). This also happens without the manifest.

Member

stweil commented May 29, 2022

Although this is more of a UCRT issue (where I see little chance of getting it fixed soon), it affects Tesseract. As it is easy to replace the std::regex code in Tesseract, I think doing that is the best fix for this issue.

stweil added a commit that referenced this issue May 29, 2022
On Windows with UCRT and a UTF-8 locale std::regex takes a lot of time
(several minutes!). Replacing it avoids that bottleneck.

Signed-off-by: Stefan Weil <[email protected]>
Member

stweil commented May 29, 2022

The issue should be fixed now in git master, see commit 64bcdce.

Contributor Author

jeroen commented May 29, 2022

thank you!
