Tesseract hangs in std::regex when loading in UCRT with native UTF-8 locale on Windows #3830
Comments
Can you attach a debugger to the running process and get a stack trace which shows the code location or function(s) which run in a loop? |
It seems to happen in the tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
int err = api->Init(path, lang, tesseract::OEM_DEFAULT, configs, confpaths.length(), &params, &values, false); |
That sounds like an endless recursion which should be visible in a stack trace. |
Does a |
I think you need to declare this in the "manifest": https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page I'm not sure if it is possible to set a single process to UTF-8 with an environment variable. @kalibera probably knows. |
That depends on what you want to do. Via the manifest, you can change the ACP, as done by R front-ends. Without that, the ACP would be the Windows default (e.g. Latin-1, but definitely not UTF-8, unless you've changed that system-wide in your Windows installation). Then there is the encoding of the C runtime. You can change that alone with setlocale() without changing the ACP, but it is not a good idea in principle, because some calls will then use one encoding and some the other, and it is hard to keep track of which is which; it is essentially impossible if you have contributed code. When you change the ACP via the manifest file, as R does, it will typically end up also changing the C runtime encoding (depending on how your application works and how it calls setlocale). |
So it fails in this line:
|
I'd like to reproduce the issue. Do you build natively on Windows, or is there also a cross compiler for Linux which can target UCRT? I usually build on Linux, but Debian only has i686-w64-mingw32-gcc and x86_64-w64-mingw32-gcc. |
I tried to reproduce the issue with this small test code:
My manifest file looks like this:
That test program works fine, but I had a newer compiler (x86_64-w64-mingw32-g++ 12.1.0 from package mingw-w64-ucrt-x86_64-gcc). |
I think you should initialize the C runtime locale using setlocale(LC_ALL, "") to match the ACP. You should also print the current locale and GetACP() to verify it is set as intended. You need recent Windows 10 for this to work, otherwise the manifest part will be ignored. The R initialization re locale is in https://svn.r-project.org/R/trunk/src/main/main.c. Yes, there is also a cross-compiler which you can run on Linux, in addition to a native compiler, for use with R. See https://cran.r-project.org/bin/windows/base/howto-R-devel.html. Normally external libraries are cross-compiled, but R and R packages are compiled natively. Details are in that document. |
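The checks suggested above could be sketched roughly as follows. This is not the test program from this thread (which is not reproduced here). The function names `init_and_report_locale`, `active_code_page` and `time_regex_ms` are made up for illustration, and the pattern is left to the caller because the expression Tesseract used is not quoted in this issue.

```cpp
#include <chrono>
#include <clocale>
#include <regex>
#include <string>

#ifdef _WIN32
#include <windows.h>  // GetACP()
#endif

// Let the C runtime adopt the locale of the environment and return its name.
std::string init_and_report_locale() {
  const char *name = std::setlocale(LC_ALL, "");
  return name ? name : "(setlocale failed)";
}

// Active code page of the process; 65001 means UTF-8.
// Returns 0 when not built for Windows, where GetACP() does not exist.
unsigned active_code_page() {
#ifdef _WIN32
  return GetACP();
#else
  return 0;
#endif
}

// Time the construction and one use of a std::regex, in milliseconds.
// `pattern` and `text` are placeholders chosen by the caller.
long long time_regex_ms(const std::string &pattern, const std::string &text,
                        bool &matched) {
  const auto start = std::chrono::steady_clock::now();
  std::regex re(pattern);
  matched = std::regex_search(text, re);
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start)
      .count();
}
```

Calling `init_and_report_locale()` first and then printing `active_code_page()` should show whether the manifest took effect. A very large value from `time_regex_ms` afterwards would point at the slowdown discussed here.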
GetACP() returns 1252, so it still does not use UTF-8. But after adding |
Although this is more an issue of UCRT (where I see little chance of getting it fixed soon), it affects Tesseract. As it is easy to replace the |
On Windows with UCRT and a UTF-8 locale std::regex takes a lot of time (several minutes!). Replacing it avoids that bottleneck. Signed-off-by: Stefan Weil <[email protected]>
The issue should be fixed now in git master, see commit 64bcdce. |
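For readers wondering what replacing `std::regex` with plain string functions can look like, here is a minimal sketch, assuming the regular expression only performed a literal search-and-replace. The helper `replace_all` is hypothetical and is not taken from commit 64bcdce.

```cpp
#include <string>

// Replace every occurrence of the literal `from` in `text` with `to`,
// using only std::string member functions (no std::regex involved).
std::string replace_all(std::string text, const std::string &from,
                        const std::string &to) {
  if (from.empty()) {
    return text;  // an empty search string would never advance
  }
  std::string::size_type pos = 0;
  while ((pos = text.find(from, pos)) != std::string::npos) {
    text.replace(pos, from.size(), to);
    pos += to.size();  // continue after the text just inserted
  }
  return text;
}
```

A helper like this only covers literal patterns. Anything that relied on character classes or anchors would need its own hand-written check.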
thank you! |
Environment
Current Behavior:
I maintain tesseract bindings for the R programming language.
The R project recently switched to UCRT compilers, and defaults to running in a native UTF-8 locale on supported versions of Windows (Windows Server 2022, or Windows 10 version 1903 and up). See details.
Users have reported that the tesseract bindings hang and/or crash when initialized on Windows in a process running in a UTF-8 locale. I have indeed observed on one Windows 11 VM that the process goes to 100% CPU and maxes out RAM on load.
The problem seems similar to: #2420