Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiled-in TESSDATA_PREFIX unused on Windows #3767

Closed
CSBVision opened this issue Mar 11, 2022 · 11 comments · Fixed by #3768
Closed

Compiled-in TESSDATA_PREFIX unused on Windows #3767

CSBVision opened this issue Mar 11, 2022 · 11 comments · Fixed by #3768
Labels

Comments

@CSBVision
Copy link
Contributor

Hello!

We tried to compile Tesseract including a compiled-in TESSDATA_PREFIX using a respective preprocessor definition.
However, compiling ccutil.cpp on Windows with defined _WIN32 always causes the code

#if defined(_WIN32)
  else if (datadir.empty() || _access(datadir.c_str(), 0) != 0)
  {
    ...
  }
#endif /* _WIN32 */

to be executed and, if datadir is currently empty, it will be initialized as <Path to binary>/tessdata.
Thus, the code that would initialize datadir according to the compile-time constant TESSDATA_PREFIX is not executed. Most likely on Unix systems this bug is not present because here, _WIN32 is not defined.


Environment

  • Tesseract Version: 5.1.0
  • Commit Number: released version
  • Platform: MSYS_NT-10.0-19043 64-bit

Current Behavior:

Compiling Tesseract with defined TESSDATA_PREFIX=C:\Path\to\somewhere and starting tesseract.exe with an attached debugger as well as only the supplied command line argument --list-langs does not find Tesseract's language files, even if they exist in a folder 'tessdata' in the respective compiled-in directory.

Expected Behavior:

As long as neither an other data directory is explicitly given nor does the directory tessdata next to the executing binary exist but the compiled-in TESSDATA_PREFIX does, it should be used.

Suggested Fix:

In my understanding of the code, the tessdata directory should be resolved according to the following rules:

  1. Explicitly set by argv0 overrules everything.
  2. Thereafter, the TESSDATA_PREFIX environment variable is used.
  3. Next, a tessdata directory next to the executing binary is used.
  4. As the last possible solution, the compiled-in TESSDATA_PREFIX should be checked.
  5. Return an initialization error if all four alternatives fail (redirect to ./).

I think the easiest option is to move alternative 3 into a function that on Windows does the same as the #if defined(_WIN32) part while on Unix, it does a similar check (even though this is not currently implemented). If the tessdata directory next to the binary exists, it is used and saved in the datadir variable and the function returns true. Then, the second else if simply calls this function and if it returns false, the additional else part used the compiled-in TESSDATA_PREFIX if this is set.

@zdenop
Copy link
Contributor

zdenop commented Mar 11, 2022

Initialization of datadir as <Path to binary>/tessdata on Windows is not the bug, but the feature requested by windows users/ developer that prefer to have have everything in one place/folder.

@CSBVision
Copy link
Contributor Author

If there is a <Path to binary>/tessdata directory, this should overrule the compiled-in TESSDATA_PREFIX, at least I understand the code in that way. And I think, this is a good solution.
But the bug is that this always overrules the compiled-in TESSDATA_PREFIX directory, even if no <Path to binary>/tessdata directory exists and the compiled-in TESSDATA_PREFIX is not used, even though it exists and contains the required files.

@stweil
Copy link
Member

stweil commented Mar 11, 2022

Your are right. This is easy to fix. Do you want to send a pull request which does that?

@stweil stweil added the bug label Mar 11, 2022
@CSBVision
Copy link
Contributor Author

Yes no problem, I will prepare a fix.

CSBVision added a commit to CSBVision/tesseract that referenced this issue Mar 11, 2022
stweil added a commit that referenced this issue Mar 11, 2022
Fixes #3767.

Co-authored-by: Stefan Weil <[email protected]>
@zdenop
Copy link
Contributor

zdenop commented Mar 11, 2022

One again: it is not but but intentional design.
While compiled-in TESSDATA_PREFIX make sense on linux like system (e.g. system packages), it has not sense on windows (can be installed anywhere).
This design we have for years and nobody complains...

@CSBVision
Copy link
Contributor Author

On any platform I can copy the data to any path and forward this to the library. On any platform, I can set the TESSDATA_PREFIX environment variable. So why should the compiled-in variant be Unix-only? The resources I found about the compiled-in TESSDATA_PREFIX stated nothing about Unix-only, therefore it seemed to be a bug to me. Actually, I think it's a good solution to offer a compile-time fallback, even on Windows. And for Unix users, nothing changes.

@stweil
Copy link
Member

stweil commented Mar 11, 2022

This design we have for years and nobody complains...

That's correct for 99 % of Windows users. But if someone wants to build a personal Tesseract which installs tessdata in a fixed location c:\tesseract\tessdata (or any other choice), we did not document that this was unsupported. And there is also no good reason for not supporting that, as long as the current behaviour also works and has higher priority.

@amitdo
Copy link
Collaborator

amitdo commented Mar 2, 2023

Current behavior:
https://github.com/amitdo/tesseract-datadir/

It seems the previous behavior was:
Compiled data dir had the top priority.

Am I right?

@amitdo
Copy link
Collaborator

amitdo commented Mar 2, 2023

'top priority': number 1 out of 5 options.

@CSBVision
Copy link
Contributor Author

Before our fix, the Windows-exclusive behavior to try to load the files from the <Path to binary>/tessdata directory caused the compiled-in TESSDATA_PREFIX to be ignored even though it exists and contains the configuration files while the <Path to binary>/tessdata directory does not exist, cf #3767 (comment).

Now, the following priorities are used:

  1. explicitly defined, e.g. by argv0
  2. TESSDATA_PREFIX environment variable
  3. tessdata directory in the executable path (Windows-only!)
  4. compiled-in TESSDATA_PREFIX

Before, option 4 was excluded just by checking option 3 and thus Windows-exclusive. Therefore, for non-Windows platforms the behavior is the same as without this fix.

@amitdo
Copy link
Collaborator

amitdo commented Mar 3, 2023

OK. After looking again, I now see that my analysis of the previous state was wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants