combine_lang_model does not print correct usage help #1375

Shreeshrii · 2018-03-12T11:23:05Z

Usage instructions are given in https://github.com/tesseract-ocr/tesseract/blob/master/training/combine_lang_model.cpp#L43-58

// Check validity of input flags.
 if (FLAGS_input_unicharset.empty() || FLAGS_script_dir.empty() ||
     FLAGS_output_dir.empty() || FLAGS_lang.empty()) {
   tprintf("Usage: %s --input_unicharset filename --script_dir dirname\n",
           argv[0]);
   tprintf("  --output_dir rootdir --lang lang [--lang_is_rtl]\n");
   tprintf("  [--words file --puncs file --numbers file]\n");
   tprintf("Sets properties on the input unicharset file, and writes:\n");
   tprintf("rootdir/lang/lang.charset_size=ddd.txt\n");
   tprintf("rootdir/lang/lang.traineddata\n");
   tprintf("rootdir/lang/lang.unicharset\n");
   tprintf("If the 3 word lists are provided, the dawgs are also added to");
   tprintf(" the traineddata file.\n");
   tprintf("The output unicharset and charset_size files are just for human");
   tprintf(" readability.\n");

However, the actual info displayed is

USAGE: combine_lang_model
  --lang_is_rtl  True if lang being processed is written right-to-left  (type:bool default:false)
  --pass_through_recoder  If true, the recoder is a simple pass-through of the unicharset. Otherwise, potentially a compression of it  (type:bool default:false)
  --input_unicharset  Unicharset to complete and use in encoding  (type:string default:)
  --script_dir  Directory name for input script unicharsets  (type:string default:)
  --words  File listing words to use for the system dictionary  (type:string default:)
  --puncs  File listing punctuation patterns  (type:string default:)
  --numbers  File listing number patterns  (type:string default:)
  --output_dir  Root directory for output files  (type:string default:)
  --version_str  Version string to add to traineddata file  (type:string default:)
  --lang  Name of language being processed  (type:string default:)

So it looks like the program is calling a common training argument parser and exiting.

https://github.com/tesseract-ocr/tesseract/blob/master/training/combine_lang_model.cpp#L40

int main(int argc, char** argv) {
  tesseract::ParseCommandLineFlags(argv[0], &argc, &argv, true);
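To make the suspected control flow concrete, here is a minimal, self-contained sketch. It is not Tesseract code: `HelpShown`, `SharedParserExitsWithHelp`, `RequiredFlagsMissing`, and `WhichHelp` are hypothetical names, and the assumption that the shared parser prints its generic usage and exits when it sees `--help` is mine, inferred from the output shown above. The required flag names are taken from the validity check quoted earlier.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the control flow discussed in this issue.
// None of these names exist in Tesseract.
enum class HelpShown { kGeneric, kCustom, kNone };

// Stand-in for a shared flag parser. ASSUMPTION: it prints the generic,
// auto-generated usage and exits as soon as it sees "--help".
bool SharedParserExitsWithHelp(const std::vector<std::string>& args) {
  for (const std::string& arg : args) {
    if (arg == "--help") return true;
  }
  return false;
}

// Stand-in for the tool's own validity check on its required flags.
// Simplification: only the "--name value" form is recognised here.
bool RequiredFlagsMissing(const std::vector<std::string>& args) {
  const char* const kRequired[] = {"--input_unicharset", "--script_dir",
                                   "--output_dir", "--lang"};
  for (const char* name : kRequired) {
    bool found = false;
    for (const std::string& arg : args) {
      if (arg == name) found = true;
    }
    if (!found) return true;
  }
  return false;
}

// The shared parser runs first, so the custom usage text can only be
// reached when the shared parser did not already exit.
HelpShown WhichHelp(const std::vector<std::string>& args) {
  if (SharedParserExitsWithHelp(args)) return HelpShown::kGeneric;
  if (RequiredFlagsMissing(args)) return HelpShown::kCustom;
  return HelpShown::kNone;
}
```

Under these assumptions, `WhichHelp({"--help"})` yields `HelpShown::kGeneric`, while an empty argument list yields `HelpShown::kCustom`, so the two usage texts are reached through different paths.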

Related: #1297


zdenop · 2018-10-01T19:14:43Z

@Shreeshrii: if you read it carefully, you will see that it prints "almost" the same information, just in a different order. The only additional information (not relevant to running the command) is:

Sets properties on the input unicharset file, and writes:
rootdir/lang/lang.charset_size=ddd.txt
rootdir/lang/lang.traineddata
rootdir/lang/lang.unicharset
If the 3 word lists are provided, the dawgs are also added to the traineddata file.
The output unicharset and charset_size files are just for human readability.

zdenop · 2018-10-01T19:25:43Z

I removed the duplicate help. Please check that everything works as expected.

Shreeshrii · 2018-10-02T00:59:10Z

Thanks!

Shreeshrii changed the title ~~combine_lang_model does not print the custom usage info~~ combine_lang_model does not print correct usage help Mar 12, 2018

Shreeshrii mentioned this issue Apr 30, 2018

RFC: Tesseract 4.0.0 – open tasks #1423

Closed

Shreeshrii closed this as completed Oct 2, 2018
