
[Support] Is there an English version of the docs? #274

Closed
CaptainDario opened this issue Jul 17, 2022 · 6 comments

Comments

@CaptainDario

Thank you for this great project!

I really like this project and would like to understand its capabilities better, so I am wondering: is there an English version of the docs available?

@ikawaha
Owner

ikawaha commented Jul 18, 2022

There is no English documentation available.

@CaptainDario
Author

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to set them in the example below.

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

Could you also tell me what the pros and cons of the different dictionaries are?

Thank you very much!

@CaptainDario
Author

CaptainDario commented Jul 18, 2022

OK, I figured the segmentation modes out myself.
I am using tokenizer.Analyze().
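
For anyone else looking for this, a minimal sketch of what that looks like (the mode constants tokenizer.Normal, tokenizer.Search, and tokenizer.Extended come with the kagome v2 tokenizer package):

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Analyze takes an explicit segmentation mode:
	// tokenizer.Normal, tokenizer.Search, or tokenizer.Extended.
	tokens := t.Analyze("関西国際空港", tokenizer.Search)
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}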

@KEINOS
Contributor

KEINOS commented Jul 26, 2022

@CaptainDario

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to set them in the example below.

As you may know, most East Asian texts are not written with spaces between words. The word "wakati" roughly means "to divide" in Japanese, so wakati splits running text into word tokens. Imagine the following:

  • Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

Tokenizer.Wakati() simply divides the text into space-separated words. It is typically used to create metadata for full-text search, e.g. FTS5 in SQLite3.

Tokenizer.Tokenize() is similar to Wakati(), but each resulting token carries more information, such as part of speech. It is mostly used for grammar analysis, text linting, and the like.
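
To make the difference concrete, here is a small sketch using the example input from above (Wakati() returns plain strings, while Tokenize() returns tokens that carry their features):

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Wakati: just the word boundaries, as a []string.
	fmt.Println(strings.Join(t.Wakati("すもももももももものうち"), " "))

	// Tokenize: the same boundaries, but each token also carries
	// features (part of speech, reading, etc.) for deeper analysis.
	for _, token := range t.Tokenize("すもももももももものうち") {
		fmt.Printf("%s\t%v\n", token.Surface, strings.Join(token.Features(), ","))
	}
}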

Could you also tell me what the pros and cons of the different dictionaries are?

To do the wakati step, a word dictionary is needed to recognize proper names, nouns, and so on.

The main difference between the dictionaries is simply the number of words they contain. The default built-in dictionary covers most of the important proper names, nouns, verbs, etc.

The "pro" of using a larger dictionary is therefore that it can separate words more accurately. Imagine the following:

  • mr.mcintoshandmr.mcnamara --> Mr. Mc into sh and Mr. Mc namara (smaller dictionary) or Mr. McIntosh and Mr. McNamara (larger dictionary)

And the "cons" are higher memory usage and slower speed. I hope this helps. 🤞

@CaptainDario
Author

@KEINOS Thank you very much!
Maybe this is obvious and one is expected to know it, but I think it would be nice to include something like your comment in the README.

@KEINOS
Contributor

KEINOS commented Aug 4, 2022

@CaptainDario Indeed. There is nothing better than better documentation!

@ikawaha, if the above explanation is OK, I would like to open a PR somewhere. Where should I write it? In the wiki, maybe?
