
[Support] Is there an English version of the docs? #274

Closed
CaptainDario opened this issue Jul 17, 2022 · 6 comments

Comments

@CaptainDario

Thank you for this great project!

I really like this project and would like to understand its capabilities better, so I am wondering: is there an English version of the docs available?

@ikawaha
Owner

ikawaha commented Jul 18, 2022

There is no English documentation available.

@CaptainDario
Author

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to set them in the example below.

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

Could you also tell me what the pros and cons of the different dictionaries are?

Thank you very much!

@CaptainDario
Author

CaptainDario commented Jul 18, 2022

OK, I figured the segmentation modes out myself.
I am using tokenizer.Analyze().
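
For anyone else looking for this, a minimal sketch of what that looks like (the mode constants tokenizer.Normal, tokenizer.Search, and tokenizer.Extended come with the kagome v2 tokenizer package):

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Analyze takes an explicit segmentation mode:
	// tokenizer.Normal, tokenizer.Search, or tokenizer.Extended.
	tokens := t.Analyze("関西国際空港", tokenizer.Search)
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}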

@KEINOS
Contributor

KEINOS commented Jul 26, 2022

@CaptainDario

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to set them in the example below.

As you may know, most East Asian texts are not written with spaces between words. The word "wakati" roughly means "to divide" in Japanese, so wakati splits running text into word tokens. Imagine the following:

  • Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

Tokenizer.Wakati() simply divides the text into space-separated words. It is typically used to create metadata for full-text search, e.g. FTS5 in SQLite3.

Tokenizer.Tokenize() is similar to Wakati(), but each resulting token carries more information, such as part of speech. It is mostly used for grammar analysis, text linting, and the like.
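
To make the difference concrete, here is a small sketch using the example input from above (Wakati() returns plain strings, while Tokenize() returns tokens that carry their features):

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Wakati: just the word boundaries, as a []string.
	fmt.Println(strings.Join(t.Wakati("すもももももももものうち"), " "))

	// Tokenize: the same boundaries, but each token also carries
	// features (part of speech, reading, etc.) for deeper analysis.
	for _, token := range t.Tokenize("すもももももももものうち") {
		fmt.Printf("%s\t%v\n", token.Surface, strings.Join(token.Features(), ","))
	}
}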

Could you also tell me what the pros and cons of the different dictionaries are?

To do the wakati step, a word dictionary is needed to recognize proper names, nouns, and so on.

The main difference between the dictionaries is simply the number of words they contain. The default built-in dictionary covers most of the important proper names, nouns, verbs, etc.

The "pro" of using a larger dictionary is therefore that it can separate words more accurately. Imagine the following:

  • mr.mcintoshandmr.mcnamara --> Mr. Mc into sh and Mr. Mc namara (smaller dictionary) or Mr. McIntosh and Mr. McNamara (larger dictionary)

And the "cons" are higher memory usage and slower speed. I hope this helps. 🤞

@CaptainDario
Author

@KEINOS Thank you very much!
Maybe this is obvious and one is expected to know it, but I think it would be nice to include something like your comment in the README.

@KEINOS
Contributor

KEINOS commented Aug 4, 2022

@CaptainDario Indeed. There is nothing better than better documentation!

@ikawaha, if the above explanation is OK, I would like to open a PR somewhere. Where should I write it? In the wiki, maybe?
