When using cjk analyzer with `output_unigram`, an unigram before Japanese punctuation is not indexed #1724

kaakaa · 2022-08-24T05:31:03Z

Summary

When I index and search Japanese text using the cjk analyzer with the `output_unigram` config, the text immediately before a Japanese punctuation character (e.g. 、 or 。) cannot be found. For example, the search query ちは returns no results for the sentence こんにちは、世界, but it does return a result for the sentence こんにちは世界.

I looked into this and found that CJKBigramFilter does not correctly flush the unigram before punctuation. This PR fixes it.

Sample code to reproduce this issue is here:
https://gist.github.com/kaakaa/ca8c20821ef610b098851f487eb61ea5

abhinavdangeti · 2022-11-16T20:12:34Z

Thanks for this contribution @kaakaa .
Let's wait for at least one more review from the team.

fix: make the value of itemsInRing strict

fb70094

abhinavdangeti added this to the v2.3.6 milestone Nov 8, 2022

abhinavdangeti requested review from abhinavdangeti, metonymic-smokey, moshaad7 and Thejas-bhat November 8, 2022 23:13

abhinavdangeti approved these changes Nov 16, 2022

abhinavdangeti merged commit 5728b8a into blevesearch:master Dec 1, 2022

kaakaa deleted the fix-cjk branch December 16, 2022 14:15
