Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long expressions #319

Open
nicolasmaia opened this issue Jul 7, 2020 · 8 comments
Open

Long expressions #319

nicolasmaia opened this issue Jul 7, 2020 · 8 comments

Comments

@nicolasmaia
Copy link

For some reason, Rikai can't recognize the expression 先生と言われるほどの馬鹿でなし.

It's been in JMdict since last year, though: https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2070202

@birtles
Copy link
Member

birtles commented Jul 7, 2020

Actually, ignore my previous comment. We have a hard-coded limit for how long a string we look up. It's currently set at 13 characters.

@nicolasmaia
Copy link
Author

Can we extend it?

@birtles
Copy link
Member

birtles commented Jul 7, 2020

Sure. We should probably do some analysis on the database to see what the longest entries are.

There will be some performance impact too since we'll do a number of more lookups than we need to, but hopefully it's acceptable? It would be nice if we could do some performance comparison though.

@SaltfishAmi
Copy link
Contributor

SaltfishAmi commented Jul 8, 2020

1       3193
2       58979
3       61481
4       75303
5       49891
6       39011
7       30063
8       21971
9       13759
10      8512
11      5391
12      3409
13      2072
14      1332
15      889
16      554
17      371
18      236
19      168
20      95
21      62
22      51
23      35
24      17
25      16
26      11
27      7
28      3
29      3
30      2
31      2
32      1
33      3
34      2
35      0
36      0
37      1

This is from the bundled dict.idx. Didn't find anything longer than 37 full width characters.
The numbers include both kana-only and mixed kana-kanji entries.

@birtles
Copy link
Member

birtles commented Jul 9, 2020

Thanks! That's super useful.

I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact.

Perhaps we could extend the limit to 16 for now? 19 at most?

@SaltfishAmi
Copy link
Contributor

Thanks! That's super useful.

I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact.

Perhaps we could extend the limit to 16 for now? 19 at most?

Or maybe this should be adjustable in the options page?

@Lebon14
Copy link

Lebon14 commented Aug 4, 2020

Thanks! That's super useful.
I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact.
Perhaps we could extend the limit to 16 for now? 19 at most?

Or maybe this should be adjustable in the options page?

This!

@birtles
Copy link
Member

birtles commented Aug 11, 2020

I've updated the hard-coded limit to 16 for now in: 91846e1

I try really hard to avoid adding options so I'd prefer to wait until I can profile the performance difference properly and, if there's no noticeable impact, just increase the limit to 37.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants