Add Bunpro deck support #1383

Closed
Tracked by #1465
simias opened this issue Oct 22, 2023 · 6 comments · Fixed by #1477

Comments

@simias

simias commented Oct 22, 2023

Hello and thank you for the great project!

I compiled a list of all vocabulary and grammar points in all of Bunpro's JLPT decks and used that to generate a Yomichan dictionary that shows a little badge next to an entry if it's found in one of Bunpro's decks. I thought I would try to do the same for 10ten, but the process seems a little more involved and I admit that I'm a bit out of my depth here.

At any rate if somebody feels like doing the hard work, here's the raw data in JSON: https://gitlab.com/flio/wkanki/-/blob/main/bunpro/deck_index.json?ref_type=heads

Each term is given a deck_type (either Grammar or Vocab) and a deck name, which is currently one of N5, N4, N3, N2 or N1 (but there may be more in the future, so I think this field should be treated as an opaque text string).

The grammar deck data is a bit difficult to handle because many entries have either sentence fragments like "に堪えない" which don't really make sense in a dictionary application, or even English descriptions like "い-Adjectives" which will work even more poorly.

The vocab decks on the other hand should just work out of the box since all words are in dictionary form.
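
As a rough illustration of the split described above, the sketch below separates vocab entries from grammar entries and sets aside grammar "terms" that look like English descriptions. The record shape (keys `term`, `deck_type` and `deck`) and the sample records are assumptions made for illustration; the actual layout of deck_index.json may differ.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical record shape: the real deck_index.json may be laid out differently.
SAMPLE = """[
  {"term": "食べる", "deck_type": "Vocab", "deck": "N5"},
  {"term": "に堪えない", "deck_type": "Grammar", "deck": "N1"},
  {"term": "い-Adjectives", "deck_type": "Grammar", "deck": "N5"}
]"""

@dataclass(frozen=True)
class DeckEntry:
    term: str
    deck_type: str  # "Grammar" or "Vocab"
    deck: str       # opaque label such as "N5"; never parsed or validated

# Latin letters suggest an English description rather than a Japanese headword
LATIN = re.compile(r"[A-Za-z]")

def load_entries(raw: str) -> list[DeckEntry]:
    """Parse the JSON text into DeckEntry records."""
    return [DeckEntry(r["term"], r["deck_type"], r["deck"]) for r in json.loads(raw)]

def split_entries(entries):
    """Return (vocab, grammar, skipped), where skipped holds English-like terms."""
    vocab, grammar, skipped = [], [], []
    for e in entries:
        if LATIN.search(e.term):
            skipped.append(e)
        elif e.deck_type == "Vocab":
            vocab.append(e)
        else:
            grammar.append(e)
    return vocab, grammar, skipped

vocab, grammar, skipped = split_entries(load_entries(SAMPLE))
```

Keeping `deck` as a plain string means a new deck name would pass through unchanged rather than failing validation.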

There's a lot more data I could add to this dump such as the direct URL to the various entries but I don't know if it would be useful.

@birtles
Member

birtles commented Oct 23, 2023

Hi! Thank you for filing this issue.

Yes, it's a little more difficult to add this data to 10ten at the moment because we preprocess all this data on the server (merging in pitch accent data, WaniKani data, Conning references, updated 漢検 and 教育漢字 levels, kanji components etc.).

I'd be more than happy to merge in the Bunpro levels as part of that process. I assume there are no licensing issues involved in using the Bunpro level data?

Do you have any suggestions about how we would display the data? For example, for the WaniKani level data we show a "WK 21" badge next to words that match. Would it make sense to show "BP N5 Grammar" etc.? Perhaps that might be too long? Or is there a freely available Bunpro logo we should use in place of "BP"?

Yes, entries like に堪えない could be a bit difficult. We might be able to coax some of them into working, however. For WaniKani we already do that to match WaniKani's "~放題" against "放題", and "感動する" against "感動".
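
To make the kind of coaxing mentioned above concrete, here is a minimal sketch covering only the two WaniKani examples given. The function name `normalize_term` is made up for illustration, and this is not 10ten's actual matching code.

```python
def normalize_term(term: str) -> str:
    """Strip affix markers and a trailing する so the term can match a headword."""
    # Remove placeholder tildes (ASCII tilde, wave dash and fullwidth tilde)
    stripped = term.strip("~〜～")
    # "感動する" -> "感動", but leave the bare verb "する" alone
    if stripped.endswith("する") and len(stripped) > 2:
        stripped = stripped[:-2]
    return stripped
```

A rule this blunt would also shorten verbs where する is not a detachable suffix, so in practice the result would probably need to be checked against the dictionary before being used.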

The only other data I think we might need is readings for kanji, if they are available. For example, WaniKani has an entry for 注文 which includes the reading ちゅうもん. That allows us to avoid showing it against the 注文/ちゅうぶん entry.
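
The effect of having a reading can be sketched as follows. The `DictEntry` type and `matching_entries` helper are hypothetical names used only for this example.

```python
from typing import NamedTuple, Optional

class DictEntry(NamedTuple):
    kanji: str
    reading: str

def matching_entries(entries, kanji: str, reading: Optional[str] = None):
    """Return entries for `kanji`, narrowed by `reading` when one is supplied."""
    return [
        e for e in entries
        if e.kanji == kanji and (reading is None or e.reading == reading)
    ]

entries = [DictEntry("注文", "ちゅうもん"), DictEntry("注文", "ちゅうぶん")]

# With a reading, only the intended entry is tagged
with_reading = matching_entries(entries, "注文", "ちゅうもん")
# Without one, both entries match and both would receive the badge
without_reading = matching_entries(entries, "注文")
```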

@birtles
Member

birtles commented Oct 23, 2023

> Each term is given a deck_type (either Grammar or Vocab) and a deck name, which is currently one of N5, N4, N3, N2 or N1 (but there may be more in the future, so I think this field should be treated as an opaque text string).

One follow-up on this, since you mentioned that the set of deck names might change in future: is there somewhere we can pull this data from periodically? We try to keep data up-to-date which is why we push dictionary updates twice a week. We also fetch WaniKani data from the WaniKani API twice a week.

@simias
Author

simias commented Oct 23, 2023

> I'd be more than happy to merge in the Bunpro levels as part of that process. I assume there are no licensing issues involved in using the Bunpro level data?

Currently it's just a list of Japanese terms and N1/N2/N3/N4/N5, so I don't really see how this could be meaningfully copyrighted, but I'm not exactly an expert in these matters so take that with a huge grain of salt.

> Do you have any suggestions about how we would display the data? For example, for the WaniKani level data we show a "WK 21" badge next to words that match. Would it make sense to show "BP N5 Grammar" etc.? Perhaps that might be too long? Or is there a freely available Bunpro logo we should use in place of "BP"?

Here's what I have in Yomichan:

[Image: bunpro]

Bunpro does have a logo that's a stylized version of 文プロ merged into one character. To keep things simple I just used '文プ' for the badge in Yomichan, but that's of course entirely arbitrary. I think if one could spare one extra character, "文プロ" might be a little more proper.

> One follow-up on this, since you mentioned that the set of deck names might change in future: is there somewhere we can pull this data from periodically? We try to keep data up-to-date which is why we push dictionary updates twice a week. We also fetch WaniKani data from the WaniKani API twice a week.

So currently I use this very rough Python script to scrape the deck data: https://gitlab.com/flio/wkanki/-/blob/main/bunpro-vocab.py?ref_type=heads

Unfortunately unlike WaniKani, Bunpro does not currently offer a public API to do that cleanly, so I have to parse the HTML. That's naturally quite fragile.

You can also see that I've hardcoded the set of decks in the script. Technically I could pull the full list from https://bunpro.jp/decks, but there are already a bunch of other decks beyond the "Nx" JLPT-based ones, and I don't think it's a good idea to add those because they're just the same content in a different order, following the progression of various textbooks.

What may need to be handled in the future is the Bunpro team adding "N0" decks with new content beyond what currently exists. Unless they dramatically change the way they organize their content, though, that should be a very uncommon occurrence that could be handled manually, I think.
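
One way to picture that policy is the sketch below. It is not taken from the linked script: the `classify_decks` helper, the `N\d` pattern for spotting a future JLPT-style deck, and the sample deck names are all assumptions made for illustration.

```python
import re

# Decks currently indexed, per the discussion above
KNOWN_DECKS = frozenset({"N5", "N4", "N3", "N2", "N1"})
# Assumed shape of a future JLPT-style deck name such as "N0"
JLPT_LIKE = re.compile(r"N\d")

def classify_decks(listed):
    """Sort deck names into those to scrape, those to review by hand, and the rest."""
    to_scrape, needs_review, ignored = [], [], []
    for name in listed:
        if name in KNOWN_DECKS:
            to_scrape.append(name)
        elif JLPT_LIKE.fullmatch(name):
            needs_review.append(name)
        else:
            ignored.append(name)
    return to_scrape, needs_review, ignored

# "Textbook Path A" is a made-up name standing in for the textbook-ordered decks
to_scrape, needs_review, ignored = classify_decks(["N5", "N0", "Textbook Path A"])
```

Anything landing in `needs_review` would be a prompt for someone to look at the deck by hand before adding it to the hardcoded set.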

By the way I did request a cleaner way to deal with this on the Bunpro forums but I haven't received an answer yet: https://community.bunpro.jp/t/feedback-suggested-improvements-feature-request/131/2108?u=simias

I can imagine that it's not a very high priority for them right now, but I hope that eventually they'll offer a cleaner way to access this data.

@birtles
Member

birtles commented Oct 24, 2023

> Bunpro does have a logo that's a stylized version of 文プロ merged into one character. To keep things simple I just used '文プ' for the badge in Yomichan, but that's of course entirely arbitrary. I think if one could spare one extra character, "文プロ" might be a little more proper.

Interesting. I wonder if absolute beginners might prefer "BP" instead. The Japanese localization could still use 文プロ.

For the Vocab-N5 part, I think we could use colour to differentiate between vocab (blue) and grammar (red). Then we could either use "N5 Vocab" or even "N5 単語" since the colour should convey the distinction for beginners.

> So currently I use this very rough Python script to scrape the deck data: https://gitlab.com/flio/wkanki/-/blob/main/bunpro-vocab.py?ref_type=heads
>
> Unfortunately unlike WaniKani, Bunpro does not currently offer a public API to do that cleanly, so I have to parse the HTML. That's naturally quite fragile.
>
> You can also see that I've hardcoded the set of decks in the script. Technically I could pull the full list from https://bunpro.jp/decks, but there are already a bunch of other decks beyond the "Nx" JLPT-based ones, and I don't think it's a good idea to add those because they're just the same content in a different order, following the progression of various textbooks.
>
> What may need to be handled in the future is the Bunpro team adding "N0" decks with new content beyond what currently exists. Unless they dramatically change the way they organize their content, though, that should be a very uncommon occurrence that could be handled manually, I think.

That all makes sense. Thank you very much. I see there's no reading information there so we either just have to annotate all entries that match the kanji or we do a second pass and try to annotate the entry with the highest priority (i.e. most commonly used version).

The second pass would be better but it might be quite a bit of extra work.
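
The two options can be compared with a small sketch. The `Candidate` type and its `priority` scores are hypothetical stand-ins for whatever frequency information the dictionary data carries, and the numbers in the example are made up.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    kanji: str
    reading: str
    priority: int  # hypothetical score; higher means more commonly used

def annotate_all(candidates, term):
    """First option: tag every entry whose kanji matches the term."""
    return [c for c in candidates if c.kanji == term]

def annotate_best(candidates, term):
    """Second option: tag only the highest-priority matching entry, if any."""
    matches = annotate_all(candidates, term)
    return max(matches, key=lambda c: c.priority) if matches else None

candidates = [
    Candidate("注文", "ちゅうもん", 10),  # made-up scores for illustration
    Candidate("注文", "ちゅうぶん", 1),
]
everything = annotate_all(candidates, "注文")
best = annotate_best(candidates, "注文")
```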

> By the way I did request a cleaner way to deal with this on the Bunpro forums but I haven't received an answer yet: https://community.bunpro.jp/t/feedback-suggested-improvements-feature-request/131/2108?u=simias
>
> I can imagine that it's not a very high priority for them right now, but I hope that eventually they'll offer a cleaner way to access this data.

Thanks for doing that. Yes, I imagine so.

@birtles birtles mentioned this issue Nov 27, 2023
11 tasks
@birtles
Member

birtles commented Nov 27, 2023

Just a little update here. I've spent quite a bit of time on this and the data pipeline part should now be done. In fact, the data has already been published and just needs to be displayed.

I didn't quite get it done in time for the 1.16 release because I'd like to finish overhauling the options screen before adding more options to it. Hopefully I should finish it off in the next few days, however.

The mapping of Bunpro data to word entries is fairly involved. I made it do two passes to find the best match and apply various heuristics to the grammar entries so that it can successfully match "の極み" to "極み" and "~を~に任せる" to "任せる". When it does a fuzzy match, it will show the original Bunpro entry alongside the tag.
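
For readers curious what such a two-pass match might look like, here is a minimal sketch that handles only the two examples above. The particle list, function names and return shape are assumptions for illustration and are not the heuristics actually used in the pipeline.

```python
import re

# Leading particles dropped on the fuzzy pass; an illustrative list only
LEADING_PARTICLES = ("の", "を", "に", "が", "は", "で", "と")

def candidate_forms(term: str) -> list[str]:
    """Split on placeholder tildes, then strip one leading particle from each piece."""
    pieces = [p for p in re.split(r"[~〜～]", term) if p]
    forms = []
    for piece in pieces:
        for particle in LEADING_PARTICLES:
            if piece.startswith(particle) and len(piece) > len(particle):
                piece = piece[len(particle):]
                break
        forms.append(piece)
    return forms

def match_grammar(term: str, headwords: set[str]):
    """Return (headword, original) or None; original is set only for fuzzy matches."""
    # Pass 1: exact match against the dictionary headwords
    if term in headwords:
        return term, None
    # Pass 2: fuzzy match, trying the longest surviving piece first
    for form in sorted(candidate_forms(term), key=len, reverse=True):
        if form in headwords:
            return form, term
    return None

headwords = {"極み", "任せる"}
```

Returning the original term alongside a fuzzy match is what would let the popup show the Bunpro entry next to the tag, as described above.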

That said, there are still more than 500 grammar entries that aren't matched to anything. You can see the list here:

https://docs.google.com/spreadsheets/d/1LoxIlKAWUaTy-weqs8jGYW6udsAWWpL8QUL-78NXl3c/edit?usp=sharing

Fortunately it's possible to incrementally update the heuristics and the changes will just flow downstream without having to publish a new version of the add-on. If you can see any common cases we should handle, feel free to let me know.

birtles added a commit that referenced this issue Nov 29, 2023
birtles added a commit that referenced this issue Nov 29, 2023
@birtles
Member

birtles commented Dec 11, 2023

@simias This feature should now be live in all browsers except Edge (still waiting on the Edge store review). You will need to enable it from the options panel, however. If it doesn't show up in Firefox it might be because your browser hasn't updated to 1.17 yet (you can manually trigger an update from the top-right menu in the add-ons management screen).
