Add Bunpro deck support #1383
Comments
Hi! Thank you for filing this issue. Yes, it's a little more difficult to add this data to 10ten at the moment because we preprocess all this data on the server (merging in pitch accent data, WaniKani data, Conning references, updated 漢検 and 教育漢字 levels, kanji components, etc.). I'd be more than happy to merge in the Bunpro levels as part of that process. I assume there are no licensing issues involved in using the Bunpro level data? Do you have any suggestions about how we would display the data? For example, for the WaniKani level data we show a "WK 21" badge next to words that match. Would it make sense to show "BP N5 Grammar" etc.? Perhaps that might be too long? Or is there a freely available Bunpro logo we should use in place of "BP"? Yes, entries like に堪えない could be a bit difficult. We might be able to coax some of them into working, however. For WaniKani we already do that to match WaniKani's "~放題" against "放題", and "感動する" against "感動". The only other data I think we might need is readings for kanji, if they are available. For example, WaniKani has an entry for 注文 which includes the reading ちゅうもん. That allows us to avoid showing it against the 注文/ちゅうぶん entry.
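To illustrate why the readings matter, here is a minimal Python sketch of matching with and without a reading. It is not 10ten's actual code: `DictEntry` and `matching_entries` are hypothetical names, and the two dictionary entries are just the 注文 example from the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictEntry:
    """Hypothetical dictionary entry: one headword/reading pair."""
    headword: str
    reading: str


def matching_entries(
    headword: str, reading: Optional[str], entries: list[DictEntry]
) -> list[DictEntry]:
    """Return the entries for `headword`, narrowed by `reading` when one is known."""
    return [
        e
        for e in entries
        if e.headword == headword and (reading is None or e.reading == reading)
    ]


entries = [DictEntry("注文", "ちゅうもん"), DictEntry("注文", "ちゅうぶん")]

# With a reading, only the intended entry gets the badge.
with_reading = matching_entries("注文", "ちゅうもん", entries)

# Without a reading, both entries match and both would get the badge.
without_reading = matching_entries("注文", None, entries)
```

Without reading data, every entry sharing the headword matches, which is the case the comment above is trying to avoid.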
One follow-up on this, since you mentioned that the set of deck names might change in future: is there somewhere we can pull this data from periodically? We try to keep the data up to date, which is why we push dictionary updates twice a week. We also fetch WaniKani data from the WaniKani API twice a week.
Currently it's just a list of Japanese terms and N1/N2/N3/N4/N5, so I don't really see how this could be meaningfully copyrighted, but I'm not exactly an expert in these matters so take that with a huge grain of salt.
Here's what I have in Yomichan: Bunpro does have a logo that's a stylized version of 文プロ merged into one character, but to keep things simple I just used '文プ' for the badge in Yomichan. That's of course entirely arbitrary. If one could spare one extra character, "文プロ" might be a little more proper.
So currently I use this very rough Python script to scrape the deck data: https://gitlab.com/flio/wkanki/-/blob/main/bunpro-vocab.py?ref_type=heads Unfortunately, unlike WaniKani, Bunpro does not currently offer a public API to do that cleanly, so I have to parse the HTML. That's naturally quite fragile. You can also see that I've hardcoded the set of decks in the script. Technically I could pull the full list from https://bunpro.jp/decks but there are already a bunch of other decks beyond the "Nx" JLPT-based ones, and I don't think it's a good idea to add those because they're just the same content in a different order to follow the progression of various textbooks. What may need to be added in the future is "N0" decks, if the Bunpro team adds new content beyond what currently exists, but unless they dramatically change the way they organize their content that should be a very uncommon occurrence that could be handled manually, I think. By the way, I did request a cleaner way to deal with this on the Bunpro forums but I haven't received an answer yet: https://community.bunpro.jp/t/feedback-suggested-improvements-feature-request/131/2108?u=simias I can imagine that it's not a very high priority for them right now, but I hope that eventually they'll offer a cleaner way to access this data.
Interesting. I wonder if absolute beginners might prefer "BP" instead. The Japanese localization could still use 文プロ. For the Vocab-N5 part, I think we could use colour to differentiate between vocab (blue) and grammar (red). Then we could either use "N5 Vocab" or even "N5 単語" since the colour should convey the distinction for beginners.
That all makes sense. Thank you very much. I see there's no reading information there so we either just have to annotate all entries that match the kanji or we do a second pass and try to annotate the entry with the highest priority (i.e. most commonly used version). The second pass would be better but it might be quite a bit of extra work.
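The second option could look roughly like the Python sketch below. It is only an illustration: the `(entry_id, headword, priority)` tuples and the rule that a larger `priority` means a more commonly used entry are assumptions, not the actual dictionary format.

```python
from collections import defaultdict


def annotate_best(
    deck_terms: set[str], entries: list[tuple[int, str, int]]
) -> dict[str, int]:
    """Pick one dictionary entry per deck term.

    `entries` holds hypothetical (entry_id, headword, priority) tuples,
    where a larger priority is assumed to mean a more common entry.
    """
    # First pass: collect every entry whose headword appears in the deck.
    candidates = defaultdict(list)
    for entry_id, headword, priority in entries:
        if headword in deck_terms:
            candidates[headword].append((priority, entry_id))

    # Second pass: keep only the highest-priority candidate for each term.
    return {headword: max(found)[1] for headword, found in candidates.items()}


entries = [(1, "注文", 10), (2, "注文", 1), (3, "感動", 5)]
best = annotate_best({"注文"}, entries)
```

Skipping the second pass and annotating everything in `candidates` gives the simpler "annotate all matching entries" behaviour.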
Thanks for doing that. Yes, I imagine so.
Just a little update here. I've spent quite a bit of time on this and the data pipeline part should now be done. In fact, the data has already been published and just needs to be displayed. I didn't quite get it done in time for the 1.16 release because I'd like to finish overhauling the options screen before adding more options to it. Hopefully I should finish it off in the next few days, however. The mapping of Bunpro data to word entries is fairly involved. I made it do two passes to find the best match and apply various heuristics to the grammar entries so that it can successfully match "の極み" to "極み" and "~を~に任せる" to "任せる". When it does a fuzzy match, it will show the original Bunpro entry alongside the tag. That said, there are still more than 500 grammar entries that aren't matched to anything. You can see the list here: https://docs.google.com/spreadsheets/d/1LoxIlKAWUaTy-weqs8jGYW6udsAWWpL8QUL-78NXl3c/edit?usp=sharing Fortunately it's possible to incrementally update the heuristics and the changes will just flow downstream without having to publish a new version of the add-on. If you can see any common cases we should handle, feel free to let me know.
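For readers wondering what such heuristics might look like, here is a rough Python sketch. It is not the actual 10ten pipeline: the particle list, `simplify`, and `match_grammar` are all assumptions made for illustration. It tries an exact match first, and only then a simplified key, returning the original Bunpro entry whenever the match was fuzzy so it can be shown next to the tag.

```python
import re
from typing import Optional

# ASCII tilde, fullwidth tilde (U+FF5E) and wave dash (U+301C).
TILDES = "~\uFF5E\u301C"
# Hypothetical, deliberately short particle list.
PARTICLES = "のにをがはでと"

# A placeholder tilde, optionally followed by one particle.
PLACEHOLDER = re.compile(f"[{TILDES}][{PARTICLES}]?")
# One leading particle, provided something follows it.
LEADING_PARTICLE = re.compile(f"^[{PARTICLES}](?=.)")


def simplify(bunpro_entry: str) -> str:
    """Strip placeholders and a single leading particle from a grammar entry."""
    key = PLACEHOLDER.sub("", bunpro_entry)
    return LEADING_PARTICLE.sub("", key, count=1)


def match_grammar(
    bunpro_entry: str, headwords: set[str]
) -> Optional[tuple[str, Optional[str]]]:
    """Return (headword, original entry if the match was fuzzy), or None."""
    if bunpro_entry in headwords:
        return bunpro_entry, None  # exact match: nothing extra to display
    key = simplify(bunpro_entry)
    if key in headwords:
        return key, bunpro_entry  # fuzzy match: keep the original for display
    return None


headwords = {"極み", "任せる", "とても"}
exact = match_grammar("とても", headwords)
fuzzy = match_grammar("の極み", headwords)
unmatched = match_grammar("い-Adjectives", headwords)
```

Trying the exact match first matters here: a word such as とても starts with a character from the particle list and would otherwise be mangled by the simplification step.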
@simias This feature should now be live in all browsers except Edge (still waiting on the Edge store review). You will need to enable it from the options panel, however. If it doesn't show up in Firefox it might be because your browser hasn't updated to 1.17 yet (you can manually trigger an update from the top-right menu in the add-ons management screen).
Hello and thank you for the great project!
I compiled a list of all vocabulary and grammar points in all of Bunpro's JLPT decks and used that to generate a Yomichan dictionary that shows a little badge next to an entry if it's found in one of Bunpro's decks. I thought I would try to do the same for 10ten but the process seems a little more involved and I admit that I'm a bit out of my depth here.
At any rate if somebody feels like doing the hard work, here's the raw data in JSON: https://gitlab.com/flio/wkanki/-/blob/main/bunpro/deck_index.json?ref_type=heads
Each term is given a deck_type (either Grammar or Vocab) and a deck name, which is currently one of N5, N4, N3, N2 or N1 (but there may be more in the future, so I think this field should be treated as an opaque text string).
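As a rough idea of how this could be consumed, here is a small Python sketch. The exact layout of deck_index.json isn't shown here, so the sample below assumes a flat list of objects with hypothetical `term`, `deck_type` and `deck` keys, and the levels in it are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical sample; the real deck_index.json layout and levels may differ.
SAMPLE = """
[
  {"term": "注文", "deck_type": "Vocab", "deck": "N4"},
  {"term": "に堪えない", "deck_type": "Grammar", "deck": "N1"},
  {"term": "い-Adjectives", "deck_type": "Grammar", "deck": "N5"}
]
"""


def build_index(raw: str) -> dict[str, list[tuple[str, str]]]:
    """Map each term to the (deck_type, deck) pairs it appears in."""
    index = defaultdict(list)
    for record in json.loads(raw):
        # The deck name is stored as-is: it is never parsed into a number,
        # so a future deck name would pass through unchanged.
        index[record["term"]].append((record["deck_type"], record["deck"]))
    return dict(index)


index = build_index(SAMPLE)
```

A term found in several decks simply ends up with several pairs in its list.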
The grammar deck data is a bit difficult to handle because many entries have either sentence fragments like "に堪えない" which don't really make sense in a dictionary application, or even English descriptions like "い-Adjectives" which will work even more poorly.
The vocab decks on the other hand should just work out of the box since all words are in dictionary form.
There's a lot more data I could add to this dump such as the direct URL to the various entries but I don't know if it would be useful.