
Index created by elasticlunr-rs doesn't work with elasticlunr.js for characters that can't be represented by a single UTF-16 Code Unit #53


Open
Sunshine40 opened this issue Jun 5, 2024 · 1 comment


Sunshine40 commented Jun 5, 2024

fn add_token(&mut self, doc_ref: &str, token: &str, term_freq: f64) {
    let mut iter = token.chars();
    if let Some(character) = iter.next() {

During index building, elasticlunr-rs iterates over the token's &str contents one Unicode scalar value (a Rust char) at a time.

The JS library, by contrast, does it this way:

elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];

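The JS string is actually indexed by UTF-16 code unit. A single code unit is a whole character for English, most alphabetic text, and common Chinese characters, but emoji and rare Chinese characters lie outside the Basic Multilingual Plane and take two code units (a surrogate pair). For those characters the Rust side inserts one key per character while the JS side looks up two keys, so the lookup cannot match what was indexed.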
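
To make the mismatch concrete, here is a minimal Rust sketch that splits a token both ways using only the standard library. The helper names scalar_keys and utf16_keys are made up for illustration and are not part of elasticlunr-rs; they only mimic how each side would step through a token.

```rust
/// Keys as a `token.chars()` walk sees them: one per Unicode scalar value.
/// (Hypothetical helper, not an elasticlunr-rs function.)
fn scalar_keys(token: &str) -> Vec<char> {
    token.chars().collect()
}

/// Keys as a JS `token[idx]` walk sees them: one per UTF-16 code unit.
/// (Hypothetical helper, not an elasticlunr-rs function.)
fn utf16_keys(token: &str) -> Vec<u16> {
    token.encode_utf16().collect()
}

/// True when both walks take the same number of steps, i.e. every
/// character in the token fits in a single UTF-16 code unit.
fn walks_agree(token: &str) -> bool {
    scalar_keys(token).len() == utf16_keys(token).len()
}
```

With these, walks_agree("中") holds (U+4E2D is one code unit), while walks_agree("\u{1F600}") and walks_agree("\u{20000}") do not: each of those is one char on the Rust side but a two-unit surrogate pair on the JS side.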


Related issue with mdBook.


ImUrX commented Mar 19, 2025

3.0.3 should probably be yanked for now as it breaks mdbook
