-
Notifications
You must be signed in to change notification settings - Fork 484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better search results #560
Conversation
I think this also fixes #520, though I didn't check. |
Modified the PR title to make clear this is good to go, not WIP :) |
This looks great, thanks! I would also have a few search-related feature requests. If they could be added into this PR, that would be awesome (not exactly sure how much work they'd be), or we could just merge this as is and track them in a separate issue.
|
Regarding #520: searching for "get" now definitely finds stuff, but not the |
Thanks for having a look.
Probably the best fix for this would be to add fuzzy matching by default, perhaps just with an edit distance of 1, so as not to hurt performance too much. You can already try searching
Sounds like an improvement, but I don't know how it could be implemented. If we enable the wildcards stemming won't be performed anymore. Though
There is support for highlighting (see the demo), but as you know we currently don't show any of the body in the search page. Seems like more work, perhaps better for a separate issue. Not 100% convinced it's worth it though, it's also nice to see the search results directly beneath each other as it is now. Plus we would have to check that it won't hurt performance too much.
Do you have examples where this is not the case? See also the last paragraph of the commit message in aa1bb53. Shorter matches get higher scores, and search results are sorted by score. So if you want I can add the fuzzy matching and trailing wildcard, I think that's not too much work. |
Ha, this is the reason |
For lunr.js changes see https://github.com/olivernn/lunr.js/blob/master/CHANGELOG.mdown For upgrade guide see https://lunrjs.com/guides/upgrading.html The index is now immutable, hence the documents have to be added before the end of the configuration function. Build time field boosting is no longer supported, and replaced by query time boosting. In this commit it is excluded. Before adding it again, evaluate if it is still required, because with default settings it should score the short titles higher anyways.
Spaces are encoded as + in the URI and have to be treated separately first. fixes JuliaDocs#471
only put in what we use
To be able to match fully qualified title fields more easily.
Query provides a programmatic way of defining (customisation of) queries.
With an edit distance of 2, such that `htmlwrita` matches `htmlwriter`.
A match on `title` gets a boost of 10, just like the original implementation. Furthermore the pipeline is not used for `title`. Trailing wildcards did not seem to work well with the other options, so they are off.
Stops filtering out Base Julia names and syntax that also happen to be common stopwords. fixes JuliaDocs#520
Ok @mortenpi could you have a look at the last 4 commits? They should improve the results a bit more. Essentially I'm now searching each word twice, once in the title and once in the text. This way I can boost the title matches. I didn't build the Base docs, and no time now, could you perhaps try to roughly compare the performance? |
So my primitive benchmarking system (i.e. counting seconds) showed that it's roughly twice as slow. But the results are definitely better I think! Here's a build of Base docs so that you could check it out and compare the performance as well: http://mortenpi.eu/julia/search-v1/search/?q=get My feeling is that working search is more important than performant search, so even if this is a bit slow, we can live with it (and optimize later) -- the fixes are worth it. But perhaps @ararslan or someone could give their opinion as well whether this performance is acceptable? Here's an example of a case where I'd expect "Documenter.Travis.genkeys(Function)" to be higher up in the list (I am assuming that like 90% of the searches are for specific functions, so those matches should probably be preferred): |
Thanks. Building the index takes very long. We can prebuild and serialize it to improve on this in another PR (mentioned in #212 as well). |
Yup.
I think we could disable the submit event on the search page. Something like this should work: $("form.search").submit(function() {
return false;
}) Might be best to give the form an |
This causes a reload, which rebuilds the index, which can take long. Downside is that the URL does not reflect the query, `/search/?q=get`, although pasting links with this syntax will still work.
Thanks for the suggestion. I added another commit. Downside is that the URL is no longer updated on enter, but I don't see an easy way around that. |
In principle we could put the query into the fragment (after |
I hope you don't mind some drive by comments... The search looks good and the results seem promising, though I'm not familiar enough with Julia to really judge. I think there are a couple of things that might improve the performance though... I think the quickest win would be to debounce the keyup event, it would mean less queries made and hopefully reduce the contention on the UI thread between the UI and search itself. For the best results you could experiment with doing the search in a web worker, freeing up the UI thread. The implementation is likely to be a little more involved though. Finally, you might perhaps tweak the query you are doing. Using edit distance is the most expensive way to query, and a value of 2, especially for short queries, might be too expensive. One approach I've suggested to other implementors (with good results) is to combine exact search, with trailing wildcards and optionally fuzzy search using edit distance when doing typeahead style searches. Something like this: idx.query(function (q) {
q.term(term, { boost: 100 })
q.term(term, { boost: 10, usePipeline: false, wildcard: lunr.Query.wilcard.TRAILING })
q.term(term, { boost: 1, usePipeline: false, editDistance: 1 })
}) In your implementation I see you are boosting the two fields differently, you might experiment with only doing the expensive fuzzy search in the title. As I'm sure you've discovered, getting the right balance between decent results and performance is down to a bit of experimentation. Anyway, just a few suggestions. I'm excited to see Julia using Lunr, and I'm more than willing to offer any help or advice where I can. |
Just when I was wondering if it was ok to ping the library author for advice 😃. |
Quoting olivernn: > I think the quickest win would be to debounce the keyup event, it would mean less queries made and hopefully reduce the contention on the UI thread between the UI and search itself.
This is good to go I guess? |
Sorry, just noticed this now. Yeah the performance leaves a bit to be desired but it's really not all that much worse than it currently is, so ¯\_(ツ)_/¯ |
I was a bit streched on time to try out the other suggestions by @olivernn. But I figured to improve performance whilst still getting good search results, testing and benchmarking is the way to go. Documenter-jl-search-testing is a start to test this in Node, on the Base docs, with |
Related discussion of performance (with links to some server-side index experiments): mkdocs/mkdocs#859 |
These are a few commits that should result in better search results for Julia code.
Lunr.js has changed quite a bit in the meantime, allowing for better performance and customization.