Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better search results #560

Merged
merged 13 commits into from
Oct 7, 2017
Merged

Better search results #560

merged 13 commits into from
Oct 7, 2017

Conversation

visr
Copy link
Contributor

@visr visr commented Sep 5, 2017

These are a few commits that should result in better search results for Julia code.
Lunr.js has changed quite a bit in the meantime, allowing for better performance and customization.

@visr
Copy link
Contributor Author

visr commented Sep 5, 2017

I think this also fixes #520, though I didn't check.

@visr visr changed the title Work on search Better search results Sep 11, 2017
@visr
Copy link
Contributor Author

visr commented Sep 11, 2017

Modified the PR title to make clear this is good to go, not WIP :)

@mortenpi mortenpi added this to the 0.12 milestone Sep 23, 2017
@mortenpi
Copy link
Member

This looks great, thanks!

I would also have a few search-related feature requests. If they could be added into this PR, that would be awesome (not exactly sure how much work they'd be), or we could just merge this as is and track them in a separate issue.

  • Partial matches don't always show up, e.g. searching for "autodocs" does not find @autodocs related stuff in Documenter's manual.
  • Should the language-reduced search words not find functions names etc? I.e. right now, "makes" finds everything that has "make" in it, e.g. Generator.make.. perhaps it shouldn't (i.e. only match "make" in docstrings).
  • I think the page of search results could be significantly improved.
    • Snippets showing what was matched would be great.
    • Somehow prioritizing the results (e.g. exact function name matched would be preferred over docstring matches)

@mortenpi
Copy link
Member

Regarding #520: searching for "get" now definitely finds stuff, but not the Base.get function. "get!" finds Base.get! without problems.

@visr
Copy link
Contributor Author

visr commented Sep 24, 2017

Thanks for having a look.

Partial matches don't always show up, e.g. searching for "autodocs" does not find @autodocs related stuff in Documenter's manual.

Probably the best fix for this would be to add fuzzy matching by default, perhaps just with an edit distance of 1, so as not to hurt performance too much. You can already try searching autodocs~1. Additionaly, we can try enabling automatic trailing wildcards, such that you already get search results as you type depl without having to fully type deploydocs.

Should the language-reduced search words not find functions names etc? I.e. right now, "makes" finds everything that has "make" in it, e.g. Generator.make.. perhaps it shouldn't (i.e. only match "make" in docstrings).

Sounds like an improvement, but I don't know how it could be implemented. If we enable the wildcards stemming won't be performed anymore. Though makes* still matches Generator.make through fuzzy matching.

Snippets showing what was matched would be great.

There is support for highlighting (see the demo), but as you know we currently don't show any of the body in the search page. Seems like more work, perhaps better for a separate issue. Not 100% convinced it's worth it though, it's also nice to see the search results directly beneath each other as it is now. Plus we would have to check that it won't hurt performance too much.

Somehow prioritizing the results (e.g. exact function name matched would be preferred over docstring matches)

Do you have examples where this is not the case? See also the last paragraph of the commit message in aa1bb53. Shorter matches get higher scores, and search results are sorted by score.

So if you want I can add the fuzzy matching and trailing wildcard, I think that's not too much work.

@visr
Copy link
Contributor Author

visr commented Sep 25, 2017

Ha, this is the reason get doesn't find Base.get; the stop word filter. We could disable it altogether, or just remove the ones that are function names in Base.

For lunr.js changes see
https://github.com/olivernn/lunr.js/blob/master/CHANGELOG.mdown
For upgrade guide see https://lunrjs.com/guides/upgrading.html

The index is now immutable, hence the documents have to be added before
the end of the configuration function.

Build time field boosting is no longer supported, and replaced by query
time boosting. In this commit it is excluded. Before adding it again,
evaluate if it is still required, because with default settings it
should score the short titles higher anyways.
Spaces are encoded as + in the URI and have to be treated separately
first.

fixes JuliaDocs#471
only put in what we use
To be able to match fully qualified title fields more easily.
Query provides a programmatic way of defining (customisation of)
queries.
With an edit distance of 2, such that `htmlwrita` matches `htmlwriter`.
A match on `title` gets a boost of 10, just like the original
implementation. Furthermore the pipeline is not used for `title`.
Trailing wildcards did not seem to work well with the other options, so
they are off.
Stops filtering out Base Julia names and syntax that also happen to be
common stopwords.
fixes JuliaDocs#520
@visr
Copy link
Contributor Author

visr commented Sep 25, 2017

Ok @mortenpi could you have a look at the last 4 commits? They should improve the results a bit more. Essentially I'm now searching each word twice, once in the title and once in the text. This way I can boost the title matches.

I didn't build the Base docs, and no time now, could you perhaps try to roughly compare the performance?

mortenpi added a commit to mortenpi/julia that referenced this pull request Sep 25, 2017
@mortenpi
Copy link
Member

So my primitive benchmarking system (i.e. counting seconds) showed that it's roughly twice as slow. But the results are definitely better I think!

Here's a build of Base docs so that you could check it out and compare the performance as well: http://mortenpi.eu/julia/search-v1/search/?q=get

My feeling is that working search is more important than performant search, so even if this is a bit slow, we can live with it (and optimize later) -- the fixes are worth it. But perhaps @ararslan or someone could give their opinion as well whether this performance is acceptable?

Here's an example of a case where I'd expect "Documenter.Travis.genkeys(Function)" to be higher up in the list (I am assuming that like 90% of the searches are for specific functions, so those matches should probably be preferred):
deploy
But again, this can be a future improvement.

@visr
Copy link
Contributor Author

visr commented Sep 26, 2017

Thanks. Building the index takes very long. We can prebuild and serialize it to improve on this in another PR (mentioned in #212 as well).
I have the habit of hitting enter after completing my query, but I see that this rebuilds the index, taking much longer. Any way of avoiding this?

@mortenpi
Copy link
Member

improve on this in another PR

Yup.

I have the habit of hitting enter after completing my query, but I see that this rebuilds the index, taking much longer. Any way of avoiding this?

I think we could disable the submit event on the search page. Something like this should work:

$("form.search").submit(function() {
    return false;
})

Might be best to give the form an id though (in HTMLWriter.jl) and use that to refer to it.

This causes a reload, which rebuilds the index, which can take long.
Downside is that the URL does not reflect the query, `/search/?q=get`, although pasting links with this syntax will still work.
@visr
Copy link
Contributor Author

visr commented Sep 26, 2017

Thanks for the suggestion. I added another commit. Downside is that the URL is no longer updated on enter, but I don't see an easy way around that.

@mortenpi
Copy link
Member

In principle we could put the query into the fragment (after #) and then it should be possible to update that in javascript. Actually, since search is javascript-based anyway, having the query in the fragment seems like the more correct solution. But this could be in a separate PR.

@olivernn
Copy link

I hope you don't mind some drive by comments...

The search looks good and the results seem promising, though I'm not familiar enough with Julia to really judge. I think there are a couple of things that might improve the performance though...

I think the quickest win would be to debounce the keyup event, it would mean less queries made and hopefully reduce the contention on the UI thread between the UI and search itself. For the best results you could experiment with doing the search in a web worker, freeing up the UI thread. The implementation is likely to be a little more involved though.

Finally, you might perhaps tweak the query you are doing. Using edit distance is the most expensive way to query, and a value of 2, especially for short queries, might be too expensive. One approach I've suggested to other implementors (with good results) is to combine exact search, with trailing wildcards and optionally fuzzy search using edit distance when doing typeahead style searches. Something like this:

idx.query(function (q) {
  q.term(term, { boost: 100 })
  q.term(term, { boost: 10, usePipeline: false, wildcard: lunr.Query.wilcard.TRAILING })
  q.term(term, { boost: 1, usePipeline: false, editDistance: 1 })
})

In your implementation I see you are boosting the two fields differently, you might experiment with only doing the expensive fuzzy search in the title. As I'm sure you've discovered, getting the right balance between decent results and performance is down to a bit of experimentation.

Anyway, just a few suggestions. I'm excited to see Julia using Lunr, and I'm more than willing to offer any help or advice where I can.

@visr
Copy link
Contributor Author

visr commented Sep 29, 2017

Just when I was wondering if it was ok to ping the library author for advice 😃.
Much appreciated. I'll try out some of your suggestions this weekend.

Quoting olivernn:
> I think the quickest win would be to debounce the keyup event, it would mean less queries made and hopefully reduce the contention on the UI thread between the UI and search itself.
@mortenpi
Copy link
Member

mortenpi commented Oct 6, 2017

This is good to go I guess?

@mortenpi mortenpi merged commit 2e152e1 into JuliaDocs:master Oct 7, 2017
@mortenpi
Copy link
Member

mortenpi commented Oct 7, 2017

We can always make additional improvements later. Thanks again @visr, @olivernn!

@ararslan
Copy link
Member

ararslan commented Oct 7, 2017

But perhaps @ararslan or someone could give their opinion as well whether this performance is acceptable?

Sorry, just noticed this now. Yeah the performance leaves a bit to be desired but it's really not all that much worse than it currently is, so ¯\_(ツ)_/¯

@visr visr deleted the search branch October 8, 2017 20:04
@visr
Copy link
Contributor Author

visr commented Oct 8, 2017

I was a bit streched on time to try out the other suggestions by @olivernn. But I figured to improve performance whilst still getting good search results, testing and benchmarking is the way to go.

Documenter-jl-search-testing is a start to test this in Node, on the Base docs, with npm test and npm run perf.

@ihnorton
Copy link

ihnorton commented Dec 27, 2017

Related discussion of performance (with links to some server-side index experiments): mkdocs/mkdocs#859

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants