Add WebRetriever as an agent tool #4259

Closed
vblagoje opened this issue Feb 23, 2023 · 2 comments · Fixed by #4437

Comments

@vblagoje
Member

Is your feature request related to a problem? Please describe.

As specified in the Agent tools & demo proposal, we want to enable an agent to search the web for relevant documents. WebRetriever is a wrapper around SearchEngine that produces a list of Haystack Documents.

Describe the solution you'd like

WebRetriever is a BaseComponent that will operate in two modes:

  • snippet mode: WebRetriever will return a list of Documents, each Document being a snippet of the search result
  • document mode: WebRetriever will return a list of Documents, each Document being a full HTML-stripped document of the search result
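The two modes above can be sketched roughly as follows. This is a minimal illustration, not the Haystack implementation: the `Document` class is a stand-in for a Haystack Document, and `retrieve`, `strip_html`, and the `search` and `fetch` callables are hypothetical names introduced here. It assumes the search engine returns one dict per hit with `url` and `snippet` keys.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List


@dataclass
class Document:
    """Stand-in for a Haystack Document: text content plus metadata."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def strip_html(html: str) -> str:
    """Return the text of an HTML page with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def retrieve(
    query: str,
    search: Callable[[str], List[Dict[str, str]]],  # hypothetical search engine call
    fetch: Callable[[str], str],                    # hypothetical URL -> HTML fetcher
    mode: str = "snippet",
) -> List[Document]:
    """Return one Document per search hit, in snippet or document mode."""
    if mode not in ("snippet", "document"):
        raise ValueError(f"unknown mode: {mode!r}")
    docs = []
    for hit in search(query):
        if mode == "snippet":
            text = hit["snippet"]                 # the search result snippet as-is
        else:
            text = strip_html(fetch(hit["url"]))  # full page, HTML stripped
        docs.append(Document(content=text, meta={"url": hit["url"]}))
    return docs
```

Passing `search` and `fetch` in as callables keeps the sketch independent of any particular search provider or HTTP client.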

Describe alternatives you've considered
There are no alternatives; WebRetriever is a must.

@danielbichuetti
Contributor

danielbichuetti commented Feb 24, 2023

@vblagoje While experimenting with WebRetriever, I noticed one small issue: PreProcessor fails to split by sentence in plenty of cases. This is caused by the way the PunktSentenceTokenizer was trained. For example, for some documents, setting the split length to 10 or to 2 made no difference, because the only sentence break was a newline rather than any other pattern.
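To illustrate the effect, here is a small sketch. It does not use Punkt itself; `naive_sentence_split` is a hypothetical stand-in that breaks only after sentence-ending punctuation, which is enough to show why a split length has no effect when lines are separated by newlines alone.

```python
import re
from typing import List


def naive_sentence_split(text: str) -> List[str]:
    """Stand-in splitter (NOT Punkt): break only after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def group(units: List[str], split_length: int) -> List[List[str]]:
    """Group consecutive units into chunks of at most split_length units."""
    return [units[i:i + split_length] for i in range(0, len(units), split_length)]


# Typical scraped web text: lines separated by newlines, no closing punctuation.
web_text = "Home\nProducts\nAbout us\nContact\nLatest news from the team"
# Ordinary prose for comparison.
prose = "First sentence. Second sentence. Third one."

web_units = naive_sentence_split(web_text)
prose_units = naive_sentence_split(prose)

# The web text comes back as a single "sentence", so both split lengths
# produce the same single chunk; the prose is chunked as expected.
web_chunks_2 = group(web_units, 2)
web_chunks_10 = group(web_units, 10)
prose_chunks_2 = group(prose_units, 2)
```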

Maybe we can try to train a custom “web” PunktSentenceTokenizer and test the results. Otherwise, I would recommend avoiding sentence splitting, as it could increase the context unnecessarily. I have some tests using a WebRetriever and SearchEngine.

@vblagoje
Member Author

Ok, but we can use word splitting until sentence splitting has been fixed. Please try that one. Try a window of, say, 100 words with a 10-word overlap.
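A rough sketch of that kind of word splitting, in plain Python rather than through PreProcessor. The function name `split_by_words` and its parameters are hypothetical, and whitespace tokenization is an assumption made for the example.

```python
from typing import List


def split_by_words(text: str, window: int = 100, overlap: int = 10) -> List[str]:
    """Split text into chunks of `window` words, sharing `overlap` words with the previous chunk."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    words = text.split()     # whitespace tokenization (an assumption of this sketch)
    step = window - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break            # this chunk already reached the end of the text
    return chunks


# Usage: 250 words yield windows starting at words 0, 90 and 180.
sample = " ".join(f"w{i}" for i in range(250))
chunks = split_by_words(sample, window=100, overlap=10)
```

Because newlines count as whitespace here, this approach does not depend on how the page marks its sentence breaks.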
