Is your feature request related to a problem? Please describe.
As specified in the Agent tools & demo proposal, we want to enable an agent to search the web for relevant documents. WebRetriever is a wrapper around SearchEngine that produces a list of Haystack Documents.
Describe the solution you'd like
WebRetriever is a BaseComponent that will operate in two modes:
snippet mode: WebRetriever will return a list of Documents, each Document being a snippet of the search result
document mode: WebRetriever will return a list of Documents, each Document being a full HTML-stripped document of the search result
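The two modes could be sketched roughly as follows. This is a minimal, standalone illustration and not the actual Haystack implementation: `Document` here is a simplified stand-in for a Haystack Document, and `retrieve`, `strip_html`, the `fetch` callback, and the `url`/`snippet` keys of a search result are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List


@dataclass
class Document:
    # Simplified stand-in for a Haystack Document (assumption for this sketch).
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML page, joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def retrieve(
    results: List[Dict[str, str]],
    mode: str,
    fetch: Callable[[str], str],
) -> List[Document]:
    """Turn search results into Documents.

    results: hypothetical search hits, each with "url" and "snippet" keys.
    mode:    "snippet" keeps the search engine's snippet,
             "document" fetches the page and strips its HTML.
    fetch:   hypothetical callback that returns the raw HTML for a URL.
    """
    docs: List[Document] = []
    for result in results:
        if mode == "snippet":
            content = result["snippet"]
        elif mode == "document":
            content = strip_html(fetch(result["url"]))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        docs.append(Document(content=content, meta={"url": result["url"]}))
    return docs
```

Passing `fetch` in as a callback keeps the sketch free of network code, so the same function can be exercised with a fake page in tests and with a real HTTP client elsewhere.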
Describe alternatives you've considered
There are no alternatives; WebRetriever is a must.
@vblagoje While experimenting with WebRetriever, I noticed one small issue: PreProcessor fails to split by sentence in plenty of cases. This is caused by the way the PunktSentenceTokenizer was trained. For example, for some documents, setting the split length to 10 or to 2 made no difference, because the sentence “break” was a newline rather than any pattern the tokenizer recognizes.
Maybe we can try to train a custom “web” PunktSentenceTokenizer and test the results. Otherwise, I would recommend avoiding sentence splitting, as it could enlarge the context unnecessarily. I have some tests using a WebRetriever and SearchEngine.
Ok, but we can use word splitting until sentence splitting has been fixed. Please try that one. Try a window of, say, 100 words with a 10-word overlap.
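For reference, the suggested word-window split could look something like the sketch below. It is a plain-Python illustration, not the PreProcessor code itself; the function name `split_by_words` is hypothetical, and the parameter names `split_length` and `split_overlap` are chosen only to echo the terms used in this thread.

```python
from typing import List


def split_by_words(
    text: str, split_length: int = 100, split_overlap: int = 10
) -> List[str]:
    """Split text into windows of split_length words.

    Consecutive windows share split_overlap words. Words are separated on
    any whitespace, so newlines do not affect where the windows fall.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a window has reached the end of the text, so no
        # trailing chunk consists only of already-covered overlap words.
        if start + split_length >= len(words):
            break
    return chunks
```

With the defaults, each window starts 90 words after the previous one, so a 250-word page yields three chunks: two full windows and a shorter final one.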