Is your feature request related to a problem? Please describe.
As part of the overall #3753 epic, we'll need a fast URLs-to-Documents pipeline, where a Crawler can play a significant role. However, by default, Crawler currently fetches content from a list of URLs sequentially and synchronously, one URL at a time.
We should investigate using a pool of worker threads to parallelize the injection of fetched documents into a processing pipeline.
The ultimate goal of these improvements is a blazing-fast pipeline that can take the top-k URLs and produce a set of Documents (optionally split to a specific token count).
Describe the solution you'd like
Add a pool of worker threads to fetch documents from the URL list.
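Since fetching is I/O-bound, a thread pool is a natural fit. A minimal sketch of what this could look like, assuming plain `requests`-based fetching (`fetch_url`, `fetch_all`, and the worker count are illustrative, not Crawler's actual API):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a single URL and return its body."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_all(urls: list[str], max_workers: int = 8) -> dict[str, str]:
    """Fetch URLs concurrently with a thread pool; returns a url -> body mapping."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except requests.RequestException:
                # Skip failures; a real crawler would log and/or retry here.
                continue
    return results
```

Because the workers spend their time waiting on network I/O, the GIL isn't a bottleneck here, so plain threads should already give close-to-linear speedups for plain-HTTP fetching.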
Describe alternatives you've considered
Use Crawler as-is and pay the performance penalty.
Additional context
N/A
Just a side note from what we faced with our scraping workers: Selenium and WebDriver don't play well with multi-threading.
We'll probably need to start a process pool instead, using multiprocessing.
Furthermore, we abandoned Selenium in favor of Playwright: it's faster, and we can easily capture screenshots, video recordings, and network traffic for specific auditing needs.
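A minimal sketch of that process-pool idea, with one Playwright browser per worker process so nothing browser-related is shared across threads or processes (the pool size and helper names are assumptions for illustration, not part of Crawler):

```python
from multiprocessing import Pool

from playwright.sync_api import sync_playwright

# Per-process globals: Playwright objects can't be pickled or shared,
# so each worker process launches and owns its own browser.
_playwright = None
_browser = None


def _init_worker() -> None:
    """Pool initializer: start Playwright and launch one browser per process."""
    global _playwright, _browser
    _playwright = sync_playwright().start()
    _browser = _playwright.chromium.launch(headless=True)


def fetch(url: str) -> str:
    """Render the page in this worker's browser and return its HTML."""
    page = _browser.new_page()
    try:
        page.goto(url, wait_until="networkidle")
        return page.content()
    finally:
        page.close()


if __name__ == "__main__":
    urls = ["https://example.com", "https://example.org"]
    # Browsers die with their worker processes when the pool exits.
    with Pool(processes=4, initializer=_init_worker) as pool:
        html_pages = pool.map(fetch, urls)
```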
Hey @danielbichuetti, I'm a research engineer working on language modelling, and I'd like to contribute to open source. I was wondering if this is still open and whether I could try to implement a solution?