
Crawler should optionally use a pool of worker threads to fetch URLs #4084

Closed
vblagoje opened this issue Feb 7, 2023 · 4 comments
Labels: P3 (Low priority, leave it in the backlog), topic:crawler, wontfix (This will not be worked on)

Comments

@vblagoje (Member) commented Feb 7, 2023

Is your feature request related to a problem? Please describe.
As part of the overall #3753 epic, we'll need a fast URLs-to-Documents pipeline, where a Crawler can play a significant role. However, the Crawler currently fetches content from a list of URLs sequentially and synchronously, one URL at a time.

We should investigate using a pool of worker threads to parallelize the injection of fetched documents into a processing pipeline.

The ultimate goal of these improvements is a blazing-fast pipeline that can take the top-k URLs and produce a set of Documents (with or without splitting into chunks of a specific token count).

Describe the solution you'd like
Add a pool of worker threads to fetch documents from the URL list, along the lines of the sketch below.
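
A minimal sketch of the kind of parallelism this proposes, assuming a plain HTTP fetch via `requests` rather than the Crawler's actual Selenium-based fetching; `fetch_url`, `fetch_all`, and `max_workers` are illustrative names for this example, not part of the Haystack API:

```python
# Sketch only: fetch a list of URLs concurrently with a thread pool
# instead of one-by-one.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a single URL and return its raw HTML."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_all(urls: list[str], max_workers: int = 8) -> dict[str, str]:
    """Fetch all URLs in parallel; returns a mapping of URL -> HTML."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch_url, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep crawling even if one URL fails
                print(f"Failed to fetch {url}: {exc}")
    return results
```

The fetched pages could then be handed off to the Documents pipeline as they complete, rather than waiting for the whole list to finish.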

Describe alternatives you've considered
Use the Crawler as-is and pay the performance penalty.

Additional context
N/A

@vblagoje added the topic:crawler and "Contributions wanted! Looking for external contributions" labels on Feb 7, 2023
@danielbichuetti (Contributor) commented

Just a side note based on what we ran into with our scraping workers: Selenium and WebDriver don't play well with multi-threading.

You'll probably need to start a process pool instead, using multiprocessing.

Furthermore, we abandoned Selenium in favor of Playwright: it's faster, and we can easily capture screenshots, video recordings, and network traffic for specific auditing needs.
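
A hedged sketch of the process-pool approach suggested above, with one WebDriver per worker process since Selenium drivers aren't safe to share across threads; it assumes selenium and a local Chrome/chromedriver are available, and `init_driver` / `fetch_page` are illustrative names rather than any Haystack API:

```python
# Sketch only: each worker process creates its own headless Chrome driver
# via the pool initializer, so no driver is ever shared between processes.
import multiprocessing as mp

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

_driver = None  # one WebDriver instance per worker process


def init_driver() -> None:
    """Pool initializer: create a headless Chrome driver in this worker process."""
    global _driver
    options = Options()
    options.add_argument("--headless")
    _driver = webdriver.Chrome(options=options)


def fetch_page(url: str) -> str:
    """Render a URL with this process's driver and return the page source."""
    _driver.get(url)
    return _driver.page_source


if __name__ == "__main__":
    urls = ["https://example.com", "https://example.org"]
    with mp.Pool(processes=2, initializer=init_driver) as pool:
        pages = pool.map(fetch_page, urls)
    print([len(page) for page in pages])
```

A Playwright-based variant would follow the same per-process pattern, with each worker owning its own browser context.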

@julian-risch added the "P3 Low priority, leave it in the backlog" label on Feb 8, 2023
@jackapbutler (Contributor) commented

Hey @danielbichuetti, I'm a research engineer working on language modelling and looking to contribute to open source. I was wondering if this is still open and whether I could try to implement a solution?

@danielbichuetti (Contributor) commented Feb 9, 2023

Hello @jackapbutler,

I've not started on it. I was discussing some implementation options for a specific scenario.

@jackapbutler (Contributor) commented

Cool @danielbichuetti, thanks for the tip on multithreading. I'll take a look and run some comparative tests for single vs. multiprocessing.

@masci removed the "Contributions wanted! Looking for external contributions" label on Dec 13, 2023
@masci added the "wontfix (This will not be worked on)" label on Feb 26, 2024
@masci closed this as completed on Feb 26, 2024