
Crawler should optionally use a pool of worker threads to fetch URLs #4084

Closed
vblagoje opened this issue Feb 7, 2023 · 4 comments
Labels: P3 (Low priority, leave it in the backlog), topic:crawler, wontfix (This will not be worked on)

Comments

@vblagoje (Member) commented Feb 7, 2023

Is your feature request related to a problem? Please describe.
As part of the overall #3753 epic, we'll need a fast URLs-to-Documents pipeline, where a Crawler can play a significant role. However, the Crawler currently fetches content from a list of URLs sequentially and synchronously, one URL at a time.

We should investigate using a pool of worker threads to parallelize the injection of fetched documents into a processing pipeline.

The ultimate goal of these improvements is a blazing-fast pipeline that can take the top-k URLs and produce a set of Documents (with or without splitting into chunks of a specific token count).

Describe the solution you'd like
Add a pool of worker threads to fetch documents from the URL list, along the lines of the sketch below.
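
A minimal sketch of the kind of parallelism this proposes, assuming a plain HTTP fetch via `requests` rather than the Crawler's actual Selenium-based fetching; `fetch_url`, `fetch_all`, and `max_workers` are illustrative names for this example, not part of the Haystack API:

```python
# Sketch only: fetch a list of URLs concurrently with a thread pool
# instead of one-by-one.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a single URL and return its raw HTML."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_all(urls: list[str], max_workers: int = 8) -> dict[str, str]:
    """Fetch all URLs in parallel; returns a mapping of URL -> HTML."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch_url, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep crawling even if one URL fails
                print(f"Failed to fetch {url}: {exc}")
    return results
```

The fetched pages could then be handed off to the Documents pipeline as they complete, rather than waiting for the whole list to finish.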

Describe alternatives you've considered
Use the Crawler as-is and pay the performance penalty.

Additional context
N/A

@vblagoje added the topic:crawler and "Contributions wanted! Looking for external contributions" labels on Feb 7, 2023
@danielbichuetti (Contributor) commented

Just a side note based on what we ran into with our scraping workers: Selenium and WebDriver don't play well with multi-threading.

You'll probably need to start a process pool instead, using multiprocessing.

Furthermore, we abandoned Selenium in favor of Playwright: it's faster, and we can easily capture screenshots, video recordings, and network traffic for specific auditing needs.
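
A hedged sketch of the process-pool approach suggested above, with one WebDriver per worker process since Selenium drivers aren't safe to share across threads; it assumes selenium and a local Chrome/chromedriver are available, and `init_driver` / `fetch_page` are illustrative names rather than any Haystack API:

```python
# Sketch only: each worker process creates its own headless Chrome driver
# via the pool initializer, so no driver is ever shared between processes.
import multiprocessing as mp

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

_driver = None  # one WebDriver instance per worker process


def init_driver() -> None:
    """Pool initializer: create a headless Chrome driver in this worker process."""
    global _driver
    options = Options()
    options.add_argument("--headless")
    _driver = webdriver.Chrome(options=options)


def fetch_page(url: str) -> str:
    """Render a URL with this process's driver and return the page source."""
    _driver.get(url)
    return _driver.page_source


if __name__ == "__main__":
    urls = ["https://example.com", "https://example.org"]
    with mp.Pool(processes=2, initializer=init_driver) as pool:
        pages = pool.map(fetch_page, urls)
    print([len(page) for page in pages])
```

A Playwright-based variant would follow the same per-process pattern, with each worker owning its own browser context.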

@julian-risch added the "P3 Low priority, leave it in the backlog" label on Feb 8, 2023
@jackapbutler (Contributor) commented

Hey @danielbichuetti, I'm a research engineer working on language modelling and looking to contribute to open source. I was wondering if this is still open and whether I could try to implement a solution?

@danielbichuetti (Contributor) commented Feb 9, 2023

Hello @jackapbutler,

I've not started on it. I was discussing some implementation options for a specific scenario.

@jackapbutler (Contributor) commented

Cool @danielbichuetti, thanks for the tip on multithreading. I'll take a look and run some comparative tests for single vs. multiprocessing.

@masci removed the "Contributions wanted! Looking for external contributions" label on Dec 13, 2023
@masci added the "wontfix (This will not be worked on)" label on Feb 26, 2024
@masci closed this as completed on Feb 26, 2024