feat: add parallel multiprocessing option to Crawler #4126

Closed
wants to merge 16 commits

Conversation

@jackapbutler (Contributor) commented Feb 9, 2023

Related Issues

Proposed Changes:

  • Added an optional num_processes argument to Crawler which enables downloading URL content using multiple processes. As far as I understand, this requires creating a Chrome driver per process to handle the downloads, since the driver object is not picklable across processes (a rough sketch of this pattern is shown just below this list).
  • I also added a few other small refactors to enable this in a relatively clean way.
  • The feature requires joblib, so I've added it as a dependency. However, we could potentially use concurrent.futures from the standard library if we'd rather avoid the extra dependency.
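
The gist of the approach, as a minimal sketch: joblib fans the URLs out to worker processes, and each task builds its own headless Chrome driver because a live driver cannot be pickled and shipped between processes. The helper names below (_fetch_url, crawl) are illustrative, not the exact code in this PR.

from joblib import Parallel, delayed
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def _fetch_url(url: str) -> str:
    # Illustrative: a fresh headless driver per task. The real implementation would
    # typically reuse one driver for all URLs handled by a given worker process.
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

def crawl(urls, num_processes=1):
    if num_processes == 1:
        return [_fetch_url(url) for url in urls]
    # joblib's default loky backend spawns worker processes; only the URL strings are pickled.
    return Parallel(n_jobs=num_processes)(delayed(_fetch_url)(url) for url in urls)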

How did you test it?

I verified the new functionality works for different sets of URLs and timed the runs shown below. The timings (in seconds) were measured on a 10-core M1 MacBook and are shown as mean +/- std:

Number of URLs | Single process (seconds) | Multiprocessing (seconds)
19             | 21.73 +/- 0.86           | 14.45 +/- 1.90
153            | 113.33 +/- 9.28          | 51.034 +/- 2.12

It doesn't get close to a 10x speedup, but I assume most of this workflow is I/O-bound. I also added a unit test to ensure this works in CI.

Checklist

To Recreate Results

smaller_url_set = [
    "https://haystack.deepset.ai/overview",
    "https://tetrath.com/",
]

larger_url_set = [
    "https://haystack.deepset.ai/overview",
    "https://tetrath.com/",
    "https://github.com/",
    "https://github.com/deepset-ai",
]
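
A rough timing harness along these lines could reproduce the numbers above; the exact Crawler constructor arguments differ between Haystack versions, num_processes is the new argument proposed in this PR, and the -1 convention for "use all cores" is an assumption borrowed from joblib.

import time
from haystack.nodes import Crawler

def time_crawl(urls, num_processes):
    crawler = Crawler(urls=urls, crawler_depth=1, num_processes=num_processes)
    start = time.perf_counter()
    results = crawler.crawl(urls=urls)  # returns crawled documents or file paths depending on the version
    return time.perf_counter() - start, results

elapsed_single, _ = time_crawl(smaller_url_set, num_processes=1)
elapsed_multi, _ = time_crawl(smaller_url_set, num_processes=-1)
print(f"single process: {elapsed_single:.2f}s, multiprocessing: {elapsed_multi:.2f}s")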

@jackapbutler requested a review from a team as a code owner February 9, 2023 22:07
@jackapbutler requested review from bogdankostic and removed the request for a team February 9, 2023 22:07
@jackapbutler changed the title from "Add parallel crawling option to Crawler" to "Add parallel multiprocessing option to Crawler" Feb 9, 2023
@danielbichuetti (Contributor)

That's great. I was about to look over this tonight.

Amazing job. Any specific reason to not use ProcessPoolExecutor?

I was thinking about the extra memory that Selenium will consume, since it needs to spawn multiple drivers. Maybe add a note to the docstring that memory usage will increase significantly with this option? Or maybe allow the user to set the number of processes, so they can balance speed against memory.

@jackapbutler (Contributor, Author)

Thanks @danielbichuetti! I generally chose joblib over ProcessPoolExecutor because it uses the loky backend by default, and I've heard that's slightly more robust than the built-in multiprocessing backend behind concurrent.futures. That said, I'm definitely not an expert on their inner workings, so I'm happy to try ProcessPoolExecutor if you think that's the cleaner way to add this and merge it with the rest of the codebase. I can give it a go tomorrow, since it would avoid the extra third-party package.

Yes, totally agree on the additional memory and number-of-processes points. I've just added another commit which should give the user that flexibility and awareness.
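
For comparison, here is a minimal sketch of the same fan-out using the standard library's concurrent.futures instead of joblib; the fetch function is a stand-in for the real per-process download logic.

from concurrent.futures import ProcessPoolExecutor

def fetch(url: str) -> str:
    # Stand-in worker; the real version would drive a headless browser as sketched earlier.
    return url.upper()

if __name__ == "__main__":
    # The __main__ guard is required when the process pool uses the "spawn" start method
    # (the default on Windows and macOS).
    urls = ["https://haystack.deepset.ai/overview", "https://tetrath.com/"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(fetch, urls))
    print(results)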

@danielbichuetti (Contributor) commented Feb 9, 2023

@jackapbutler
loky is great. It's the default sklearn backend for single-host.

I was exploring the reasoning behind the decision. Some core team members were concerned about introducing more dependencies in the past.

@CLAassistant commented Feb 9, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@danielbichuetti (Contributor) commented Feb 10, 2023

@jackapbutler I wanted to share some experiences I have had in the past. Since the CI runs on Python 3.8, I would recommend using a virtual environment with that version; it helps avoid some test issues, such as mypy failures.

I usually use two environments:

  • Python 3.8
  • Python 3.10

UPDATE: Python 3.8 is being used by CI.

@jackapbutler (Contributor, Author)

Great, thanks @danielbichuetti, I'll try that now. I think I've still got 1-2 tests failing in CI, which I'll dig into tomorrow 🙂.

@vblagoje (Member)

First of all, @jackapbutler, this is a phenomenally high-quality contribution. Having said that, let's wait for #4122 to be merged first, and then we can rebase and adjust this PR.

@vblagoje (Member) commented Feb 10, 2023

@jackapbutler, as a side note, @danielbichuetti and I have been pondering a crawler that first attempts to retrieve data from a URL (and convert it to a Document) using a simple requests fetch. If that fetch fails because the JavaScript used at the URL prevents the "simple scrape", we then fall back to the browser fetch. The motivation is to make Crawler a super-fast URL-to-Document retriever in addition to the current crawling functionality: given a list of URLs (for example, results from a search engine), Crawler converts them to a list of Documents blazingly fast. Let's talk about that on Discord and plan accordingly.
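
A rough sketch of that "requests first, browser fallback" idea; the helper name and the size heuristic for detecting a failed plain fetch are assumptions, not an agreed design.

import requests

def fetch_fast_or_fallback(url: str, browser_fetch) -> str:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        # Heuristic: if the body is suspiciously small, assume the page needs JavaScript rendering.
        if len(response.text) > 500:
            return response.text
    except requests.RequestException:
        pass
    # Fall back to the full browser fetch (e.g. the Selenium path Crawler already uses).
    return browser_fetch(url)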

@jackapbutler (Contributor, Author)

Hey @vblagoje, thanks very much, happy to be able to contribute! That sounds like a great idea. I'm not sure about the current state of the conversation, but let me know if it would be helpful for me to try implementing and benchmarking it in this PR 🙂

@jackapbutler (Contributor, Author)

Hey @vblagoje, I just saw that #4122 has been merged; shall I rebase and update this branch?

@vblagoje (Member)

Yes please, @jackapbutler; coordinate with @danielbichuetti whenever necessary. FYI, we are also working on a new component (#4259) that might be interesting to you. I am already working on it in cooperation with @danielbichuetti.

@jackapbutler changed the title from "Add parallel multiprocessing option to Crawler" to "feat: add parallel multiprocessing option to Crawler" Mar 16, 2023
@jackapbutler (Contributor, Author)

Hey @danielbichuetti and @vblagoje, apologies for the delay; I got tied up with other work. I've rebased onto the new main branch and re-added the functionality, which seems to be almost passing CI now 🎊.

I was wondering whether either of you has an idea of what might be going wrong in the Windows environment? I have no way to test it locally and I'm not sure what it could be.

@bogdankostic removed their request for review April 12, 2023 18:39
@jackapbutler closed this by deleting the head repository Sep 11, 2023
Development

Successfully merging this pull request may close these issues.

Crawler should optionally use a pool of worker threads to fetch URLs
5 participants