feat: add parallel multiprocessing option to Crawler #4126

Closed
wants to merge 16 commits

Conversation

@jackapbutler (Contributor) commented Feb 9, 2023

Related Issues

Proposed Changes:

  • Added an optional num_processes argument to Crawler which enables downloading URL content using multiple processes. As far as I understand, this requires creating a Chrome driver per process to handle the downloads, since the driver object is not picklable across processes (a rough sketch of this pattern is shown just below this list).
  • I also added a few other small refactors to enable this in a relatively clean way.
  • The feature requires joblib, so I've added it as a dependency. However, we could potentially use concurrent.futures from the standard library if we'd rather avoid the extra dependency.
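
The gist of the approach, as a minimal sketch: joblib fans the URLs out to worker processes, and each task builds its own headless Chrome driver because a live driver cannot be pickled and shipped between processes. The helper names below (_fetch_url, crawl) are illustrative, not the exact code in this PR.

from joblib import Parallel, delayed
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def _fetch_url(url: str) -> str:
    # Illustrative: a fresh headless driver per task. The real implementation would
    # typically reuse one driver for all URLs handled by a given worker process.
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

def crawl(urls, num_processes=1):
    if num_processes == 1:
        return [_fetch_url(url) for url in urls]
    # joblib's default loky backend spawns worker processes; only the URL strings are pickled.
    return Parallel(n_jobs=num_processes)(delayed(_fetch_url)(url) for url in urls)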

How did you test it?

I verified the new functionality works for different sets of URLs and timed the runs shown below. The timings (in seconds) were measured on a 10-core M1 MacBook and are shown as mean +/- std:

Number of URLs | Single process (seconds) | Multiprocessing (seconds)
19             | 21.73 +/- 0.86           | 14.45 +/- 1.90
153            | 113.33 +/- 9.28          | 51.034 +/- 2.12

It doesn't get close to a 10x speedup, but I assume most of this workflow is I/O-bound. I also added a unit test to ensure this works in CI.

Checklist

To Recreate Results

smaller_url_set = [
    "https://haystack.deepset.ai/overview",
    "https://tetrath.com/",
]

larger_url_set = [
    "https://haystack.deepset.ai/overview",
    "https://tetrath.com/",
    "https://github.com/",
    "https://github.com/deepset-ai",
]
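
A rough timing harness along these lines could reproduce the numbers above; the exact Crawler constructor arguments differ between Haystack versions, num_processes is the new argument proposed in this PR, and the -1 convention for "use all cores" is an assumption borrowed from joblib.

import time
from haystack.nodes import Crawler

def time_crawl(urls, num_processes):
    crawler = Crawler(urls=urls, crawler_depth=1, num_processes=num_processes)
    start = time.perf_counter()
    results = crawler.crawl(urls=urls)  # returns crawled documents or file paths depending on the version
    return time.perf_counter() - start, results

elapsed_single, _ = time_crawl(smaller_url_set, num_processes=1)
elapsed_multi, _ = time_crawl(smaller_url_set, num_processes=-1)
print(f"single process: {elapsed_single:.2f}s, multiprocessing: {elapsed_multi:.2f}s")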

@jackapbutler requested a review from a team as a code owner February 9, 2023 22:07
@jackapbutler requested review from bogdankostic and removed the request for a team February 9, 2023 22:07
@jackapbutler changed the title from "Add parallel crawling option to Crawler" to "Add parallel multiprocessing option to Crawler" Feb 9, 2023
@danielbichuetti (Contributor)

That's great. I was about to look over this tonight.

Amazing job. Any specific reason to not use ProcessPoolExecutor?

I was thinking about the extra memory that Selenium will consume, since it needs to spawn multiple drivers. Maybe add a note to the docstring that memory usage will increase significantly with this option? Or maybe allow the user to set the number of processes, so they can balance speed against memory.

@jackapbutler (Contributor, Author)

Thanks @danielbichuetti! I generally chose joblib over ProcessPoolExecutor because it uses the loky backend by default, and I've heard that's slightly more robust than the built-in multiprocessing backend behind concurrent.futures. That said, I'm definitely not an expert on their inner workings, so I'm happy to try ProcessPoolExecutor if you think that's the cleaner way to add this and merge it with the rest of the codebase. I can give it a go tomorrow, since it would avoid the extra third-party package.

Yes, totally agree on the additional memory and number-of-processes points. I've just added another commit which should give the user that flexibility and awareness.
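
For comparison, here is a minimal sketch of the same fan-out using the standard library's concurrent.futures instead of joblib; the fetch function is a stand-in for the real per-process download logic.

from concurrent.futures import ProcessPoolExecutor

def fetch(url: str) -> str:
    # Stand-in worker; the real version would drive a headless browser as sketched earlier.
    return url.upper()

if __name__ == "__main__":
    # The __main__ guard is required when the process pool uses the "spawn" start method
    # (the default on Windows and macOS).
    urls = ["https://haystack.deepset.ai/overview", "https://tetrath.com/"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(fetch, urls))
    print(results)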

@danielbichuetti (Contributor) commented Feb 9, 2023

@jackapbutler
loky is great. It's the default sklearn backend for single-host.

I was exploring the reasoning behind the decision. Some core team members were concerned about introducing more dependencies in the past.

@CLAassistant commented Feb 9, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@danielbichuetti (Contributor) commented Feb 10, 2023

@jackapbutler I wanted to share some experiences I have had in the past. Since the CI runs on Python 3.8, I would recommend using a virtual environment with that version; it helps avoid some test issues, such as mypy failures.

I usually use two environments:

  • Python 3.8
  • Python 3.10

UPDATE: Python 3.8 is being used by CI.

@jackapbutler (Contributor, Author)

Great, thanks @danielbichuetti, I'll try that now. I think I've still got 1-2 tests failing in CI, which I'll dig into tomorrow 🙂.

@vblagoje (Member)

First of all, @jackapbutler, this is a phenomenally high-quality contribution. Having said that, let's wait for #4122 to be merged first, and then we can rebase and adjust this PR.

@vblagoje (Member) commented Feb 10, 2023

@jackapbutler, as a side note, @danielbichuetti and I have been pondering a crawler that first attempts to retrieve data from a URL (and convert it to a Document) using a simple requests fetch. If that fetch fails because the JavaScript used at the URL prevents the "simple scrape", we then fall back to the browser fetch. The motivation is to make Crawler a super-fast URL-to-Document retriever in addition to the current crawling functionality: given a list of URLs (for example, results from a search engine), Crawler converts them to a list of Documents blazingly fast. Let's talk about that on Discord and plan accordingly.
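
A rough sketch of that "requests first, browser fallback" idea; the helper name and the size heuristic for detecting a failed plain fetch are assumptions, not an agreed design.

import requests

def fetch_fast_or_fallback(url: str, browser_fetch) -> str:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        # Heuristic: if the body is suspiciously small, assume the page needs JavaScript rendering.
        if len(response.text) > 500:
            return response.text
    except requests.RequestException:
        pass
    # Fall back to the full browser fetch (e.g. the Selenium path Crawler already uses).
    return browser_fetch(url)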

@jackapbutler (Contributor, Author)

Hey @vblagoje, thanks very much, happy to be able to contribute! That sounds like a great idea. I'm not sure about the current state of the conversation, but let me know if it would be helpful for me to try implementing and benchmarking it in this PR 🙂

@jackapbutler (Contributor, Author)

Hey @vblagoje, I just saw that #4122 has been merged; shall I rebase and update this branch?

@vblagoje (Member)

Yes please, @jackapbutler; coordinate with @danielbichuetti whenever necessary. FYI, we are also working on a new component (#4259) that might be interesting to you. I am already working on it in cooperation with @danielbichuetti.

@jackapbutler changed the title from "Add parallel multiprocessing option to Crawler" to "feat: add parallel multiprocessing option to Crawler" Mar 16, 2023
@jackapbutler (Contributor, Author)

Hey @danielbichuetti and @vblagoje, apologies for the delay; I got tied up with other work. I've rebased onto the new main branch and re-added the functionality, which seems to be almost passing CI now 🎊.

I was wondering whether either of you has an idea of what might be going wrong in the Windows environment? I have no way to test it locally and I'm not sure what it could be.

@bogdankostic removed their request for review April 12, 2023 18:39
@jackapbutler closed this by deleting the head repository Sep 11, 2023
Development

Successfully merging this pull request may close these issues.

Crawler should optionally use a pool of worker threads to fetch URLs
5 participants