feat: add parallel multiprocessing option to Crawler
#4126
Conversation
That's great, I was about to look over this tonight. Amazing job. Any specific reason not to use ProcessPoolExecutor? I was thinking about the extra memory that Selenium will consume, since it needs to spawn multiple drivers. Maybe add a note in the docstring that memory will increase significantly when this is used? Or maybe allow the user to set the number of processes, so they can balance speed vs. memory.
Thanks @danielbichuetti! I generally chose joblib for the implementation here. Yes, totally agree on the additional memory and number-of-processes points. I've just added another commit which should give the user that flexibility and awareness.
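A minimal usage sketch of the option being discussed, assuming the num_processes argument this PR adds and the Haystack v1 Crawler API; treat the exact names and defaults as placeholders:

```python
from haystack.nodes import Crawler

# num_processes is the argument added by this PR (per the description below).
# Each process spawns its own Chrome driver, so memory grows roughly linearly
# with the process count; pick a value that balances speed vs. memory.
crawler = Crawler(output_dir="crawled_files", num_processes=4)
docs = crawler.crawl(urls=["https://haystack.deepset.ai"], crawler_depth=0)
```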
@jackapbutler I was exploring the reasoning behind the decision. Some core team members have been concerned in the past about introducing more dependencies.
@jackapbutler I wanted to share some experiences I have had in the past. Since CI runs on Python 3.8, I would recommend using a virtual environment with that version; it helps avoid some test issues, such as with mypy and others. I usually use two environments.
UPDATE: Python 3.8 is being used by CI.
Great, thanks @danielbichuetti, I'll try that now. I think I've still got 1-2 tests failing in CI which I'll dig into tomorrow 🙂.
First of all @jackapbutler, this is a phenomenally high-quality contribution. Having said that, let's wait for the merge of #4122 first, and then we can rebase and adjust this PR.
@jackapbutler, as a side note, @danielbichuetti and I pondered a crawler that first attempts to retrieve data from a URL (and convert it to a Document) using a simple requests fetch. If the fetch fails because the JavaScript used at that URL prevents the "simple scrape", we then fall back to the browser fetch. The motivation is to make Crawler a super fast URL-to-Document retriever in addition to its current crawling functionality. Given a list of URLs (e.g. results from a search engine), Crawler would convert them to a Document list, blazingly fast. Let's talk about that on Discord and plan accordingly.
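Not part of this PR, but a rough sketch of the requests-first / browser-fallback idea described above; fetch_with_fallback and the content heuristic are made up for illustration, and the Selenium setup is simplified:

```python
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def fetch_with_fallback(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: try a plain HTTP fetch first, fall back to a browser."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        if response.text.strip():
            # The simple fetch returned content; skip the expensive browser path.
            return response.text
    except requests.RequestException:
        pass  # fall through to the browser-based fetch

    # Fallback: render the page with a headless browser for JS-heavy sites.
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```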
Hey @vblagoje, thanks very much, happy to be able to contribute! That sounds like a great idea. I'm not sure about the current state of the conversation, but let me know if it's helpful for me to consider that and try to implement / benchmark it in this PR 🙂
Yes please, @jackapbutler, coordinate with @danielbichuetti whenever necessary. FYI, we are also working on a new component, #4259, that might be interesting to you. I am already working on it in cooperation with @danielbichuetti.
Hey @danielbichuetti and @vblagoje, apologies for the delay, I got tied up with other things. I've rebased off the new main branch and re-added the functionality, which seems to be almost passing CI now 🎊. I was wondering if either of you had an idea of what might be going wrong in the Windows environment? I have no way to test it locally and am not sure what it could be.
Related Issues
Proposed Changes:
Added a num_processes argument to Crawler which enables downloading URL content using multiple processes. This requires creating a Chrome driver per process to handle the downloading of URL content, AFAIU, as the driver object is not pickleable between processes. The parallelisation uses joblib, so I've added that as a dependency. However, we could potentially use concurrent.futures from the standard library if we don't want that.
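A minimal sketch of the approach under the stated assumptions: joblib's process-based "loky" backend, with each worker call building a fresh driver for simplicity (the PR reuses one driver per process). _download_url is illustrative, not the PR's actual internals:

```python
from joblib import Parallel, delayed
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def _download_url(url: str) -> str:
    """Illustrative worker: builds its own driver, since a driver
    instance cannot be pickled and shared across processes."""
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


urls = ["https://haystack.deepset.ai", "https://www.deepset.ai"]
# The "loky" backend runs the workers in separate processes.
pages = Parallel(n_jobs=4, backend="loky")(delayed(_download_url)(u) for u in urls)
```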
How did you test it?
I verified the new functionality works for different sets of URLs and timed the speeds. The timings (in seconds) were performed on a 10-core M1 MacBook and are reported as mean +/- std. It doesn't get close to a 10x speedup, but I assume most of this workflow is IO-bound. I also added a unit test to ensure this works in CI.
Checklist
The PR title conventionally starts with one of: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, or test:.

To Recreate Results
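The original snippet for recreating the results is not preserved here; a rough timing sketch along these lines, with placeholder URLs and the hypothetical num_processes argument, could stand in for it (repeat each run several times to obtain mean +/- std):

```python
import time

from haystack.nodes import Crawler

urls = ["https://haystack.deepset.ai"] * 8  # placeholder URL list

for num_processes in (1, 4, 8):
    crawler = Crawler(output_dir="crawled_files", num_processes=num_processes)
    start = time.perf_counter()
    # crawler_depth=0 fetches only the given URLs, no link following.
    crawler.crawl(urls=urls, crawler_depth=0)
    elapsed = time.perf_counter() - start
    print(f"num_processes={num_processes}: {elapsed:.2f}s")
```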