-
Notifications
You must be signed in to change notification settings - Fork 756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: session locking #1839
feat: session locking #1839
Conversation
@@ -1013,7 +1013,8 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext | |||
|
|||
// reclaim session if request finishes successfully | |||
request.state = RequestState.DONE; | |||
crawlingContext.session?.markGood(); | |||
session?.markGood(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember we had some bug in here that was resolved by using crawlingContext.session
instead of just session
, or something similar. I think it was about this part, so might be irrelevant for basic crawler:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, and it was me who was fixing it 🤦🏼 at least I have a better grip on the session management in basic-/browser-crawler
.
That said, I find the session locking a much more minor issue now than before. Let's scratch that and rethink the session handling in the BrowserCrawler
instead.
When not stated otherwise (using useIncognitoPages
or experimentalContainers
), one browser instance is tied to a one specific session (retrieved from a session pool in a preLaunchHook
). Basically, the session constraints (maxErrorScore
, maxUsageCount
) are only checked at the time of session retrieval from the pool - that's what is happening in the #1836 with the maxUsageCount
- and it's also why CheerioCrawler is not riddled with this problem. Not sure why the AutoscaledPool.maxConcurrency
affects this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be solved by adding a getSession(currentSessionId)
call to a prePageCreate
hook and - if this call returns null - give the browser a new session (by calling getSession()
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be solved by adding a getSession(currentSessionId) call to a prePageCreate hook and - if this call returns null - give the browser a new session (by calling getSession())?
Sounds good to me. (as long as it works, I though we can't just change the proxy for existing browser, at least in PW?)
But maybe we should stop thinking about workarounds and start working on the user pool, as that should help with this exact problem.
Closing in favour of user pool development. |
This did not age well 😄 |
Adds session locking for limiting session usage concurrency.
This caused inconsistencies with the user-provided options values (see #1836 )and (in extreme cases?) might have tripped bot detection.
fixes #627