
feat: session locking #1839

Closed
wants to merge 2 commits into from

Conversation

@barjin (Contributor) commented Mar 20, 2023

Adds session locking for limiting session usage concurrency.

Unlimited concurrent usage caused inconsistencies with the user-provided option values (see #1836) and (in extreme cases?) might have tripped bot detection.

fixes #627
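A minimal sketch of what "session locking" means here (an illustration of the concept, not the actual code from this PR; the class and its members are hypothetical): a session tracks how many requests currently hold it and refuses new holders past a cap.

```typescript
// Hypothetical sketch of session locking: cap how many requests may use
// a session concurrently. Not the PR's actual implementation.
class LockableSession {
    private activeUsers = 0;

    constructor(readonly id: string, private maxConcurrentUsage: number) {}

    // Returns true and takes a "lock" slot if the session is below its
    // concurrency cap; returns false otherwise.
    tryAcquire(): boolean {
        if (this.activeUsers >= this.maxConcurrentUsage) return false;
        this.activeUsers++;
        return true;
    }

    // Frees one slot once the request using the session finishes.
    release(): void {
        if (this.activeUsers > 0) this.activeUsers--;
    }
}

const session = new LockableSession('session_0', 2);
console.log(session.tryAcquire()); // true
console.log(session.tryAcquire()); // true
console.log(session.tryAcquire()); // false: concurrency limit reached
session.release();
console.log(session.tryAcquire()); // true again after a release
```

Without such a cap, nothing prevents many in-flight requests from sharing one session at the same time, which is what #627 asks to prevent.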

```diff
@@ -1013,7 +1013,8 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
     // reclaim session if request finishes successfully
     request.state = RequestState.DONE;
-    crawlingContext.session?.markGood();
+    session?.markGood();
```
Member
I remember we had a bug here that was resolved by using crawlingContext.session instead of just session, or something similar. I think it was about this part, so it might be irrelevant for the basic crawler:

https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L474-L478

Contributor Author

Right, and it was me who fixed it 🤦🏼. At least now I have a better grip on the session management in basic-/browser-crawler.

That said, I now consider session locking a much more minor issue than before. Let's scratch that and rethink the session handling in BrowserCrawler instead.

Unless stated otherwise (via useIncognitoPages or experimentalContainers), one browser instance is tied to one specific session (retrieved from the session pool in a preLaunchHook). The session constraints (maxErrorScore, maxUsageCount) are only checked at the time the session is retrieved from the pool - that is what happens in #1836 with maxUsageCount - and it is also why CheerioCrawler is not riddled with this problem. Not sure why AutoscaledPool.maxConcurrency affects this, though.
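The failure mode described above can be sketched like this (a simplified, hypothetical SessionPool, not Crawlee's real one): maxUsageCount is only enforced at retrieval time, so a browser that holds on to its session and reuses it for every page never re-triggers the check.

```typescript
// Simplified sketch: the usage constraint lives only in getSession().
class Session {
    usageCount = 0;

    constructor(readonly id: string, readonly maxUsageCount: number) {}

    get usable(): boolean {
        return this.usageCount < this.maxUsageCount;
    }

    markUsed(): void {
        this.usageCount++;
    }
}

class SessionPool {
    private sessions: Session[] = [];
    private counter = 0;

    constructor(private maxUsageCount: number) {}

    // The constraint is enforced HERE, at retrieval time only.
    getSession(): Session {
        let session = this.sessions.find((s) => s.usable);
        if (!session) {
            session = new Session(`session_${this.counter++}`, this.maxUsageCount);
            this.sessions.push(session);
        }
        session.markUsed();
        return session;
    }
}

// A browser without useIncognitoPages keeps one session for its whole
// lifetime: it calls getSession() once (in a preLaunchHook) and then
// reuses the same object for every page, bypassing the pool's check.
const pool = new SessionPool(2);
const browserSession = pool.getSession(); // usageCount becomes 1

for (let i = 0; i < 5; i++) {
    browserSession.markUsed(); // pages reuse the session directly
}

console.log(browserSession.usageCount); // 6, far beyond maxUsageCount = 2
console.log(browserSession.usable);     // false, yet the browser keeps using it
```

A per-request crawler like CheerioCrawler goes through getSession() for every request, so the constraint fires as intended; the browser crawlers skip it after launch.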

Contributor Author

Maybe this could be solved by adding a getSession(currentSessionId) call to a prePageCreate hook and - if this call returns null - give the browser a new session (by calling getSession())?

@B4nan (Member) commented Mar 23, 2023

> Maybe this could be solved by adding a getSession(currentSessionId) call to a prePageCreate hook and - if this call returns null - give the browser a new session (by calling getSession())?

Sounds good to me (as long as it works; I thought we can't just change the proxy for an existing browser, at least in PW?).

But maybe we should stop thinking about workarounds and start working on the user pool, as that should help with this exact problem.
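The prePageCreate idea discussed above could look roughly like this. All names here are assumptions reconstructed from the discussion (a simplified pool, a hypothetical hook), not actual Crawlee API:

```typescript
// Hypothetical sketch of the proposed prePageCreate re-validation.
interface Session { id: string; usable: boolean; }

class SessionPool {
    private sessions = new Map<string, Session>();
    private counter = 0;

    // With an id: return that session only if it is still usable, else null.
    // Without an id: hand out a fresh session.
    getSession(id?: string): Session | null {
        if (id !== undefined) {
            const existing = this.sessions.get(id);
            return existing && existing.usable ? existing : null;
        }
        const session = { id: `session_${this.counter++}`, usable: true };
        this.sessions.set(session.id, session);
        return session;
    }

    retire(id: string): void {
        const s = this.sessions.get(id);
        if (s) s.usable = false;
    }
}

// The hook: before each new page, re-validate the browser's current session
// and swap in a fresh one if the old one was retired in the meantime.
function prePageCreateHook(pool: SessionPool, browserState: { sessionId: string }) {
    const current = pool.getSession(browserState.sessionId);
    if (current === null) {
        const fresh = pool.getSession()!;
        browserState.sessionId = fresh.id;
    }
}

// Example: the browser's session gets retired, the hook swaps in a new one.
const demoPool = new SessionPool();
const initial = demoPool.getSession()!;
const browserState = { sessionId: initial.id };
demoPool.retire(initial.id);
prePageCreateHook(demoPool, browserState);
console.log(browserState.sessionId !== initial.id); // true
```

Note this sketch only swaps the session id; as B4nan points out, swapping the proxy tied to an already-launched browser may not be possible, which is the open question with this approach.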

@barjin (Contributor Author) commented Mar 27, 2023

Closing in favour of user pool development.

@barjin barjin closed this Mar 27, 2023
@B4nan B4nan deleted the feat/session-locking branch July 28, 2023 11:49
@mnmkng (Member) commented Feb 2, 2024

> Closing in favour of user pool development.

This did not age well 😄

Successfully merging this pull request may close these issues.

SessionPool - locking down session to prevent concurrent usage.
3 participants