bug: Browser-based crawlers double increment session.usageCount #2851

Open
barjin opened this issue Feb 19, 2025 · 1 comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

barjin (Contributor) commented Feb 19, 2025

As mentioned in the comments on #1836, instances of BrowserCrawler track session.usageCount incorrectly. This is caused (partially?) by session.usageCount being incremented twice per request, first in BrowserCrawler:

    if (session) session.markGood();

and then in BasicCrawler:

    crawlingContext.session?.markGood();
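To illustrate the effect of the two calls, here is a minimal, self-contained sketch. It does not use Crawlee at all: `ToySession`, `basicLayer`, `browserLayer` and `simulate` are hypothetical names made up for this example, and the only behavior modeled is that each `markGood()` call bumps the usage counter by one.

```typescript
// Toy stand-in for a session; only the usage counter is modeled (assumption).
class ToySession {
    usageCount = 0;

    markGood(): void {
        this.usageCount += 1;
    }
}

// Hypothetical outer layer: runs the wrapped handler, then marks the session good.
function basicLayer(session: ToySession, handler: (s: ToySession) => void): void {
    handler(session);
    session.markGood();
}

// Hypothetical inner layer: records what a request handler would log,
// then marks the same session good a second time.
function browserLayer(session: ToySession, seen: number[]): void {
    seen.push(session.usageCount);
    session.markGood();
}

// Process `requests` requests against one shared session and
// return the usage counts observed at the start of each request.
function simulate(requests: number): number[] {
    const session = new ToySession();
    const seen: number[] = [];
    for (let i = 0; i < requests; i++) {
        basicLayer(session, (s) => browserLayer(s, seen));
    }
    return seen;
}

console.log(simulate(5)); // logs 0, 2, 4, 6, 8
```

With both layers marking the same object, the counter advances by two per request, so only even values are ever observed.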

Reproduction

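    import { PlaywrightCrawler } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        requestHandler: ({ session }) => {
            // Log the session's usage count as seen at the start of each request.
            console.log(session?.getState().usageCount);
        },
        sessionPoolOptions: {
            // A single session, so every request reuses it.
            maxPoolSize: 1,
        },
        // One request at a time, so the counts print in order.
        maxConcurrency: 1,
    });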

    await crawler.run([
        'https://example.org/1',
        'https://example.org/2',
        'https://example.org/3',
        'https://example.org/4',
        'https://example.org/5',
        'https://example.org/6',
        'https://example.org/7',
        'https://example.org/8',
        'https://example.org/9',
        'https://example.org/10',
    ]);

This prints only even numbers:

INFO  PlaywrightCrawler: Starting the crawler.
0
2
4
6
8
10
12
14
16
18

github-actions bot added the t-tooling label on Feb 19, 2025

barjin (Contributor, Author) commented Feb 19, 2025

This likely happened as a result of the fixes in #1709. Before that change, BasicCrawler and BrowserCrawler were touching different Session instances, so each incremented its own instance once per request and the code above would have printed the correct 0..9 sequence.

Since both code paths now access the same Session instance, could we simply remove the markGood call from BrowserCrawler?
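The three situations discussed here (separate instances before #1709, a shared instance with both calls, and a shared instance with only the BasicCrawler call) can be compared in a small model. This is only a sketch under assumptions, not Crawlee code: `CountingSession`, `Scenario` and `observedCounts` are hypothetical names, and the model assumes the request handler reads the session that the browser layer holds.

```typescript
// Toy session; only the usage counter is modeled (assumption).
class CountingSession {
    usageCount = 0;

    markGood(): void {
        this.usageCount += 1;
    }
}

interface Scenario {
    sharedInstance: boolean;   // do both layers hold the same session object?
    browserMarksGood: boolean; // does the browser layer also call markGood()?
}

// Return the usage counts a request handler would log across `requests` requests.
function observedCounts(requests: number, scenario: Scenario): number[] {
    const basicSession = new CountingSession();
    const browserSession = scenario.sharedInstance ? basicSession : new CountingSession();
    const seen: number[] = [];
    for (let i = 0; i < requests; i++) {
        seen.push(browserSession.usageCount);
        if (scenario.browserMarksGood) browserSession.markGood();
        basicSession.markGood();
    }
    return seen;
}

// Separate instances, both layers mark good: each counter grows by one.
const separate = observedCounts(4, { sharedInstance: false, browserMarksGood: true });
// Shared instance, both layers mark good: the counter grows by two.
const sharedBoth = observedCounts(4, { sharedInstance: true, browserMarksGood: true });
// Shared instance, only the basic layer marks good: the proposed change.
const sharedFixed = observedCounts(4, { sharedInstance: true, browserMarksGood: false });

console.log({ separate, sharedBoth, sharedFixed });
```

In this model, dropping the browser-layer call restores a step of one per request, which matches the fix suggested above. Whether the real BrowserCrawler call has other callers or side effects that still need it is not covered by the sketch.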
