
SessionPool generates more sessions than needed and also does not respect the "maxUsageCount" constraint. #1836

yellott opened this issue Mar 19, 2023 · 8 comments

yellott commented Mar 19, 2023

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

  1. Create a basic PlaywrightCrawler/PuppeteerCrawler/HttpCrawler (I've tried all of these).
  2. Set sessionPoolOptions.sessionOptions.maxUsageCount to 1 (it can be any value; the effect is just easiest to see with 1).
  3. Log the session id in the default request handler.
  4. Run the crawler with 10 start URLs.
  5. Compare the session id logs with the contents of the SDK_SESSION_POOL_STATE.json file.
  6. There will be 15 sessions in the file: 10 of them have usageCount equal to 0, while the other 5 have usageCount equal to 8.
  7. The console logs show that 5 sessions were in use and each was used twice.

Repro is here.
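For step 5, a minimal sketch of how one might dump the persisted session states, assuming Crawlee's default on-disk storage location and the field names visible in the screenshot below:

import { readFile } from 'node:fs/promises';

// Assumed default key-value store location for Crawlee's local storage.
const statePath = './storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json';
const state = JSON.parse(await readFile(statePath, 'utf8'));

for (const session of state.sessions) {
    console.log(session.id, 'usageCount:', session.usageCount, 'errorScore:', session.errorScore);
}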

Code sample

import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithStartUrls() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
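                // Intended effect: retire each session after a single use.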
                maxUsageCount: 1,
            },
        },
        requestHandler: async ({ session, request }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }
        },
    });

    await playwrightCrawler.run(Array.from(getUrls()));

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithStartUrls();

And this is one of the sessions in SDK_SESSION_POOL_STATE.json. As you can see, usageCount is greater than maxUsageCount and errorScore is greater than maxErrorScore.

(screenshot: a session entry from SDK_SESSION_POOL_STATE.json with usageCount exceeding maxUsageCount and errorScore exceeding maxErrorScore)

The variant below at least respects maxUsageCount, but it still generates twice as many sessions as needed.

import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const urlGenerator = getUrls();

const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithAddRequests() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
                maxUsageCount: 1,
            },
        },

        requestHandler: async ({ session, request, crawler }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }

            // NOTE: this addRequests() call is not awaited; see the comments below.
            const next = urlGenerator.next();
            if (!next.done) {
                crawler.addRequests([next.value]);
            }
        },
    });

    await playwrightCrawler.run(['http://localhost:3000/root']);

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithAddRequests();

Package version

3.3.0

Node.js version

v18.13.0

Operating system

Ubuntu 22.04.1 LTS on WSL 2

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

3.3.1-beta.10

Other context

No response

yellott added the bug label Mar 19, 2023
metalwarrior665 (Member) commented

Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page failed to load from localhost, your code in the request handler would not run.

B4nan (Member) commented Mar 20, 2023

Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.
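For reference, a minimal sketch of the awaited variant of the repro's handler (same urlGenerator and sessionStats as above):

        requestHandler: async ({ session, request, crawler }) => {
            // ... collect sessionStats as in the repro ...
            const next = urlGenerator.next();
            if (!next.done) {
                // Awaiting ensures the request is enqueued before the handler returns.
                await crawler.addRequests([next.value]);
            }
        },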

yellott (Author) commented Mar 20, 2023

> Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page failed to load from localhost, your code in the request handler would not run.

There is a local server in the repo :) This can be tested against any other server as well.
There are no errors in the debug logs; that is the most interesting part.

yellott (Author) commented Mar 20, 2023

> Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.

I've tried awaiting it as well, but this does not change the result.

yellott (Author) commented Mar 20, 2023

@B4nan I've updated the repo to await the crawler.addRequests calls. The data in SDK_SESSION_POOL_STATE.json still differs from the manually collected session usage stats.

barjin self-assigned this Mar 20, 2023
mtrunkat added the t-tooling label Sep 12, 2023
zopieux commented Apr 7, 2024

I can reproduce this. I use maxUsageCount: 1 as a workaround to spread requests across proxies more uniformly, because I don't like the "round-robin" strategy used by Crawlee: it only switches the proxy for a session once it sees errors, meaning it hammers one proxy until it fails, which is definitely not what I want.

Even though this helps with the spreading, the logs still show more than one request being made per session.
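A minimal sketch of that workaround configuration (the proxy URLs here are hypothetical, and per this issue the one-request-per-session intent does not actually hold):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Hypothetical proxies for illustration.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: {
        sessionOptions: {
            // Intended effect: a fresh session (and thus a proxy rotation) per request.
            maxUsageCount: 1,
        },
    },
    requestHandler: async ({ session, proxyInfo }) => {
        console.log(session?.id, proxyInfo?.url);
    },
});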

slow-groovin commented

I found that every request increases the usageCount of a session by 2:

import { PlaywrightCrawler } from 'crawlee';

// Helper assumed from the commenter's repro: integers from `from` to `to` inclusive.
const list = (from: number, to: number) =>
    Array.from({ length: to - from + 1 }, (_, i) => from + i);

const crawler = new PlaywrightCrawler({
    headless: true,
    maxRequestRetries: -1,
    useSessionPool: true,
    persistCookiesPerSession: true,

    async requestHandler({ session }) {
        console.log(session?.getState().usageCount);
    },
});

await crawler.run(list(1, 100).map((i) => `http://localhost:3000/mock/forCrawlee/cookie?order=10&q=${i}`));

vikyw89 commented Jan 29, 2025

I experienced something similar with the Playwright crawler.

Upon investigation:

  • this doesn't happen when maxConcurrency is 1 (I verified by logging in the preNavigation and postNavigation hooks).

This points to a bug in either:

  • the session usage increment (usage is incremented after the availability check?), or
  • batching: the session's usage is incremented correctly when it is taken from the pool, but the session gets reused/duplicated across concurrent batches.
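A minimal sketch of that kind of hook logging (the hooks receive the same context as the request handler; the URL is the localhost repro endpoint from above):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // With maxConcurrency: 1 the counts reportedly come out correct,
    // which hints at a race between concurrent batches.
    maxConcurrency: 2,
    preNavigationHooks: [
        async ({ session, request }) => {
            console.log('pre ', request.url, session?.id, session?.getState().usageCount);
        },
    ],
    postNavigationHooks: [
        async ({ session, request }) => {
            console.log('post', request.url, session?.id, session?.getState().usageCount);
        },
    ],
    requestHandler: async () => {},
});

await crawler.run(['http://localhost:3000/root']);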
