
SessionPool generates more sessions than needed and also does not respect the "maxUsageCount" constraint. #1836

yellott opened this issue Mar 19, 2023 · 8 comments

yellott commented Mar 19, 2023

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

  1. Create a basic PlaywrightCrawler/PuppeteerCrawler/HttpCrawler (I've tried all of these).
  2. Set sessionPoolOptions.sessionOptions.maxUsageCount to 1 (it can be any value; the effect is just easiest to see with 1).
  3. Log the session id in the default request handler.
  4. Run the crawler with 10 start URLs.
  5. Compare the session id logs with the contents of the SDK_SESSION_POOL_STATE.json file.
  6. There will be 15 sessions in the file: 10 of them have usageCount equal to 0, while the other 5 have usageCount equal to 8.
  7. The console logs show that 5 sessions were in use and each was used twice.

Repro is here.
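For step 5, a minimal sketch of how one might dump the persisted session states, assuming Crawlee's default on-disk storage location and the field names visible in the screenshot below:

import { readFile } from 'node:fs/promises';

// Assumed default key-value store location for Crawlee's local storage.
const statePath = './storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json';
const state = JSON.parse(await readFile(statePath, 'utf8'));

for (const session of state.sessions) {
    console.log(session.id, 'usageCount:', session.usageCount, 'errorScore:', session.errorScore);
}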

Code sample

import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithStartUrls() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
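                // Intended effect: retire each session after a single use.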
                maxUsageCount: 1,
            },
        },
        requestHandler: async ({ session, request }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }
        },
    });

    await playwrightCrawler.run(Array.from(getUrls()));

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithStartUrls();

And this is one of the sessions in SDK_SESSION_POOL_STATE.json. As you can see, usageCount is greater than maxUsageCount and errorScore is greater than maxErrorScore.

(screenshot: a session entry from SDK_SESSION_POOL_STATE.json with usageCount exceeding maxUsageCount and errorScore exceeding maxErrorScore)

The variant below at least respects maxUsageCount, but it still generates twice as many sessions as needed.

import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const urlGenerator = getUrls();

const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithAddRequests() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
                maxUsageCount: 1,
            },
        },

        requestHandler: async ({ session, request, crawler }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }

            // NOTE: this addRequests() call is not awaited; see the comments below.
            const next = urlGenerator.next();
            if (!next.done) {
                crawler.addRequests([next.value]);
            }
        },
    });

    await playwrightCrawler.run(['http://localhost:3000/root']);

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithAddRequests();

Package version

3.3.0

Node.js version

v18.13.0

Operating system

Ubuntu 22.04.1 LTS on WSL 2

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

3.3.1-beta.10

Other context

No response

yellott added the bug label Mar 19, 2023
metalwarrior665 (Member) commented

Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page failed to load from localhost, your code in the request handler would not run.

B4nan (Member) commented Mar 20, 2023

Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.
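For reference, a minimal sketch of the awaited variant of the repro's handler (same urlGenerator and sessionStats as above):

        requestHandler: async ({ session, request, crawler }) => {
            // ... collect sessionStats as in the repro ...
            const next = urlGenerator.next();
            if (!next.done) {
                // Awaiting ensures the request is enqueued before the handler returns.
                await crawler.addRequests([next.value]);
            }
        },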

yellott (Author) commented Mar 20, 2023

> Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page failed to load from localhost, your code in the request handler would not run.

There is a local server in the repo :) This can be tested against any other server as well.
There are no errors in the debug logs; that is the most interesting part.

yellott (Author) commented Mar 20, 2023

> Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.

I've tried awaiting it as well, but this does not change the result.

yellott (Author) commented Mar 20, 2023

@B4nan I've updated the repo to await the crawler.addRequests calls. The data in SDK_SESSION_POOL_STATE.json still differs from the manually collected session usage stats.

barjin self-assigned this Mar 20, 2023
mtrunkat added the t-tooling label Sep 12, 2023
zopieux commented Apr 7, 2024

I can reproduce this. I use maxUsageCount: 1 as a workaround to spread requests across proxies more uniformly, because I don't like the "round-robin" strategy used by Crawlee: it only switches the proxy for a session once it sees errors, meaning it hammers one proxy until it fails, which is definitely not what I want.

Even though this helps with the spreading, the logs still show more than one request being made per session.
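A minimal sketch of that workaround configuration (the proxy URLs here are hypothetical, and per this issue the one-request-per-session intent does not actually hold):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Hypothetical proxies for illustration.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: {
        sessionOptions: {
            // Intended effect: a fresh session (and thus a proxy rotation) per request.
            maxUsageCount: 1,
        },
    },
    requestHandler: async ({ session, proxyInfo }) => {
        console.log(session?.id, proxyInfo?.url);
    },
});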

slow-groovin commented

I found that every request increases the usageCount of a session by 2:

import { PlaywrightCrawler } from 'crawlee';

// Helper assumed from the commenter's repro: integers from `from` to `to` inclusive.
const list = (from: number, to: number) =>
    Array.from({ length: to - from + 1 }, (_, i) => from + i);

const crawler = new PlaywrightCrawler({
    headless: true,
    maxRequestRetries: -1,
    useSessionPool: true,
    persistCookiesPerSession: true,

    async requestHandler({ session }) {
        console.log(session?.getState().usageCount);
    },
});

await crawler.run(list(1, 100).map((i) => `http://localhost:3000/mock/forCrawlee/cookie?order=10&q=${i}`));

vikyw89 commented Jan 29, 2025

I experienced something similar with the Playwright crawler.

Upon investigation:

  • this doesn't happen when maxConcurrency is 1 (I verified by logging in the preNavigation and postNavigation hooks).

This points to a bug in either:

  • the session usage increment (usage is incremented after the availability check?), or
  • batching: the session's usage is incremented correctly when it is taken from the pool, but the session gets reused/duplicated across concurrent batches.
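A minimal sketch of that kind of hook logging (the hooks receive the same context as the request handler; the URL is the localhost repro endpoint from above):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // With maxConcurrency: 1 the counts reportedly come out correct,
    // which hints at a race between concurrent batches.
    maxConcurrency: 2,
    preNavigationHooks: [
        async ({ session, request }) => {
            console.log('pre ', request.url, session?.id, session?.getState().usageCount);
        },
    ],
    postNavigationHooks: [
        async ({ session, request }) => {
            console.log('post', request.url, session?.id, session?.getState().usageCount);
        },
    ],
    requestHandler: async () => {},
});

await crawler.run(['http://localhost:3000/root']);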
