
Parameter sizeLimit is not ending the crawl correctly #256

Closed
gitreich opened this issue Mar 20, 2023 · 6 comments

Comments

@gitreich
Contributor

Using Browsertrix Crawler in the terminal, I was testing all the limits that end a crawl, but sizeLimit did not end my crawl correctly.

Here is the starting command for sizeLimit:
docker run -p 9037:9037 -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler crawl --url "http://falter.at" --profile /crawls/profiles/profile_falter.tar.gz --sizeLimit 2000 --text --depth 3 --scopeType domain --screencastPort 9037
I also removed the profile with the command:
docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler crawl --url "http://falter.at" --sizeLimit 2000 --text --depth 3 --scopeType domain
but that did not change anything about the sizeLimit quitting behavior.

I think the error comes from crawler.js line 515:
const size = await getDirSize(dir);
I was expecting the log entry "Size threshold reached ..." in the console, but I never saw it.

See also Log File:
crawl-20230320113927564.log
crawl-20230320122513349.log

@ikreymer
Member

Can you try the current 0.9.0 beta on main? I think there was a bug related to this, but it should be fixed in the latest main.

@ikreymer
Member

Specifically, it was fixed via #241. We could also do a 0.8.2 release if this is urgent.

@gitreich
Contributor Author

Thank you, and no, it is not urgent.
I will not have time to retest before next Monday. (But maybe someone can provide a quick start for building the project locally, which I would need to do, as there is no 0.9.0 beta Docker image available?)

@ikreymer
Member

Just pushed a webrecorder/browsertrix-crawler:0.9.0-beta.1 image which you should be able to try as well.

@gitreich
Contributor Author

It's now working:
{"logLevel":"info","timestamp":"2023-03-24T08:26:21.145Z","context":"general","message":"Size threshold reached 9716966 >= 2000, stopping","details":{}}

@ikreymer
Member

Great, thanks for testing!
