Can't archive a page - 2 different environments, 2 different results #288
Comments
I think the main issue is that the autoscroll behavior doesn't currently detect that the page has dynamic resources, and thus skips slow scrolling, resulting in an incomplete capture. This type of detection is tricky, but we'll see if we can figure out why that may be.

In the first example, I'm surprised no WACZ files were created; perhaps something else was wrong (a permissions issue?).

In the second example, I think Redis startup was just a bit slow, hence the initial error, but since it completed the crawl, it was able to connect. My guess is that if you look at the full logs, they will also say scrolling was skipped here, for the reason mentioned above. If you run …
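For context, the autoscroll behavior decides whether slow scrolling is worthwhile by probing whether the page actually reacts to scroll events. Below is a minimal sketch of that kind of check, not the actual browsertrix-behaviors code; the function name, scroll distance, and wait time are illustrative assumptions. It runs in the page context (e.g. via page.evaluate):

// Illustrative only: probe whether the page responds to programmatic scrolling.
// A page with custom scroll handling or a fixed-height viewport may not move,
// in which case slow autoscroll would be skipped.
async function isScrollResponsive(): Promise<boolean> {
  const before = window.scrollY;
  window.scrollBy(0, 200);                             // attempt a small scroll
  await new Promise((resolve) => setTimeout(resolve, 250)); // give the page time to react
  const moved = window.scrollY !== before;
  window.scrollTo(0, before);                          // restore the original position
  return moved;
}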
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
potential fix for #288
bump version to 0.9.1
It turns out the issue was not the autoscroll behavior, but accidentally disabling service workers altogether!
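For anyone curious what the fix amounts to: service workers should only be blocked when a custom browser profile is in use, matching the 0.8.x behavior. A rough sketch of that idea using Playwright's serviceWorkers context option is below; launchCrawlContext and profileUrl are hypothetical names standing in for the crawler's actual profile handling, not its real code:

import { chromium } from "playwright";

// Illustrative sketch: block service workers only when a custom profile is used.
// Before the fix, service workers were always blocked, which broke pages (like
// the one in this issue) that rely on them to load dynamic resources.
async function launchCrawlContext(profileUrl?: string) {
  const browser = await chromium.launch();
  return browser.newContext({
    serviceWorkers: profileUrl ? "block" : "allow",
  });
}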
Thank you for looking into this. The RedHat server I use doesn't have that issue. However, although the page is successfully archived, I noticed a couple of things:
Below is the command I ran and the generated terminal output:
Hm, yes, it seems like it's being blocked on your end somehow, which is a bit harder to investigate, but a different non-pywb approach we're working on may solve this in the future.
Great! The 0.9.1 release will be out shortly with these fixes.
Thanks, this is now fixed in #296 and also added to 0.9.1.
This is all fine! The Redis warning just means Redis is loading slowly; it eventually loads.
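In other words, the crawler just keeps retrying the connection until the bundled Redis has started, so the warning at startup is harmless. A small illustrative sketch of that wait-and-retry pattern with ioredis (the library named in the log) is below; this is not the crawler's exact code, and the one-second retry delay and waitForRedis helper are assumptions:

import Redis from "ioredis";

// Illustrative: keep retrying the connection to the bundled Redis instance.
const redis = new Redis("redis://localhost:6379/0", {
  retryStrategy: () => 1000,   // wait 1s between connection attempts
  maxRetriesPerRequest: null,  // keep queued commands alive until Redis is up
});

// Errors before the first successful connect surface as warnings like the one in the log.
redis.on("error", (err) => console.warn("Waiting for redis:", err.message));

// Resolves once Redis is actually reachable (the ping is queued until then).
async function waitForRedis(): Promise<void> {
  await redis.ping();
}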
The 0.9.1 release is now out with the fixes mentioned.
…ce workers if no profile used (#297)

* browser: just pass profileUrl and track if custom profile is used

  browser: don't disable service workers always (accidentally added as part of playwright migration)
  only disable if using profile, same as 0.8.x behavior
  fix for #288

* Fix full page screenshot (#296)

--------

Co-authored-by: Tessa Walsh <[email protected]>
Hi
I'm attempting to archive a single page and am unsuccessful.
Command
$ podman run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://nteconomy.nt.gov.au/labour-market?expanded=1 --generateWACZ --text --scopeType page --collection labour-market
Ubuntu in WSL
I get no archive because Browsertrix thinks it can't scroll the page:
Output:
{"logLevel":"info","timestamp":"2023-04-17T01:28:46.686Z","context":"general","message":"Browsertrix-Crawler 0.9.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:28:49.574Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:28:49.715Z","context":"general","message":"Crawl status: done","details":{}}
No WACZ/WARC files are created.
RHEL 7.9
I'm using the exact same command. Here it can scroll the page, but there's something wrong with Redis:
Output:
{"logLevel":"info","timestamp":"2023-04-17T01:36:33.770Z","context":"general","message":"Browsertrix-Crawler 0.9.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
...
{"logLevel":"warn","timestamp":"2023-04-17T01:36:34.110Z","context":"redis","message":"ioredis error","details":{"error":"[ioredis] Unhandled error event:"}}
{"logLevel":"warn","timestamp":"2023-04-17T01:36:34.111Z","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:37:45.483Z","context":"general","message":"Crawl status: done","details":{}}
It creates the WACZ/WARC files; however, when viewing the WACZ in ReplayWeb.page (v1.7.14), the graphs and images are missing.
Thank you for any help you can provide.