Can't archive a page - 2 different environments, 2 different results #288

Closed

ArtHoff opened this issue Apr 17, 2023 · 5 comments

@ArtHoff

ArtHoff commented Apr 17, 2023

Hi
I'm attempting to archive a single page and am unsuccessful.

Command

$ podman run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://nteconomy.nt.gov.au/labour-market?expanded=1 --generateWACZ --text --scopeType page --collection labour-market

Ubuntu in WSL

I get no archive because Browsertrix thinks it can't scroll the page:
Output:
{"logLevel":"info","timestamp":"2023-04-17T01:28:46.686Z","context":"general","message":"Browsertrix-Crawler 0.9.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:28:49.574Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:28:49.715Z","context":"general","message":"Crawl status: done","details":{}}

No wacz/warcs are created.

RHEL 7.9

I'm using the exact same command. Here it can scroll the page, but there's something wrong with Redis:
Output:
{"logLevel":"info","timestamp":"2023-04-17T01:36:33.770Z","context":"general","message":"Browsertrix-Crawler 0.9.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
...
{"logLevel":"warn","timestamp":"2023-04-17T01:36:34.110Z","context":"redis","message":"ioredis error","details":{"error":"[ioredis] Unhandled error event:"}}
{"logLevel":"warn","timestamp":"2023-04-17T01:36:34.111Z","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:37:45.483Z","context":"general","message":"Crawl status: done","details":{}}

It creates wacz/warc files; however, when viewing the wacz in ReplayWeb.page (v1.7.14), the graphs and images are missing.

Thank you for any help you can provide.

@ikreymer
Member

I think the main issue is that the autoscroll behavior doesn't detect that the page has dynamic resources at the moment, and thus skips slow scrolling, resulting in an incomplete capture. This type of detection is tricky, but we'll see if we can figure out why that may be.
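
(For readers following along: the detection in question boils down to probing whether the page actually moves when you scroll it. Below is a minimal sketch of that kind of check against a Playwright-style Page object; it is illustrative only, not the real behaviors code, and the 500 ms settle time is an arbitrary assumption.)

```ts
import type { Page } from "playwright";

// Illustrative only -- not the real Browsertrix behaviors code.
// Probe whether the page actually moves when scrolled before committing
// to a slow autoscroll pass.
async function seemsScrollable(page: Page): Promise<boolean> {
  return page.evaluate(async () => {
    const before = window.scrollY;
    window.scrollBy(0, window.innerHeight);
    // give the page a moment to react (e.g. to kick off lazy loading)
    await new Promise((resolve) => setTimeout(resolve, 500));
    const moved = window.scrollY !== before;
    window.scrollTo(0, before); // restore the original position
    return moved;
  });
}
```

A page that renders its content inside a fixed-height container or an iframe can report no scroll movement even though it has dynamic content, which is one way a page like this one could end up flagged as non-scrollable.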

In the first example, I'm surprised no WACZ files were created; perhaps something else was wrong (a permissions issue?).
Were there any other errors printed? It will usually print an error if WACZ creation fails. I would expect that case to be similar to the second: a WACZ is created, but some of the elements are missing due to the lack of scrolling.

In the second example, I think redis startup was just a bit slow, hence the initial error, but since it completed the crawl, it was able to connect. My guess is that if you look at the full logs, they will say that scrolling was skipped here as well, for the reason mentioned above.

If you run podman run -p 9037:9037 -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://nteconomy.nt.gov.au/labour-market?expanded=1 --generateWACZ --text --scopeType page --collection labour-market --screencastPort 9037, you should be able to open localhost:9037 and see what the crawler is doing.

ikreymer added a commit that referenced this issue Apr 22, 2023
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
potential fix for #288
bump version to 0.9.1
@ikreymer
Member

It turns out the issue was not the autoscroll behavior, but accidentally disabling service workers altogether!
With that restored, it seems everything is being captured now. It does take a few seconds for the graphs to show up in replay, but they do all render using the default crawling params above.
Could you try running with the https://github.com/webrecorder/browsertrix-crawler/tree/0.9.x branch?
If it all works for you as well, we can release this as 0.9.1.
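
(A rough sketch of what the conditional looks like in Playwright terms, in case it helps anyone reading along. This is not the crawler's actual code, and profileUrl here is just a hypothetical stand-in for the crawler's profile option:)

```ts
import { chromium } from "playwright";

// Rough sketch only -- not the actual crawler code. `profileUrl` is a
// hypothetical stand-in for the crawler's profile option.
async function launchCrawlContext(profileUrl?: string) {
  const browser = await chromium.launch();
  return browser.newContext({
    // Only block service workers when a saved profile is in use
    // (the pre-0.9 behavior); otherwise let the page register them.
    serviceWorkers: profileUrl ? "block" : "allow",
  });
}
```

Blocking service workers on every crawl meant pages that depend on one to load their content (like the graphs here, apparently) came up incomplete; restricting the block to profile-based crawls restores the 0.8.x behavior.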

@ArtHoff
Author

ArtHoff commented Apr 24, 2023

Thank you for looking into this.
I've used the --screencastPort option and found that there seems to be a connection issue on the WSL installation I tested this on.
Pywb error

The RedHat server I use doesn't have that issue.
I built the new image on the RedHat server and ran it. It created the wacz file and this time it is complete, including all graphs. Thank you very much.

However, although the page is successfully archived, I noticed a couple of things:

  1. Full page screenshots don't work
  • Using the option: --screenshot fullPage does not create a full page screenshot. It just shows the top part of the page, which seems to be the same as what the --screenshot view option shows.
  2. Log messages
  • a. The output still lists the redis warning.
  • b. The output still informs that the page can't be scrolled.

Below is the command I ran and the generated terminal output:
podman run -p 9037:9037 -v $PWD/crawls:/crawls/ -it localhost/browsertrix-crawler:0.9.1 crawl --url https://nteconomy.nt.gov.au/labour-market?expanded=1 --generateWACZ --text --scopeType page --collection labour-market --screencastPort 9037 --screenshot fullPage,view
{"logLevel":"info","timestamp":"2023-04-24T05:44:42.951Z","context":"general","message":"Browsertrix-Crawler 0.9.1 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:42.953Z","context":"general","message":"Seeds","details":[{"url":"https://nteconomy.nt.gov.au/labour-market?expanded=1","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"warn","timestamp":"2023-04-24T05:44:43.263Z","context":"redis","message":"ioredis error","details":{"error":"[ioredis] Unhandled error event:"}}
{"logLevel":"warn","timestamp":"2023-04-24T05:44:43.264Z","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.637Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.638Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.865Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1"}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.867Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"limit":{"max":0,"hit":false},"pendingPages":["{"seedId":0,"started":"2023-04-24T05:44:50.640Z","url":"https://nteconomy.nt.gov.au/labour-market?expanded=1","added":"2023-04-24T05:44:46.338Z","depth":0}"]}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.978Z","context":"general","message":"Awaiting page load","details":{"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"warn","timestamp":"2023-04-24T05:45:17.909Z","context":"general","message":"Invalid Seed - URL must start with http:// or https://","details":{"url":"mailto:[email protected]","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:19.920Z","context":"general","message":"Screenshot (type: view) for https://nteconomy.nt.gov.au/labour-market?expanded=1 written to /crawls/collections/labour-market/archive/screenshots.warc.gz","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:20.429Z","context":"general","message":"Screenshot (type: fullPage) for https://nteconomy.nt.gov.au/labour-market?expanded=1 written to /crawls/collections/labour-market/archive/screenshots.warc.gz","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:20.519Z","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://nteconomy.nt.gov.au/labour-market?expanded=1"],"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:20.519Z","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://nteconomy.nt.gov.au/labour-market?expanded=1","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.035Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.036Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.038Z","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://nteconomy.nt.gov.au/labour-market?expanded=1","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.043Z","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.044Z","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.116Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.586Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.587Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:22.598Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:22.600Z","context":"general","message":"Num WARC Files: 13","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:25.790Z","context":"general","message":"Crawl status: done","details":{}}

@ikreymer
Member

Thank you for looking into this. I've used the --screencastPort option and found that there seems to be a connection issue on the WSL installation I tested this on. Pywb error

Hm, yes, it seems like it's being blocked on your end somehow, which is a bit harder to investigate, but a different non-pywb approach we're working on may solve this in the future.

The RedHat server I use doesn't have that issue. I built the new image on the RedHat server and ran it. It created the wacz file and this time it is complete, including all graphs. Thank you very much.

Great! The 0.9.1 release will be out shortly with these fixes.

However, although the page is successfully archived I noticed a couple of things:

  1. Full page screenshots don't work
  • Using the option: --screenshot fullPage does not create a full page screenshot. It just shows the top part of the page, which seems to be the same as what the --screenshot view option shows.

Thanks, this is now fixed in #296 and also added to 0.9.1.
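
(For reference, a full-page capture in Playwright just needs the fullPage flag on the screenshot call; a minimal sketch of the two screenshot types, not the crawler's actual implementation:)

```ts
import type { Page } from "playwright";

// Minimal sketch (not the crawler's actual screenshot code): the difference
// between the two screenshot types boils down to the fullPage flag.
async function takeScreenshots(page: Page) {
  const view = await page.screenshot();                       // viewport only
  const fullPage = await page.screenshot({ fullPage: true }); // entire scrollable page
  return { view, fullPage };
}
```

Without fullPage: true, screenshot() only captures the current viewport, which matches the "top of the page only" result described above.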

  2. Log messages
  • a. The output still lists the redis warning.
  • b. The output still informs that the page can't be scrolled.

This is all fine! The redis warning just means that redis is loading slowly; it connects eventually.
The scrolling message means the behavior has not detected a need to manually scroll the page, which is accurate -- it captured everything without having to slowly scroll through the page, as that was not the issue.
We can think about ways to tweak the logging to be a bit clearer.

@ikreymer
Member

The 0.9.1 release is now out with the fixes mentioned.

@github-project-automation github-project-automation bot moved this from Triage to Done! in Webrecorder Projects Apr 24, 2023
ikreymer added a commit that referenced this issue Apr 24, 2023
…ce workers if no profile used (#297)

* browser: just pass profileUrl and track if custom profile is used
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
fix for #288

* Fix full page screenshot (#296)
---------

Co-authored-by: Tessa Walsh <[email protected]>