Can't archive a page - 2 different environments, 2 different results #288

Closed

ArtHoff opened this issue Apr 17, 2023 · 5 comments

@ArtHoff

ArtHoff commented Apr 17, 2023

Hi
I'm attempting to archive a single page and am unsuccessful.

Command

$ podman run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://nteconomy.nt.gov.au/labour-market?expanded=1 --generateWACZ --text --scopeType page --collection labour-market

Ubuntu in WSL

I get no archive because Browsertrix thinks it can't scroll the page:
Output:
{"logLevel":"info","timestamp":"2023-04-17T01:28:46.686Z","context":"general","message":"Browsertrix-Crawler 0.9.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:28:49.574Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:28:49.715Z","context":"general","message":"Crawl status: done","details":{}}

No wacz/warcs are created.

RHEL 7.9

I'm using the exact same command. Here it can scroll the page, but there's something wrong with Redis:
Output:
{"logLevel":"info","timestamp":"2023-04-17T01:36:33.770Z","context":"general","message":"Browsertrix-Crawler 0.9.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
...
{"logLevel":"warn","timestamp":"2023-04-17T01:36:34.110Z","context":"redis","message":"ioredis error","details":{"error":"[ioredis] Unhandled error event:"}}
{"logLevel":"warn","timestamp":"2023-04-17T01:36:34.111Z","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
...
{"logLevel":"info","timestamp":"2023-04-17T01:37:45.483Z","context":"general","message":"Crawl status: done","details":{}}

It creates wacz/warc files; however, when viewing the wacz in ReplayWeb.page (v1.7.14), the graphs and images are missing.

Thank you for any help you can provide.

@ikreymer
Member

I think the main issue is that the autoscroll behavior doesn't detect that the page has dynamic resources at the moment, and thus skips slow scrolling, resulting in an incomplete capture. This type of detection is tricky, but we'll see if we can figure out why that may be.
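
(For readers following along: the detection in question boils down to probing whether the page actually moves when you scroll it. Below is a minimal sketch of that kind of check against a Playwright-style Page object; it is illustrative only, not the real behaviors code, and the 500 ms settle time is an arbitrary assumption.)

```ts
import type { Page } from "playwright";

// Illustrative only -- not the real Browsertrix behaviors code.
// Probe whether the page actually moves when scrolled before committing
// to a slow autoscroll pass.
async function seemsScrollable(page: Page): Promise<boolean> {
  return page.evaluate(async () => {
    const before = window.scrollY;
    window.scrollBy(0, window.innerHeight);
    // give the page a moment to react (e.g. to kick off lazy loading)
    await new Promise((resolve) => setTimeout(resolve, 500));
    const moved = window.scrollY !== before;
    window.scrollTo(0, before); // restore the original position
    return moved;
  });
}
```

A page that renders its content inside a fixed-height container or an iframe can report no scroll movement even though it has dynamic content, which is one way a page like this one could end up flagged as non-scrollable.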

In the first example, I'm surprised no WACZ files were created; perhaps something else was wrong (a permissions issue?).
Were there any other errors printed? It will usually print an error if WACZ creation fails. I would expect that case to be similar to the second: a WACZ is created, but some of the elements are missing due to the lack of scrolling.

In the second example, I think redis startup was just a bit slow, hence the initial error, but since it completed the crawl, it was able to connect. My guess is that if you look at the full logs, they will say that scrolling was skipped here as well, for the reason mentioned above.

If you run podman run -p 9037:9037 -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://nteconomy.nt.gov.au/labour-market?expanded=1 --generateWACZ --text --scopeType page --collection labour-market --screencastPort 9037, you should be able to open localhost:9037 and see what the crawler is doing.

ikreymer added a commit that referenced this issue Apr 22, 2023
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
potential fix for #288
bump version to 0.9.1
@ikreymer
Member

It turns out the issue was not the autoscroll behavior, but accidentally disabling service workers altogether!
With that restored, it seems everything is being captured now. It does take a few seconds for the graphs to show up in replay, but they do all render using the default crawling params above.
Could you try running with the https://github.com/webrecorder/browsertrix-crawler/tree/0.9.x branch?
If it all works for you as well, we can release this as 0.9.1.
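
(A rough sketch of what the conditional looks like in Playwright terms, in case it helps anyone reading along. This is not the crawler's actual code, and profileUrl here is just a hypothetical stand-in for the crawler's profile option:)

```ts
import { chromium } from "playwright";

// Rough sketch only -- not the actual crawler code. `profileUrl` is a
// hypothetical stand-in for the crawler's profile option.
async function launchCrawlContext(profileUrl?: string) {
  const browser = await chromium.launch();
  return browser.newContext({
    // Only block service workers when a saved profile is in use
    // (the pre-0.9 behavior); otherwise let the page register them.
    serviceWorkers: profileUrl ? "block" : "allow",
  });
}
```

Blocking service workers on every crawl meant pages that depend on one to load their content (like the graphs here, apparently) came up incomplete; restricting the block to profile-based crawls restores the 0.8.x behavior.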

@ArtHoff
Author

ArtHoff commented Apr 24, 2023

Thank you for looking into this.
I've used the --screencastPort option and found that there seems to be a connection issue on the WSL installation I tested this on.
Pywb error

The RedHat server I use doesn't have that issue.
I built the new image on the RedHat server and ran it. It created the wacz file and this time it is complete, including all graphs. Thank you very much.

However, although the page is successfully archived, I noticed a couple of things:

  1. Full page screenshots don't work
  • Using the option: --screenshot fullPage does not create a full page screenshot. It just shows the top part of the page, which seems to be the same as what the --screenshot view option shows.
  2. Log messages
  • a. The output still lists the redis warning.
  • b. The output still informs that the page can't be scrolled.

Below is the command I ran and the generated terminal output:
podman run -p 9037:9037 -v $PWD/crawls:/crawls/ -it localhost/browsertrix-crawler:0.9.1 crawl --url https://nteconomy.nt.gov.au/labour-market?expanded=1 --generateWACZ --text --scopeType page --collection labour-market --screencastPort 9037 --screenshot fullPage,view
{"logLevel":"info","timestamp":"2023-04-24T05:44:42.951Z","context":"general","message":"Browsertrix-Crawler 0.9.1 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:42.953Z","context":"general","message":"Seeds","details":[{"url":"https://nteconomy.nt.gov.au/labour-market?expanded=1","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"warn","timestamp":"2023-04-24T05:44:43.263Z","context":"redis","message":"ioredis error","details":{"error":"[ioredis] Unhandled error event:"}}
{"logLevel":"warn","timestamp":"2023-04-24T05:44:43.264Z","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.637Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.638Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.865Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1"}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.867Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"limit":{"max":0,"hit":false},"pendingPages":["{"seedId":0,"started":"2023-04-24T05:44:50.640Z","url":"https://nteconomy.nt.gov.au/labour-market?expanded=1","added":"2023-04-24T05:44:46.338Z","depth":0}"]}}
{"logLevel":"info","timestamp":"2023-04-24T05:44:50.978Z","context":"general","message":"Awaiting page load","details":{"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"warn","timestamp":"2023-04-24T05:45:17.909Z","context":"general","message":"Invalid Seed - URL must start with http:// or https://","details":{"url":"mailto:[email protected]","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:19.920Z","context":"general","message":"Screenshot (type: view) for https://nteconomy.nt.gov.au/labour-market?expanded=1 written to /crawls/collections/labour-market/archive/screenshots.warc.gz","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:20.429Z","context":"general","message":"Screenshot (type: fullPage) for https://nteconomy.nt.gov.au/labour-market?expanded=1 written to /crawls/collections/labour-market/archive/screenshots.warc.gz","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:20.519Z","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://nteconomy.nt.gov.au/labour-market?expanded=1"],"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:20.519Z","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://nteconomy.nt.gov.au/labour-market?expanded=1","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.035Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.036Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.038Z","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://nteconomy.nt.gov.au/labour-market?expanded=1","page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.043Z","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.044Z","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://nteconomy.nt.gov.au/labour-market?expanded=1","workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.116Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.586Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:21.587Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:22.598Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:22.600Z","context":"general","message":"Num WARC Files: 13","details":{}}
{"logLevel":"info","timestamp":"2023-04-24T05:45:25.790Z","context":"general","message":"Crawl status: done","details":{}}

@ikreymer
Member

Thank you for looking into this. I've used the --screencastPort option and found that there seems to be a connection issue on the WSL installation I tested this on. Pywb error

Hm, yes, it seems like it's being blocked on your end somehow, which is a bit harder to investigate, but a different non-pywb approach we're working on may solve this in the future.

The RedHat server I use doesn't have that issue. I built the new image on the RedHat server and ran it. It created the wacz file and this time it is complete, including all graphs. Thank you very much.

Great! The 0.9.1 release will be out shortly with these fixes.

However, although the page is successfully archived I noticed a couple of things:

  1. Full page screenshots don't work
  • Using the option: --screenshot fullPage does not create a full page screenshot. It just shows the top part of the page, which seems to be the same as what the --screenshot view option shows.

Thanks, this is now fixed in #296 and also added to 0.9.1.
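
(For reference, a full-page capture in Playwright just needs the fullPage flag on the screenshot call; a minimal sketch of the two screenshot types, not the crawler's actual implementation:)

```ts
import type { Page } from "playwright";

// Minimal sketch (not the crawler's actual screenshot code): the difference
// between the two screenshot types boils down to the fullPage flag.
async function takeScreenshots(page: Page) {
  const view = await page.screenshot();                       // viewport only
  const fullPage = await page.screenshot({ fullPage: true }); // entire scrollable page
  return { view, fullPage };
}
```

Without fullPage: true, screenshot() only captures the current viewport, which matches the "top of the page only" result described above.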

  2. Log messages
  • a. The output still lists the redis warning.
  • b. The output still informs that the page can't be scrolled.

This is all fine! The redis warning just means that redis is loading slowly; it connects eventually.
The scrolling message means the behavior has not detected a need to manually scroll the page, which is accurate -- it captured everything without having to slowly scroll through the page, as that was not the issue.
We can think about ways to tweak the logging to be a bit clearer.

@ikreymer
Member

The 0.9.1 release is now out with the fixes mentioned.

@github-project-automation github-project-automation bot moved this from Triage to Done! in Webrecorder Projects Apr 24, 2023
ikreymer added a commit that referenced this issue Apr 24, 2023
…ce workers if no profile used (#297)

* browser: just pass profileUrl and track if custom profile is used
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
fix for #288

* Fix full page screenshot (#296)
---------

Co-authored-by: Tessa Walsh <[email protected]>