Apply exclusions to redirects #745

Merged
ikreymer merged 5 commits into main from skip-redirect-excluded
Jan 28, 2025

ikreymer
Member

- if a page redirects to an excluded URL, block loading of the page
- mark the page as excluded, don't retry it, and don't write it to the page list
- support generic blocking of pages based on the initial page response
- fixes #744
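
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `PageState`, `isExcluded`, `handleInitialResponse`, and `pagesToWrite` are hypothetical names, and exclusions are assumed to be regular expressions tested against the full URL.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the crawler's real API.

interface PageState {
  url: string; // URL as originally queued
  excluded: boolean; // set when the page is blocked by an exclusion
  retry: boolean; // whether the page may be retried later
}

// Returns true if any exclusion pattern matches the URL.
// Patterns are assumed to have no "g" flag, so test() is stateless.
function isExcluded(url: string, exclusions: RegExp[]): boolean {
  return exclusions.some((rx) => rx.test(url));
}

// Decide what to do once the initial page response is known.
// `finalUrl` is the URL reached after any redirects.
function handleInitialResponse(
  state: PageState,
  finalUrl: string,
  exclusions: RegExp[],
): "load" | "block" {
  if (finalUrl !== state.url && isExcluded(finalUrl, exclusions)) {
    state.excluded = true; // mark the page as excluded
    state.retry = false; // do not retry it
    return "block"; // caller should stop loading the page
  }
  return "load";
}

// Keep only pages that were not excluded, mirroring "don't write to page list".
function pagesToWrite(states: PageState[]): string[] {
  return states.filter((s) => !s.excluded).map((s) => s.url);
}
```

Keeping the decision in one function that only looks at the initial response is one way to read the "generic blocking" bullet: other reasons to block a page could return `"block"` from the same place.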
Member Author

ikreymer commented Jan 27, 2025

Additional testing:
docker run -p 9037:9037 -v $PWD/crawls:/crawls -it webrecorder/browsertrix-crawler crawl --url "https://tararecuperavel.org/2014/01/09/ilha-de-plastico-chega-a-praia-do-baleal/img_1188/" --screencastPort 9037 --exclude facebook.com --exclude x.com --collection redir-test

The page includes share links to x.com and facebook.com, which should now be excluded.

There is also a test in ./tests/exclude-redirected.test.js

ikreymer requested a review from tw4l January 27, 2025 06:38
Member

tw4l commented Jan 28, 2025

The manual testing example seems to work for Facebook but not for X - I'm still seeing it redirect to x.com in the screencast and the logs include lines such as:

{
  "timestamp": "2025-01-28T18:14:10.328Z",
  "logLevel": "info",
  "context": "behaviorScript",
  "message": "Behavior log",
  "details": {
    "state": {
      "tweets": 0,
      "images": 0,
      "videos": 0,
      "threads": 1
    },
    "msg": "done!",
    "page": "https://x.com/intent/post?via=TaraRecupervel&related=wordpressdotcom&text=IMG_1188&url=https%3A%2F%2Ftararecuperavel.org%2F2014%2F01%2F09%2Filha-de-plastico-chega-a-praia-do-baleal%2Fimg_1188%2F&mx=2",
    "workerid": 0
  }
}
{
  "timestamp": "2025-01-28T18:14:10.833Z",
  "logLevel": "info",
  "context": "behavior",
  "message": "Run Script Finished",
  "details": {
    "frameUrl": "https://x.com/intent/post?via=TaraRecupervel&related=wordpressdotcom&text=IMG_1188&url=https%3A%2F%2Ftararecuperavel.org%2F2014%2F01%2F09%2Filha-de-plastico-chega-a-praia-do-baleal%2Fimg_1188%2F&mx=2",
    "page": "https://tararecuperavel.org/2014/01/09/ilha-de-plastico-chega-a-praia-do-baleal/img_1188/?share=twitter&nb=1",
    "workerid": 0
  }
}

(pretty printed for legibility)

Member

tw4l commented Jan 28, 2025

Ah, it didn't work because the redirect was still to https://twitter.com/.... With --exclude twitter.com added to the test command, I can confirm the behavior is as expected in manual testing. Going to take a closer look at the code now.
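
The mismatch described above is easy to reproduce in isolation. The sketch below assumes each `--exclude` value is treated as a regular expression tested against the URL. The helper `matchingExclusions` is hypothetical, and the `twitter.com` URL is made up for illustration, since the real redirect target is not shown in full in this thread.

```typescript
// Hypothetical helper: returns the exclusion patterns that match a URL.
// Each pattern string is compiled as a regular expression (an assumption).
function matchingExclusions(url: string, patterns: string[]): string[] {
  return patterns.filter((p) => new RegExp(p).test(url));
}

// Exclusions from the original test command, and with twitter.com added.
const original = ["facebook.com", "x.com"];
const amended = [...original, "twitter.com"];

// Illustrative URL only; the actual redirect target is truncated above.
const redirectTarget = "https://twitter.com/intent/tweet";

// With the original exclusions nothing matches, so the redirect is allowed.
console.log(matchingExclusions(redirectTarget, original));

// With twitter.com added, that pattern matches and the page can be blocked.
console.log(matchingExclusions(redirectTarget, amended));
```

Under that assumption, an exclusion for the final domain does not by itself cover an intermediate redirect hop on a different domain, so each domain in the chain needs its own pattern.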

ikreymer merged commit a00866b into main Jan 28, 2025
4 checks passed
ikreymer deleted the skip-redirect-excluded branch January 28, 2025 19:28
Development

Successfully merging this pull request may close these issues.

Ensure exclusions apply to pages that redirect
2 participants