Adds detection for various bots #7739

Merged on Aug 1, 2024 (35 commits)
Commits
a5abb11
Fix user agent
liviuconcioiu Jul 14, 2024
31b781e
Add another user agent for Qwantify
liviuconcioiu Jul 14, 2024
c18b7db
Add test for PagePeeker
liviuconcioiu Jul 14, 2024
7c17088
Add another test for SemrushBot
liviuconcioiu Jul 14, 2024
823b2f9
Improves DuckDuckBot
liviuconcioiu Jul 14, 2024
39e978e
Adds detection for DuckAssistBot
liviuconcioiu Jul 14, 2024
f1fe175
Adds detection for RedekenBot
liviuconcioiu Jul 14, 2024
fbe57d6
Adds detection for semaltbot
liviuconcioiu Jul 14, 2024
2e285f0
Adds detection for MakeMerryBot
liviuconcioiu Jul 14, 2024
1d31b7b
Adds detection for Timpibot
liviuconcioiu Jul 14, 2024
6ee0dca
Add generic bot test
liviuconcioiu Jul 14, 2024
2e7dcc8
Adds detection for ValidBot
liviuconcioiu Jul 14, 2024
51416c9
Adds detection for NameProtect
liviuconcioiu Jul 14, 2024
a58e324
Change name
liviuconcioiu Jul 14, 2024
11c2976
Adds detection for CLASSLA-web
liviuconcioiu Jul 14, 2024
ec1f769
Add generic bot test
liviuconcioiu Jul 14, 2024
0e28a35
Improves detection for generic bots
liviuconcioiu Jul 14, 2024
0dd9487
Move heritrix at the bottom
liviuconcioiu Jul 14, 2024
d077d83
Fix Arquivo.pt test
liviuconcioiu Jul 14, 2024
1322c70
Adds detection for Domain Codex
liviuconcioiu Jul 14, 2024
25dbfd2
Adds detection for Swisscows Favicons
liviuconcioiu Jul 14, 2024
7d4dd68
Adds detection for leak.info
liviuconcioiu Jul 14, 2024
43da177
Adds detection for Workona
liviuconcioiu Jul 14, 2024
6429687
Adds detection for Bloglines
liviuconcioiu Jul 14, 2024
88ba016
Improves detection for generic bots
liviuconcioiu Jul 14, 2024
3c34a6a
Merge branch 'master' into bots
liviuconcioiu Jul 17, 2024
344c042
Adds detection for Marginalia
liviuconcioiu Jul 17, 2024
bf4fb69
Adds detection for VU Server Health Scanner
liviuconcioiu Jul 18, 2024
f9a3abe
Improves detection for generic bots
liviuconcioiu Jul 18, 2024
fa2db24
Improves detection for generic bots
liviuconcioiu Jul 18, 2024
08c891a
Improves detection for generic bots
liviuconcioiu Jul 18, 2024
c448dec
Adds detection for Functionize
liviuconcioiu Jul 18, 2024
c50af03
Remove from apps
liviuconcioiu Jul 18, 2024
703e81b
Adds detection for Prerender
liviuconcioiu Jul 18, 2024
91a2f95
Merge branch 'master' into bots
sanchezzzhak Aug 1, 2024
6 changes: 0 additions & 6 deletions Tests/Parser/Client/fixtures/mobile_app.yml
@@ -2057,12 +2057,6 @@
     type: mobile app
     name: Teams
     version: 24004.1304.2655.7488
--
-  user_agent: Report Runner
-  client:
-    type: mobile app
-    name: Report Runner
-    version: ""
 -
   user_agent: Mozilla/5.0 (iPhone; CPU iPhone OS 15_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Zalo iOS/448 ZaloTheme/light ZaloLanguage/en
   client:
251 changes: 242 additions & 9 deletions Tests/fixtures/bots.yml
@@ -831,18 +831,27 @@
 -
   user_agent: DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
   bot:
-    name: DuckDuckGo Bot
+    name: DuckDuckBot
     category: Search bot
-    url: https://duckduckgo.com/duckduckbot
+    url: https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
     producer:
       name: DuckDuckGo
       url: https://duckduckgo.com/
 -
   user_agent: Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)
   bot:
-    name: DuckDuckGo Bot
+    name: DuckDuckBot
     category: Search bot
-    url: https://duckduckgo.com/duckduckbot
+    url: https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
+    producer:
+      name: DuckDuckGo
+      url: https://duckduckgo.com/
+-
+  user_agent: DuckAssistBot/1.1; (+http://duckduckgo.com/duckassistbot.html)
+  bot:
+    name: DuckAssistBot
+    category: Search bot
+    url: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
     producer:
       name: DuckDuckGo
       url: https://duckduckgo.com/
@@ -2475,7 +2484,16 @@
       name: Quora
       url: http://www.quora.com
 -
-  user_agent: 'Mozilla/5.0 (compatible; Qwantify/2.2w; +https://www.qwant.com/)/*'
+  user_agent: Mozilla/5.0 (compatible; Qwantify/2.2w; +https://www.qwant.com/)
+  bot:
+    name: Qwantify
+    category: Crawler
+    url: https://www.qwant.com/
+    producer:
+      name: Qwant Corporation
+      url: https://www.qwant.com/
+-
+  user_agent: Mozilla/5.0 (compatible; Qwantify-prod34997/1.0; +https://help.qwant.com/bot/)
   bot:
     name: Qwantify
     category: Crawler
@@ -5063,6 +5081,15 @@
     producer:
       name: Jožef Stefan Institute
       url: https://www.ijs.si/ijsw/JSI
+-
+  user_agent: Mozilla/5.0 (compatible; CLASSLA-web; +https://www.clarin.si/info/classla-web-crawler/)
+  bot:
+    name: CLASSLA-web
+    category: Crawler
+    url: https://www.clarin.si/info/classla-web-crawler/
+    producer:
+      name: Jožef Stefan Institute
+      url: https://www.ijs.si/ijsw/JSI
 -
   user_agent: "Electronic Frontier Foundation's Do Not Track Verifier (for questions or concerns email [email protected])"
   bot:
@@ -6705,12 +6732,12 @@
 -
   user_agent: Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling)
   bot:
-    name: Heritrix
+    name: Arquivo.pt
     category: Crawler
-    url: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
+    url: https://sobre.arquivo.pt/en/help/crawling-and-archiving-web-content/
     producer:
-      name: The Internet Archive
-      url: https://archive.org
+      name: FCT|FCCN
+      url: https://www.fct.pt/
 -
   user_agent: Arquivo-web-crawler (compatible; brozzler/1.5 +https://arquivo.pt/faq-crawling)
   bot:
@@ -7803,3 +7830,209 @@
     producer:
       name: Meins und Vogel GmbH
       url: https://muv.com/
+-
+  user_agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36 (compatible; PagePeeker/3.0; +https://pagepeeker.com/robots/)
+  bot:
+    name: PagePeeker
+    category: Crawler
+    url: https://pagepeeker.com/robots/
+    producer:
+      name: PAGEPEEKER SRL
+      url: https://pagepeeker.com/
+-
+  user_agent: Mozilla/5.0 (compatible; SemrushBot-SWA/0.1; +http://www.semrush.com/bot.html)
+  bot:
+    name: SemrushBot
+    category: Crawler
+    url: https://www.semrush.com/bot/
+    producer:
+      name: Semrush Inc.
+      url: https://www.semrush.com/
+-
+  user_agent: Mozilla/5.0 (compatible; RedekenBot/0.1; +https://www.redeken.com/bot/)
+  bot:
+    name: RedekenBot
+    category: Crawler
+    url: https://www.redeken.com/en/help/bot.html
+    producer:
+      name: Redeken
+      url: https://www.redeken.com/
+-
+  user_agent: semaltbot/0.1 (+http://semalt.net)
+  bot:
+    name: semaltbot
+    category: Crawler
+    url: https://semalt.net/
+    producer:
+      name: Semalt LP
+      url: https://semalt.net/
+-
+  user_agent: Mozilla/5.0 (compatible; MakeMerryBot/1.0; +https://makemerry.app/bots)
+  bot:
+    name: MakeMerryBot
+    category: Crawler
+    url: https://makemerry.app/bots
+-
+  user_agent: Timpibot/0.9 (+http://www.timpi.io)
+  bot:
+    name: Timpibot
+    category: Crawler
+    url: https://timpi.io/
+    producer:
+      name: Timpi Inc.
+      url: https://timpi.io/
+-
+  user_agent: Mozilla/5.0 (compatible; Timpibot/0.8; +http://www.timpi.io)
+  bot:
+    name: Timpibot
+    category: Crawler
+    url: https://timpi.io/
+    producer:
+      name: Timpi Inc.
+      url: https://timpi.io/
+-
+  user_agent: 'Tublm.com/Bot/fubpdfdotcom/Bot/Bot -❤️- +https://tublm.com/game/2048_merge'
+  bot:
+    name: Generic Bot
+-
+  user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15 (compatible; Validbot; +https://www.validbot.com)
+  bot:
+    name: ValidBot
+    category: Crawler
+    url: https://www.validbot.com/
+    producer:
+      name: Jake Olefsky LLC
+      url: https://www.validbot.com/
+-
+  user_agent: NPBot
+  bot:
+    name: NameProtectBot
+    category: Crawler
+    url: https://www.cscglobal.com/cscglobal/home/
+    producer:
+      name: NameProtect, Inc.
+      url: https://www.cscglobal.com/
+-
+  user_agent: Mozilla/5.0 (compatible; CuriousCatgirl Research; +https://curiouscatgirl.cynthia.dev)
+  bot:
+    name: Generic Bot
+-
+  user_agent: xx032_bo9vs83_2a
+  bot:
+    name: Generic Bot
+-
+  user_agent: Mozilla/5.0 (compatible; heritrix/3.3.0-SNAPSHOT-20160721-2308 +https://www.domaincodex.com)
+  bot:
+    name: Domain Codex
+    category: Crawler
+    url: https://www.domaincodex.com/
+    producer:
+      name: Erie Data Systems, LLC
+      url: https://www.eriedatasys.com/
+-
+  user_agent: Swisscows Favicons
+  bot:
+    name: Swisscows Favicons
+    category: Crawler
+    url: https://swisscows.com/
+    producer:
+      name: Swisscows AG
+      url: https://swisscows.com/
+-
+  user_agent: Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html)
+  bot:
+    name: leak.info
+    category: Crawler
+    url: http://www.leak.info/
+-
+  user_agent: workona-favicon-service/1.0.0
+  bot:
+    name: Workona
+    category: Crawler
+    url: https://workona.com/
+    producer:
+      name: Workona, Inc.
+      url: https://workona.com/
+-
+  user_agent: Bloglines/3.1 (http://www.bloglines.com)
+  bot:
+    name: Bloglines
+    category: Crawler
+    url: https://web.archive.org/web/20140309033202/http://www.bloglines.com/
+    producer:
+      name: Reply!, Inc.
+      url: https://www.reply.com/
+-
+  user_agent: 'shadowforce.io - sslshed/0.1'
+  bot:
+    name: Generic Bot
+-
+  user_agent: search.marginalia.nu
+  bot:
+    name: Marginalia
+    category: Crawler
+    url: https://www.marginalia.nu/marginalia-search/for-webmasters/
+    producer:
+      name: Marginalia
+      url: https://www.marginalia.nu/
+-
+  user_agent: Mozilla/5.0 (compatible;vu-server-health-scanner/1.0;https://130.37.198.75/index.html)
+  bot:
+    name: VU Server Health Scanner
+    category: Security Checker
+    url: https://130.37.198.75/index.html
+    producer:
+      name: VU Amsterdam
+      url: https://vu.nl/en
+-
+  user_agent: Searcherxweb
+  bot:
+    name: Generic Bot
+-
+  user_agent: Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion
+  bot:
+    name: Generic Bot
+-
+  user_agent: Report Runner
+  bot:
+    name: Generic Bot
+-
+  user_agent: Node.js
+  bot:
+    name: Generic Bot
+-
+  user_agent: Mozilla/5.0 (X11; Windows x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Functionize
+  bot:
+    name: Functionize
+    category: Crawler
+    url: https://www.functionize.com/
+    producer:
+      name: Functionize, Inc.
+      url: https://www.functionize.com/
+-
+  user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/W.X.Y.Z Safari/537.36 Prerender (+https://github.com/prerender/prerender)
+  bot:
+    name: Prerender
+    category: Crawler
+    url: https://docs.prerender.io/docs/33-overview-of-prerender-crawlers
+    producer:
+      name: saas.group Inc.
+      url: https://saas.group/
+-
+  user_agent: Mozilla/5.0 (Linux; Android 11; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 Prerender (+https://github.com/prerender/prerender)
+  bot:
+    name: Prerender
+    category: Crawler
+    url: https://docs.prerender.io/docs/33-overview-of-prerender-crawlers
+    producer:
+      name: saas.group Inc.
+      url: https://saas.group/
+-
+  user_agent: Prerender (+https://github.com/prerender/prerender)
+  bot:
+    name: Prerender
+    category: Crawler
+    url: https://docs.prerender.io/docs/33-overview-of-prerender-crawlers
+    producer:
+      name: saas.group Inc.
+      url: https://saas.group/