Adds detection for various bots #7589

Merged on Feb 15, 2024 (14 commits)
151 changes: 150 additions & 1 deletion Tests/fixtures/bots.yml
@@ -4066,7 +4066,10 @@
bot:
name: Project Resonance
category: Crawler
- url: http://project-resonance.com
+ url: https://project-resonance.com/
+ producer:
+ name: RedHunt Labs Limited
+ url: https://redhuntlabs.com/
-
user_agent: Mozilla/5.0 (compatible; DataXu/1.0; +http://dataxu.com)
bot:
@@ -6686,3 +6689,149 @@
user_agent: Zeus
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (WhatsMyIP.org GeoIP_Lookups) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org HTTP_Compression_Test) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org HTTP_Headers) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org PageRank_WebStats_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org Text_to_Code_Ratio_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org MAC_Address_Lookup_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org URL_Shortener_Preview_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org Random_Website_Loader) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: keycdn-tools/perf
bot:
name: KeyCDN Tools
category: Service Agent
url: https://tools.keycdn.com/
producer:
name: proinity LLC
url: https://www.keycdn.com/
-
user_agent: keycdn-tools/br
bot:
name: KeyCDN Tools
category: Service Agent
url: https://tools.keycdn.com/
producer:
name: proinity LLC
url: https://www.keycdn.com/
-
user_agent: keycdn-tools/h2
bot:
name: KeyCDN Tools
category: Service Agent
url: https://tools.keycdn.com/
producer:
name: proinity LLC
url: https://www.keycdn.com/
-
user_agent: Mozilla/5.0 (compatible; AmazonAdBot/1.0; +https://adbot.amazon.com)
bot:
name: Amazon AdBot
category: Crawler
url: https://adbot.amazon.com/
producer:
name: Amazon.com, Inc.
url: https://www.amazon.com/
-
user_agent: SenutoBot/1.0 (compatible; SenutoBot/1.0; +https://www.senuto.com/)
bot:
name: Senuto
category: Crawler
url: https://www.senuto.com/
producer:
name: Senuto Sp. z o.o.
url: https://www.senuto.com/
-
user_agent: Automattic Analytics Crawler/0.2; http://wordpress.com/crawler/
bot:
name: Automattic Analytics
category: Crawler
url: https://wordpress.com/crawler/
producer:
name: Wordpress.org
url: https://wordpress.org/
-
user_agent: IDG/EU (http://spaziodati.eu/)
bot:
name: SpazioDati
category: Crawler
url: https://www.spaziodati.eu/
producer:
name: SpazioDati s.r.l.
url: https://www.spaziodati.eu/
-
user_agent: GozleBot; http://gozle.com.tm
bot:
name: Gozle
category: Crawler
url: https://gozle.com.tm/en/blog/post/1
producer:
name: Doly Horjun HJ
url: https://gozle.com.tm/
-
user_agent: Quantcastbot/2.0 (+http://www.quantcast.com/bot)
bot:
name: Quantcast
category: Crawler
url: https://www.quantcast.com/bot/
producer:
name: Quantcast Corp.
url: https://www.quantcast.com/
-
user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/102.0.5005.182 Safari/537.36 FontRadar
bot:
name: FontRadar
category: Crawler
url: https://www.fontradar.com/
producer:
name: EMDASH SAS
url: https://www.fontradar.com/
-
user_agent: survey-security-dot-txt/0.1
bot:
name: Generic Bot
-
user_agent: WebAuthn Adoption Study (Contact [email protected])
bot:
name: Generic Bot
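
Review note: each fixture above pairs a user agent with the bot entry it is expected to resolve to. A minimal sketch of that check follows, using a handful of patterns copied from regexes/bots.yml in this PR. It assumes rules are tried in file order, that the first match wins, and that matching is case-insensitive; the `RULES` list and the `detect` helper are hypothetical names for illustration, not the project's API.

```python
import re

# Patterns copied from regexes/bots.yml in this PR, kept in file order.
# Assumption: first match wins and matching is case-insensitive.
RULES = [
    (r"AmazonAdBot/[\d.]+", "Amazon AdBot"),
    (r"Automattic Analytics Crawler/[\d.]+", "Automattic Analytics"),
    (r"WordPress", "WordPress"),
    (r"keycdn-tools:", "KeyCDN Tools"),
    (r"keycdn-tools/", "KeyCDN Tools"),
    (r"WhatsMyIP\.org", "WhatsMyIP.org"),
    (r"spaziodati", "SpazioDati"),
]


def detect(user_agent):
    """Return the name of the first rule whose pattern matches, else None."""
    for pattern, name in RULES:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


# The Automattic user agent also contains "wordpress.com", so under these
# assumptions its rule has to sit before the broader 'WordPress' rule.
print(detect("Automattic Analytics Crawler/0.2; http://wordpress.com/crawler/"))
print(detect("IDG/EU (http://spaziodati.eu/)"))
print(detect("keycdn-tools/h2"))
```

Under the same assumptions, moving the 'WordPress' rule ahead of the Automattic one would change the first result, which is one reason the new entry is placed where it is in the file.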
80 changes: 76 additions & 4 deletions regexes/bots.yml
@@ -85,14 +85,22 @@
name: 'Alexa Internet'
url: 'https://www.alexa.com'

- - regex: 'Amazonbot'
+ - regex: 'Amazonbot/[\d.]+'
name: 'Amazon Bot'
category: 'Crawler'
url: 'https://developer.amazon.com/support/amazonbot'
producer:
name: 'Amazon.com, Inc.'
url: 'https://www.amazon.com/'

- regex: 'AmazonAdBot/[\d.]+'
name: 'Amazon AdBot'
category: 'Crawler'
url: 'https://adbot.amazon.com/'
producer:
name: 'Amazon.com, Inc.'
url: 'https://www.amazon.com/'

- regex: 'Amazon[ -]Route ?53[ -]Health[ -]Check[ -]Service'
name: 'Amazon Route53 Health Check'
category: 'Service Agent'
@@ -1784,6 +1792,14 @@
name: 'WPBeginner, LLC'
url: 'https://www.wpbeginner.com/'

- regex: 'Automattic Analytics Crawler/[\d.]+'
name: 'Automattic Analytics'
category: 'Crawler'
url: 'https://wordpress.com/crawler/'
producer:
name: 'Wordpress.org'
url: 'https://wordpress.org/'

- regex: 'WordPress'
name: 'WordPress'
category: 'Service Agent'
@@ -2441,7 +2457,10 @@
- regex: 'Project-Resonance'
name: 'Project Resonance'
category: 'Crawler'
- url: 'http://project-resonance.com'
+ url: 'https://project-resonance.com/'
+ producer:
+ name: 'RedHunt Labs Limited'
+ url: 'https://redhuntlabs.com/'

- regex: 'DataXu/[\d.]+'
name: 'DataXu'
@@ -3909,11 +3928,19 @@
name: 'Shareaholic, Inc.'
url: 'https://www.shareaholic.com/'

- - regex: 'keycdn-tools'
+ - regex: 'keycdn-tools:'
name: 'KeyCDN Tools'
category: 'Service Agent'
url: 'https://tools.keycdn.com/geo'

- regex: 'keycdn-tools/'
name: 'KeyCDN Tools'
category: 'Service Agent'
url: 'https://tools.keycdn.com/'
producer:
name: 'proinity LLC'
url: 'https://www.keycdn.com/'

- regex: 'Arquivo-web-crawler'
name: 'Arquivo.pt'
category: 'Crawler'
@@ -3922,9 +3949,54 @@
name: 'FCT|FCCN'
url: 'https://www.fct.pt/'

- regex: 'WhatsMyIP\.org'
name: 'WhatsMyIP.org'
category: 'Service Agent'
url: 'https://www.whatsmyip.org/ua/'

- regex: 'SenutoBot/[\d.]+'
name: 'Senuto'
category: 'Crawler'
url: 'https://www.senuto.com/'
producer:
name: 'Senuto Sp. z o.o.'
url: 'https://www.senuto.com/'

- regex: 'spaziodati'
name: 'SpazioDati'
category: 'Crawler'
url: 'https://www.spaziodati.eu/'
producer:
name: 'SpazioDati s.r.l.'
url: 'https://www.spaziodati.eu/'

- regex: 'GozleBot'
name: 'Gozle'
category: 'Crawler'
url: 'https://gozle.com.tm/en/blog/post/1'
producer:
name: 'Doly Horjun HJ'
url: 'https://gozle.com.tm/'

- regex: 'Quantcastbot/[\d.]+'
name: 'Quantcast'
category: 'Crawler'
url: 'https://www.quantcast.com/bot/'
producer:
name: 'Quantcast Corp.'
url: 'https://www.quantcast.com/'

- regex: 'FontRadar'
name: 'FontRadar'
category: 'Crawler'
url: 'https://www.fontradar.com/'
producer:
name: 'EMDASH SAS'
url: 'https://www.fontradar.com/'

# Generic detections
- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|tweetedtimes\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|daumoa,damoa,daum,daumos,duamoa,duam,duamos|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|kirkland-signature|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
name: 'Generic Bot'

- - regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|monitor|project(?!or)|research|resolver|robots|scraper|spider|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
+ - regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|monitor|project(?!or)|research|resolver|robots|scraper|security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
name: 'Generic Bot'
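
Review note: the generic rule gains two keywords, `security` and `study`, which is what lets the two new 'Generic Bot' fixtures (`survey-security-dot-txt/0.1` and the WebAuthn Adoption Study agent) resolve. The sketch below isolates just those two keywords with the same prefix and trailing boundary as the full rule. It is a simplified stand-in, not the full pattern: the full rule's lookbehind has alternatives of different widths, which Python's `re` module rejects. Case-insensitive matching is assumed, and `NEW_KEYWORDS` and `is_generic_bot` are hypothetical names.

```python
import re

# Simplified stand-in for the generic rule: only the two keywords added here,
# with the same leading character class and trailing boundary.
# Assumption: patterns are applied case-insensitively.
NEW_KEYWORDS = re.compile(
    r"[a-z0-9_-]*(?:security|study)(?:[^a-z]|$)",
    re.IGNORECASE,
)


def is_generic_bot(user_agent):
    """True if the user agent contains one of the new keywords as a whole token."""
    return NEW_KEYWORDS.search(user_agent) is not None


for ua in (
    "survey-security-dot-txt/0.1",
    "WebAuthn Adoption Study",
    "studying",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
):
    print(ua, "->", is_generic_bot(ua))
```

The trailing `(?:[^a-z]|$)` keeps a keyword from matching inside a longer word such as `studying`. Both keywords are still broad, so any ordinary user agent that carries `security` or `study` as a standalone token would also be classified as a generic bot under this rule.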