feat(plugin): Add new plugin ua-restriction for bot spider restriction #4587
@@ -0,0 +1,180 @@
--
-- Licensed to the Apache Software Foundation (ASF) under one or more
-- contributor license agreements.  See the NOTICE file distributed with
-- this work for additional information regarding copyright ownership.
-- The ASF licenses this file to You under the Apache License, Version 2.0
-- (the "License"); you may not use this file except in compliance with
-- the License.  You may obtain a copy of the License at
--
--     http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--
local ipairs = ipairs
local core = require("apisix.core")
local stringx = require('pl.stringx')
local type = type
local str_strip = stringx.strip
local re_find = ngx.re.find

local MATCH_NONE = 0
local MATCH_ALLOW = 1
local MATCH_DENY = 2
local MATCH_BOT = 3


local lrucache_useragent = core.lrucache.new({ ttl = 300, count = 1024 })


local schema = {
    type = "object",
    properties = {
        message = {
            type = "string",
            minLength = 1,
            maxLength = 1024,
            default = "Not allowed"
        },
        whitelist = {
            type = "array",
            minItems = 1
        },
        blacklist = {
            type = "array",
            minItems = 1
        },
    },
    additionalProperties = false,
}


local plugin_name = "bot-restriction"
Review discussion:

- The `bot-restriction` name is confusing. It just checks the UA. What about renaming it to `ua-restriction`?
- I do not think so: this plugin is for spider detection, and it includes the most common spider UAs.
- Real bot detection in the industry is not just spider detection and a UA check.
- The plugin is for detecting the common BaiduSpider, 360Spider, and some dev tools. We would have to use a professional security company's product for the feature you mentioned. I found a similar function in KrakenD.
- Don't believe that nonsense. This approach can only kick out script kiddies; for a real attacker, checking the UA is definitely not enough. That's it. A real bot-detection system should be as professional as those products, instead of just doing UA checks and declaring the problem solved. People will laugh at APISIX. We should provide a mechanism that professional security companies can use to build a gateway, not declare that we are a security gateway.
- Would it be more suitable to rename the plugin to `ua-restriction` and remove the hard-coded UA list?
- Of course, yes.
- Got it.

local _M = {
    version = 0.1,
    priority = 2999,
    name = plugin_name,
    schema = schema,
}


-- List taken from https://github.com/ua-parser/uap-core/blob/master/regexes.yaml
local well_known_bots = {
Review discussion:

- I think we should not hard-code the UA list, as it could not be updated in time. It would be better to provide a mechanism, not a tool, to check the UA.
- Most spider bot UAs contain "bot", "spider", or "crawler".
- But the upstream list will keep being updated, won't it? Better to require the user to choose their own list instead of shipping a stale one.
- Users can add entries to the whitelist or blacklist configuration to cover UAs not listed in our package.
- So why ship a stale list and ask the user to update it?
- We can support other plugins for client restriction by UA, IP, or other information. This plugin is just for users who do not want to add a bunch of UA regex rules themselves; it exists only to simplify usage.
    [[(Pingdom\.com_bot_version_)(\d+)\.(\d+)]],
    [[(facebookexternalhit)/(\d+)\.(\d+)]],
    [[Google.{0,50}/\+/web/snippet]],
    [[(NewRelicPinger)/(\d+)\.(\d+)]],
    [[\b(Boto3?|JetS3t|aws-(?:cli|sdk-(?:cpp|go|java|nodejs|ruby2?|dotnet-(?:\d{1,2}|c]]
        .. [[ore)))|s3fs)/(\d+)\.(\d+)(?:\.(\d+)|)]],
    [[ PTST/\d+(?:\.)?\d+$]],
    [[/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|T]]
        .. [[ailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?]],
    [[\b(008|Altresium|Argus|BaiduMobaider|BoardReader|DNSGroup|DataparkSearch|EDI|Goo]]
        .. [[dzer|Grub|INGRID|Infohelfer|LinkedInBot|LOOQ|Nutch|OgScrper|PathDefender|Peew|Po]]
        .. [[stPost|Steeler|Twitterbot|VSE|WebCrunch|WebZIP|Y!J-BR[A-Z]|YahooSeeker|envolk|sp]]
        .. [[roose|wminer)/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)]],
    [[(MSIE) (\d+)\.(\d+)([a-z]\d|[a-z]|);.{0,200} MSIECrawler]],
    [[(Google-HTTP-Java-Client|Apache-HttpClient|Go-http-client|scalaj-http|http%20cli]]
        .. [[ent|Python-urllib|HttpMonitor|TLSProber|WinHTTP|JNLP|okhttp|aihttp|reqwest|axios]]
        .. [[|unirest-(?:java|python|ruby|nodejs|php|net))(?:[ /](\d+)(?:\.(\d+)|)(?:\.(\d+)|]]
        .. [[)|)]],
    [[(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Jo]]
        .. [[b Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends C]]
        .. [[rawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|14]]
        .. [[70\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot]]
        .. [[-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|arch]]
        .. [[iver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bing]]
        .. [[bot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardRead]]
        .. [[er Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYe]]
        .. [[ti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo ]]
        .. [[HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConver]]
        .. [[a|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngin]]
        .. [[e|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Go]]
        .. [[oglebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|Hid]]
        .. [[denMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobil]]
        .. [[e|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchi]]
        .. [[ve|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Lingu]]
        .. [[ee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_]]
        .. [[bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:]]
        .. [[-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGato]]
        .. [[r[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|]]
        .. [[PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobo]]
        .. [[t|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scoot]]
        .. [[er|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimpleP]]
        .. [[ie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy]]
        .. [[|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tin]]
        .. [[y Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voy]]
        .. [[ager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|]]
        .. [[Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(]]
        .. [[?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|Yottaa]]
        .. [[Monitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)]]
        .. [[(?:\.(\d+)(?:\.(\d+)|)|)|)]],
    [[(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexe]]
        .. [[r|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50}))[/ ](\d+)(?:\.(\d+)(?:\.(\d+)]]
        .. [[|)|)]],
    [[(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexe]]
        .. [[r|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50})) (\d+)(?:\.(\d+)(?:\.(\d+)|)|]]
        .. [[)]],
    [[((?:[A-z0-9]{1,50}|[A-z\-]{1,50} ?|)(?: the |)(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]cr]]
        .. [[ape|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]{0,50})(?:(?:[ /]| v)(\d+)(?:\.(\d+)|)(?:\.(\d+]]
        .. [[)|)|)]],
}

local function match_user_agent(user_agent, conf)
    user_agent = str_strip(user_agent)
    if conf.whitelist then
        for _, rule in ipairs(conf.whitelist) do
            if re_find(user_agent, rule, "jo") then
                return MATCH_ALLOW
            end
        end
    end

    if conf.blacklist then
        for _, rule in ipairs(conf.blacklist) do
            if re_find(user_agent, rule, "jo") then
                return MATCH_DENY
            end
        end
    end

    for _, rule in ipairs(well_known_bots) do
        if re_find(user_agent, rule, "jo") then
            return MATCH_BOT
        end
    end

    return MATCH_NONE
end


function _M.check_schema(conf)
    local ok, err = core.schema.check(schema, conf)

    if not ok then
        return false, err
    end

    return true
end


function _M.access(conf, ctx)
    local user_agent = core.request.header(ctx, "User-Agent")

    if not user_agent then
        return
    end
    -- ignore multiple instances of request headers
    if type(user_agent) == "table" then
        return
Review discussion:

- Why ignore the UA here?
- This corner case is that the User-Agent becomes a table when the client sends multiple User-Agent headers. Almost no bot or HTTP client sends requests like this, so I think ignoring it is the better choice.
- So if they send an extra UA, the check can be bypassed? This is not a good idea, especially in an open source project.
- OK, I will check the table.
    end
    local match, err = lrucache_useragent(user_agent, conf, match_user_agent, user_agent, conf)
    if err then
        return
Review discussion:

- Better to log the err?
- OK.
    end

    if match > MATCH_ALLOW then
        return 403, { message = conf.message }
    end
end


return _M
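The multiple-header corner case raised in the review could be closed by checking every User-Agent value instead of skipping the request. The plugin itself is Lua; the following is only a hypothetical Python sketch of that fix, where `matcher` and `check_all` are illustrative names standing in for the plugin's matcher:

```python
def check_all(user_agent, matcher):
    """Check every User-Agent value when several headers were sent.

    Multiple User-Agent headers arrive as a list. Returning the highest
    match level means a deny (2) or bot (3) hit on ANY value overrides
    an allow (1), so sending extra headers cannot bypass the check.
    """
    if isinstance(user_agent, list):
        return max((matcher(ua) for ua in user_agent), default=0)
    return matcher(user_agent)

# Hypothetical matcher: flag anything containing "bot" as a deny (2).
demo = lambda ua: 2 if "bot" in ua else 0
print(check_all(["Mozilla/5.0", "evil-bot/1.0"], demo))  # 2
print(check_all("Mozilla/5.0", demo))                    # 0
```

Taking the maximum over all values is one possible policy; the key point is that none of the sent values escapes inspection.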
@@ -0,0 +1,130 @@
---
title: bot-restriction
---

<!--
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
-->

## Summary

- [**Name**](#name)
- [**Attributes**](#attributes)
- [**How To Enable**](#how-to-enable)
- [**Test Plugin**](#test-plugin)
- [**Disable Plugin**](#disable-plugin)
## Name

The `bot-restriction` plugin restricts access to a Service or Route based on the request's User-Agent header. It checks the header against a user-defined `whitelist`, a user-defined `blacklist`, and a built-in list of well-known bots.
## Attributes

| Name      | Type          | Requirement | Default      | Valid     | Description                                  |
| --------- | ------------- | ----------- | ------------ | --------- | -------------------------------------------- |
| whitelist | array[string] | optional    |              |           | List of User-Agent regexes to allow.         |
| blacklist | array[string] | optional    |              |           | List of User-Agent regexes to deny.          |
| message   | string        | optional    | Not allowed. | [1, 1024] | Message returned when a request is rejected. |

Both `whitelist` and `blacklist` are optional, and they work together in this order: whitelist -> blacklist -> default well-known User-Agent list.

The rejection message can be user-defined.
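The matching order can be illustrated with a short, standalone sketch. The plugin itself is Lua (using `ngx.re.find`); this Python mirrors only the documented precedence logic, and the rule strings below are illustrative examples:

```python
import re

MATCH_NONE, MATCH_ALLOW, MATCH_DENY, MATCH_BOT = 0, 1, 2, 3

def match_user_agent(ua, whitelist=(), blacklist=(), well_known=()):
    """Mirror the documented order: whitelist -> blacklist -> built-in list."""
    ua = ua.strip()
    for rule in whitelist:      # an allow rule short-circuits everything else
        if re.search(rule, ua):
            return MATCH_ALLOW
    for rule in blacklist:      # then user-defined deny rules
        if re.search(rule, ua):
            return MATCH_DENY
    for rule in well_known:     # finally the built-in bot list
        if re.search(rule, ua):
            return MATCH_BOT
    return MATCH_NONE

# A whitelist hit wins even when a blacklist rule would also match:
print(match_user_agent("my-bot1/1.0", ["my-bot1"], ["bot"]))  # 1
print(match_user_agent("Twitterspider/2.0", ["my-bot1"], [r"Twitterspider/\d+\.\d+"]))  # 2
```

Only a result above `MATCH_ALLOW` leads to a 403 in the plugin, so a whitelisted User-Agent is never blocked.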

## How To Enable

Create a Route or Service object and enable the `bot-restriction` plugin:

```shell
curl http://127.0.0.1:9080/apisix/admin/routes/1 -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
    "uri": "/index.html",
    "upstream": {
        "type": "roundrobin",
        "nodes": {
            "127.0.0.1:1980": 1
        }
    },
    "plugins": {
        "bot-restriction": {
            "whitelist": [
                "my-bot1",
                "(Baiduspider)/(\\d+)\\.(\\d+)"
            ],
            "blacklist": [
                "my-bot2",
                "(Twitterspider)/(\\d+)\\.(\\d+)"
            ]
        }
    }
}'
```

By default, `{"message":"Not allowed"}` is returned when a request is rejected. You can configure a custom message in the plugin section:

```json
"plugins": {
    "bot-restriction": {
        "blacklist": [
            "my-bot2",
            "(Twitterspider)/(\\d+)\\.(\\d+)"
        ],
        "message": "Do you want to do something bad?"
    }
}
```

## Test Plugin

A request with a normal User-Agent:

```shell
$ curl http://127.0.0.1:9080/index.html -i
HTTP/1.1 200 OK
...
```

A request with a bot User-Agent:

```shell
$ curl http://127.0.0.1:9080/index.html --header 'User-Agent: Twitterspider/2.0'
HTTP/1.1 403 Forbidden
```

## Disable Plugin

To disable the `bot-restriction` plugin, simply remove the corresponding JSON configuration from the plugin configuration. No restart is needed; the change takes effect immediately:

```shell
$ curl http://127.0.0.1:9080/apisix/admin/routes/1 -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
    "uri": "/index.html",
    "plugins": {},
    "upstream": {
        "type": "roundrobin",
        "nodes": {
            "39.97.63.215:80": 1
        }
    }
}'
```

The `bot-restriction` plugin is now disabled. The same method works for other plugins.
Review discussion:

- What about `allowlist` and `blocklist`? We should avoid using these sensitive words.
- OK.