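Throttling crawler requests in Nginx
Hi everyone, 无能 here.
This is what I found when I happened to check the requests hitting the sites I host myself.
A startling number of bot requests
Basically, I use fail2ban to block HTTP DoS attacks at the IP level, and I have it set to drop traffic from banned addresses, so I don't usually look at the logs very often.
rhit itself can't break requests down by UA, and I was curious, so I forked it and implemented a per-UA view on the dev branch.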
https://github.com/haturatu/rhit/tree/dev
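For readers who just want a rough per-UA count without the fork, the idea can be sketched in a few lines of Python. This is not how rhit or the fork implements it; it is a minimal sketch that assumes nginx's default "combined" log format, and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: nginx's default "combined" log format, where the user agent is
# the last double-quoted field on each line.
UA_AT_END = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = UA_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines (documentation IP ranges), not taken from the real logs.
sample = [
    '203.0.113.7 - - [01/Apr/2026:10:00:00 +0900] "GET / HTTP/1.1" 200 512 "-" "curl/8.7.1"',
    '198.51.100.9 - - [01/Apr/2026:10:00:01 +0900] "GET /a HTTP/1.1" 200 900 "-" "pub-relay-prototype"',
    '203.0.113.7 - - [01/Apr/2026:10:00:02 +0900] "GET /b HTTP/1.1" 404 120 "-" "curl/8.7.1"',
]

# Print user agents from most to least frequent.
for user_agent, hits in count_user_agents(sample).most_common():
    print(f"{hits:>6}  {user_agent}")
```

To run it against a real log, pass an open file object instead of `sample`. Rotated, gzip-compressed logs would need `gzip.open` in text mode.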
Running it gave the following top 50:
# rhit -f ua -l 5
I've read 3 files in "/var/log/nginx"
1,848,993 hits and 11G from 2026/03/28 to 2026/04/02
2,025 user agents. 100 most frequent:
┌───┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────┬─────┬─────┬─────┐
│ # │user agent │ hits │bytes│days │trend│
├───┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┼─────┼─────┼─────┤
│ 1│meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) │486,092│ 4.5G│▇ ▄▆ │ │
│ 2│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 Edg/135.0.0.0 │143,018│ 663M│▄ ▇▆ │ │
│ 3│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 Edg/136.0.0.0 │142,880│ 665M│▄ ▇▆ │ │
│ 4│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 │142,562│ 661M│▄ ▇▆ │ │
│ 5│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 │142,454│ 648M│▄ ▇▆ │ │
│ 6│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 │142,284│ 636M│▄ ▇▆ │ │
│ 7│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 │142,279│ 643M│▄ ▇▆ │ │
│ 8│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 │141,777│ 669M│▄ ▇▆ │ │
│ 9│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 │141,209│ 652M│▄ ▇▆ │ │
│ 10│Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) │ 48,242│ 211M│▆ ▅▇ │ │
│ │Chrome/119.0.6045.214 Safari/537.36 │ │ │ │ │
│ 11│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; │ 47,859│ 252M│▂ ▄▇ │ ➚ │
│ │+http://www.apple.com/go/applebot) │ │ │ │ │
│ 12│Enjoy Relay 0.3.1 │ 45,217│ 236M│▆ ▇▇ │ │
│ 13│selective-relay/0.1.0 (https://hashtag-relay.dtp-mstdn.jp) │ 13,577│ 71M│▇ │➘ ➘ ➘│
│ 14│Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) │ 6,575│ 14M│▇ ▁▂ │ ➘ │
│ 15│curl/8.7.1 │ 4,062│ 7.4M│▇ ▇ │ ➚ │
│ 16│pub-relay-prototype │ 3,389│ 18M│▆ ▇▇ │ │
│ 17│Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot) │ 3,113│ 8.2M│▇ ▇▇ │ │
│ 18│Mozilla/5.0 (Android 16; Mobile; rv:132.0) Gecko/132.0 Firefox/132.0 │ 2,786│ 15M│▆ ▃▇ │ │
│ 19│Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/) │ 2,240│ 15M│▇ ▃ │ │
│ 20│Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Mobile Safari/537.36 │ 1,715│ 23M│▁ ▂▇ │ ➚ ➚ │
│ 21│Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; │ 1,545│ 11M│▇ ▃▅ │ │
│ │PetalBot;+https://webmaster.petalsearch.com/site/petalbot) │ │ │ │ │
│ 22│Mozilla/5.0 (compatible; SERankingBacklinksBot/1.0; +https://seranking.com/backlinks-crawler) │ 1,426│ 6.7M│▃ ▇ │➘ ➘ ➘│
│ 23│facebookexternalua │ 1,385│ 7.2M│▇ ▂▅ │ │
│ 24│Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.7680.164 Mobile │ 1,382│ 5.5M│ ▇ │➚ ➚ ➚│
│ │Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) │ │ │ │ │
│ 25│Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; │ 1,356│ 6.6M│▅ ▄▇ │ │
│ │spider-feedback@bytedance.com) │ │ │ │ │
│ 26│Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) │ 1,061│ 2.2M│▃ ▇▄ │ │
│ 27│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 │ 930│ 3.6M│▆ ▅▇ │ │
│ 28│Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; TikTokSpider; │ 925│ 2.1M│▆ ▅▇ │ │
│ │ttspider-feedback@tiktok.com) │ │ │ │ │
│ 29│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3 │ 880│ 1.0M│ ▇ │➘ ➘ ➘│
│ 30│Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) │ 645│ 9.4M│▇ │ ➘ ➘ │
│ 31│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36 │ 637│ 14M│▇ │➘ ➘ ➘│
│ 32│Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.7680.153 Mobile │ 616│ 2.7M│▇ ▇ │➘ ➘ ➘│
│ │Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) │ │ │ │ │
│ 33│Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 │ 527│ 3.2M│▆ ▇▇ │ │
│ │Safari/604.1 │ │ │ │ │
│ 34│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36 │ 449│ 4.8M│▄ ▇▆ │ │
│ 35│Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 │ 441│ 5.8M│▅ ▆▇ │ │
│ │Safari/537.36 │ │ │ │ │
│ 36│organisera.org - Mobilizon 5.1.0 │ 440│ 2.3M│▇ ▆▇ │ │
│ 37│python-requests/2.32.5 │ 390│ 227K│▇ ▄▇ │ │
│ 38│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36 │ 247│ 527K│ ▇ │➘ ➘ ➘│
│ 39│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; │ 242│ 189K│▇ ▃▂ │ ➘ ➘ │
│ │OAI-SearchBot/1.3; robots.txt; +https://openai.com/searchbot │ │ │ │ │
│ 40│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 │ 241│ 1.9M│▆ ▇▅ │ │
│ 41│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 │ 231│ 2.2M│▇ ▅▃ │ ➘ │
│ 42│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 │ 215│ 2.1M│▇ ▇▅ │ │
│ 43│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 │ 206│ 1.7M│▇ ▇▆ │ │
│ 44│Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) │ 192│ 1.5M│▆ ▄▇ │ │
│ 45│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47 │ 173│ 1.4M│▆ ▇▆ │ │
│ 46│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 │ 171│ 1.4M│▄ ▇▆ │ │
│ 47│Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 │ 168│ 1.2M│▇ ▆▄ │ ➘ │
│ 48│Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36, Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) │ 167│ 353K│▇ │➘ ➘ ➘│
│ │Chrome/120.0.0.0 Safari/537.36 │ │ │ │ │
│ 49│Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) │ 166│ 1.4M│▅ ▃▇ │ ➚ │
│ 50│Mozilla/5.0 (compatible; crawler) │ 161│ 750K│ ▅▇ │ ➚ │
Wow, Meta is showing up far too often...
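Well, it never tripped my DoS rules, so it is probably sending requests fairly slowly and politely, but it still bothered me. That said, I don't want to block bots outright, because I'd like the World Wide Web to keep working reasonably freely, so I decided to set a rate limit instead.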
Nginx
I wanted this to apply to almost everything, so the configuration looks like this:
diff --git a/nginx.conf b/nginx.conf
index 10803c3..42f9390 100644
--- a/nginx.conf
+++ b/nginx.conf
@@ -15,6 +15,44 @@ http {
server_tokens off;
default_type application/octet-stream;
+ # Rate limit for bots and link-preview crawlers
+ map $http_user_agent $is_bot {
+ default 0;
+ ~*bot 1;
+ ~*crawler 1;
+ ~*spider 1;
+ ~*facebookexternalhit 1;
+ ~*slackbot 1;
+ ~*discordbot 1;
+ ~*twitterbot 1;
+ ~*linkedinbot 1;
+ ~*embedly 1;
+ ~*quora 1;
+ ~*skypeuripreview 1;
+ ~*whatsapp 1;
+ ~*telegrambot 1;
+ ~*applebot 1;
+ ~*pingdom 1;
+ ~*uptimerobot 1;
+ }
+
+ # stg.api.1btc.love is used for verification, so bots are not rate limited there either
+ # If the key is an empty string, limit_req_zone does not count the request
+ map $server_name $bot_limit_host_key {
+ stg.api.1btc.love "";
+ default $binary_remote_addr;
+ }
+
+ # Use a per-IP key only for bots; human users get an empty string and are not limited
+ map $is_bot $bot_limit_key {
+ 0 "";
+ 1 $bot_limit_host_key;
+ }
+
+ limit_req_zone $bot_limit_key zone=bot:10m rate=1r/s;
+ limit_req_status 429;
+ limit_req zone=bot burst=5 nodelay;
+
sendfile on;
#Enables or disables buffering of responses from the proxied server.
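proxy_buffering on;
Putting this in the top-level http block also let me set up the exclusion rule in the same place. With this, each bot IP is held to a sustained 1 request per second. Because of burst=5 with nodelay, up to 5 extra requests are let through immediately, and anything beyond that gets a 429.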
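To make the behaviour of the two chained maps and of `burst=5 nodelay` easier to reason about, here is a small Python model of the configuration above. It is a simplified sketch, not nginx's actual code: the pattern list, the exempt host, the rate and the burst are copied from the diff, while `example.org`, the IP addresses, and the names `is_bot`, `limit_key` and `BotLimiter` are made up for illustration. Plain strings stand in for `$binary_remote_addr`.

```python
import re

# Patterns copied from the `map $http_user_agent $is_bot` block (~* = case-insensitive).
BOT_PATTERNS = [
    "bot", "crawler", "spider", "facebookexternalhit", "slackbot", "discordbot",
    "twitterbot", "linkedinbot", "embedly", "quora", "skypeuripreview",
    "whatsapp", "telegrambot", "applebot", "pingdom", "uptimerobot",
]
EXEMPT_HOST = "stg.api.1btc.love"


def is_bot(user_agent):
    """True if any pattern matches anywhere in the user agent string."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in BOT_PATTERNS)


def limit_key(user_agent, server_name, remote_addr):
    """Mirror the two chained maps: an empty string means 'not rate limited'."""
    if not is_bot(user_agent) or server_name == EXEMPT_HOST:
        return ""
    return remote_addr


class BotLimiter:
    """Simplified leaky-bucket model of limit_req with burst and nodelay."""

    def __init__(self, rate=1.0, burst=5):
        self.rate = rate      # requests per second
        self.burst = burst    # extra requests tolerated above the rate
        self.state = {}       # key -> (excess, time of last accepted request)

    def allow(self, key, now):
        if key == "":
            return True       # empty keys are not accounted
        if key not in self.state:
            self.state[key] = (0.0, now)
            return True
        excess, last = self.state[key]
        # The bucket drains at `rate` per second; each request adds 1.
        excess = max(0.0, excess - self.rate * (now - last) + 1.0)
        if excess > self.burst:
            return False      # would be answered with 429; state is left unchanged
        self.state[key] = (excess, now)
        return True


limiter = BotLimiter()
key = limit_key("Mozilla/5.0 (compatible; crawler)", "example.org", "203.0.113.7")
# Seven requests arriving at the same instant from one bot IP.
results = [limiter.allow(key, now=0.0) for _ in range(7)]
print(results)
```

In this model the first request plus five burst requests pass and the seventh is rejected, after which one more request is accepted per second. Note that `meta-externalagent` is only caught because its UA string contains a URL ending in `crawler`, and that `facebookexternalua` and the browser-like user agents in rows 2 to 9 of the table would not match any pattern, so they would not be limited by this map.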
That's all.