Dynamically block IPs impersonating Googlebot via reverse lookup using fail2ban


Hello, I'm Munou.
In the previous post:
Investigating request IPs claiming to be Google bots, including abuse - SOULMINIGRIG

There, I confirmed that quite a few groups of IPs send requests while impersonating Googlebot.
So how can they be blocked? Ideally, fail2ban would detect them and run a script of my choosing, and the following issue gave me a hint.
apache-fakegooglebot: whitelist · Issue #1318 · fail2ban/fail2ban · GitHub

In short: match every request that claims to be Googlebot, then let a script called through ignorecommand decide which IPs are genuine and should be skipped.
This should work.

jail.local

Add the following.


[fake-googlebot]
enabled = true
filter = fake-googlebot
port = http,https
logpath = /var/log/nginx/access.log
findtime = 1w
maxretry = 1
bantime = 99999w
ignorecommand = /usr/local/bin/check_googlebot.sh <ip>
action = pf[name=fake-googlebot]

This jail needs a filter named fake-googlebot.
The shell script referenced by ignorecommand also has to be created.

filter.d/fake-googlebot.conf

Match broadly, as follows.

[Definition]
failregex = ^<HOST> - .*"(GET|POST|HEAD|PUT|DELETE|OPTIONS|PATCH) .*" \d+ \d+ ".*" ".*Googlebot.*"$
ignoreregex =
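
Before restarting fail2ban, it can help to check that the pattern really matches an access log line. The following is only a rough sketch, not fail2ban's own matching: it rewrites the failregex as a POSIX extended regex for grep -E, replacing <HOST> with a loose address pattern and \d with [0-9]. The function name matches_googlebot_ua and the sample log line are made up for illustration.

```shell
#!/bin/sh
# Approximation of the failregex for grep -E (assumption: nginx "combined" log format).
# <HOST> is replaced by a loose IPv4/IPv6 character class, \d by [0-9].
matches_googlebot_ua() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F.:]+ - .*"(GET|POST|HEAD|PUT|DELETE|OPTIONS|PATCH) .*" [0-9]+ [0-9]+ ".*" ".*Googlebot.*"$'
}

# Hypothetical sample line; 203.0.113.7 is a documentation-only address.
sample='203.0.113.7 - - [19/Apr/2026:01:58:18 +0900] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

if matches_googlebot_ua "$sample"; then
  echo "match"
else
  echo "no match"
fi
```

For a check against the real filter and the real log, fail2ban ships the fail2ban-regex tool, which takes a log file and a filter file as arguments.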

/usr/local/bin/check_googlebot.sh

fail2ban runs ignorecommand for each matched IP and looks at its exit status: 0 means the IP is ignored, and any non-zero status means it remains a ban target.
In other words, the script receives the IP address as an argument, and its exit status decides whether the IP is banned or skipped.

#!/bin/sh
IP="$1"
LOG="/var/log/check_googlebot.log"
# Reverse lookup
HOST=$(getent hosts "$IP" | awk '{print $2}' | head -n1)
if [ -z "$HOST" ]; then
  echo "[$(date)] DENY $IP: no PTR" >> "$LOG"
  exit 1
fi
# Check if it's a Google-related domain
case "$HOST" in
  *.googlebot.com|*.google.com)
    ;;
  *)
    echo "[$(date)] DENY $IP: invalid domain ($HOST)" >> "$LOG"
    exit 1
    ;;
esac
# Forward lookup: the hostname must resolve back to the original IP
if getent hosts "$HOST" | awk '{print $1}' | grep -Fxq "$IP"; then
  echo "[$(date)] ALLOW $IP: valid Googlebot ($HOST)" >> "$LOG"
  exit 0
else
  echo "[$(date)] DENY $IP: mismatch ($HOST)" >> "$LOG"
  exit 1
fi

I avoid the host command because it is not installed everywhere. On many distributions it ships in a separate package (bind-utils on RHEL-family systems, bind9-host on Debian-based ones), so I retrieve the PTR record with getent hosts instead, which is available wherever glibc is installed.
[SOLVED] Host command / Newbie Corner / Arch Linux Forums

Verification

Try running it and confirm that it returns 0 for a Google bot IP.

# sh /usr/local/bin/check_googlebot.sh 66.249.74.78
# echo $?
0

What about a different IP? Let's try entering my own server's IP.

# sh /usr/local/bin/check_googlebot.sh 163.44.113.145
# echo $?
1

It seems to be judging correctly.
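
The domain check can also be exercised without any DNS lookups. Below is a small sketch that wraps the same case pattern from check_googlebot.sh in a function; the name is_google_host is hypothetical, and googlebot.com.example.net is an invented hostname used to show that a lookalike with the Google name in the middle is rejected.

```shell
#!/bin/sh
# Same suffix test as the case statement in check_googlebot.sh,
# wrapped in a function (is_google_host is a hypothetical name).
is_google_host() {
  case "$1" in
    *.googlebot.com|*.google.com) return 0 ;;
    *) return 1 ;;
  esac
}

# The pattern must match the whole string, so only names that END in
# .googlebot.com or .google.com are accepted.
for h in crawl-66-249-74-78.googlebot.com \
         v163-44-113-145.v1i0.static.cnode.jp \
         googlebot.com.example.net; do
  if is_google_host "$h"; then
    echo "ALLOW $h"
  else
    echo "DENY $h"
  fi
done
```

Only the first hostname is allowed. The suffix test alone is not sufficient, which is why the script still performs the forward lookup: whoever controls the reverse DNS for an address can point its PTR record at any name.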

fail2ban

Restart fail2ban to apply the new jail and filter.

service fail2ban restart
fail2ban-client status 

Then check the jail status to confirm that genuine Google IPs have not been banned by mistake.

# fail2ban-client status fake-googlebot
Status for the jail: fake-googlebot
|- Filter
|  |- Currently failed: 0
|  |- Total failed:     0
|  `- File list:        /var/log/nginx/access.log
`- Actions
   |- Currently banned: 0
   |- Total banned:     0
   `- Banned IP list:

The log still contains an entry from when I ran the script against my own server's IP just before, but the other entries show that the Google bots were identified correctly.

# tail /var/log/check_googlebot.log 
[Sun Apr 19 01:58:18 JST 2026] ALLOW 66.249.74.65: valid Googlebot (crawl-66-249-74-65.googlebot.com)
[Sun Apr 19 01:58:18 JST 2026] ALLOW 66.249.74.78: valid Googlebot (crawl-66-249-74-78.googlebot.com)
[Sun Apr 19 01:58:18 JST 2026] ALLOW 66.249.74.64: valid Googlebot (crawl-66-249-74-64.googlebot.com)
[Sun Apr 19 01:58:18 JST 2026] ALLOW 66.249.74.64: valid Googlebot (crawl-66-249-74-64.googlebot.com)
[Sun Apr 19 01:58:19 JST 2026] ALLOW 66.249.74.64: valid Googlebot (crawl-66-249-74-64.googlebot.com)
[Sun Apr 19 01:58:19 JST 2026] ALLOW 66.249.74.78: valid Googlebot (crawl-66-249-74-78.googlebot.com)
[Sun Apr 19 01:58:19 JST 2026] ALLOW 66.249.74.64: valid Googlebot (crawl-66-249-74-64.googlebot.com)
[Sun Apr 19 01:58:19 JST 2026] ALLOW 66.249.74.78: valid Googlebot (crawl-66-249-74-78.googlebot.com)
[Sun Apr 19 03:56:30 JST 2026] ALLOW 66.249.74.78: valid Googlebot (crawl-66-249-74-78.googlebot.com)
[Sun Apr 19 03:58:18 JST 2026] DENY 163.44.113.145: invalid domain (v163-44-113-145.v1i0.static.cnode.jp)
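
Since the script is called for every matching request, the log grows quickly, and a per-IP summary is easier to read than tail output. Here is a minimal sketch, assuming the log keeps the "[date] VERDICT ip: reason" layout written by check_googlebot.sh; the function name summarize_verdicts is my own.

```shell
#!/bin/sh
# summarize_verdicts: read check_googlebot.sh log lines on stdin and
# print "COUNT VERDICT IP" for each distinct verdict/IP pair.
summarize_verdicts() {
  # Strip the leading "[date] " so the locale-dependent date format does not matter,
  # then count on the first two remaining fields.
  sed 's/^\[[^]]*] //' |
    awk '{ sub(/:$/, "", $2); n[$1 " " $2]++ }
         END { for (k in n) print n[k], k }' |
    sort -k2
}

# Demo with a few lines taken from the log above.
summarize_verdicts <<'EOF'
[Sun Apr 19 01:58:18 JST 2026] ALLOW 66.249.74.65: valid Googlebot (crawl-66-249-74-65.googlebot.com)
[Sun Apr 19 01:58:18 JST 2026] ALLOW 66.249.74.78: valid Googlebot (crawl-66-249-74-78.googlebot.com)
[Sun Apr 19 01:58:19 JST 2026] ALLOW 66.249.74.78: valid Googlebot (crawl-66-249-74-78.googlebot.com)
[Sun Apr 19 03:58:18 JST 2026] DENY 163.44.113.145: invalid domain (v163-44-113-145.v1i0.static.cnode.jp)
EOF
```

Against the real file it would be run as summarize_verdicts < /var/log/check_googlebot.log.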
