While solving the recent search engine spoofing mystery, I came across two excellent examples of spoofed search engine bots. This article uses the examples to explain how to identify any questionable bots hitting your site.
Spoofed Search Engine Bot #1
The first example I have for you today was reported like this in my site’s access log (note that the requested domain has been changed to example.com):
TIME: February 20th 2016, 12:29am
REQUEST: http://example.com/info.php.suspected_
SITE: http://example.com/
REFERRER: http://www.googlebot.com/bot.html
QUERY STRING: undefined
REMOTE ADDRESS: 18.104.22.168
PROXY ADDRESS: 22.214.171.124
HOST: 126.96.36.199.broad.pt.fj.dynamic.163data.com.cn
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Breakdown: Here we have a bot that reports itself as Googlebot:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
According to my list of user agents for top search engines, this is one of the actual user agents used by Google. So if you were paying attention only to the user agent, the request might seem legit. But then again, look at the request:
http://example.com/info.php.suspected_
That may be something that Google is looking for, but it’s doubtful; it looks more like something that a bad bot would want to find. Digging a little deeper into the logged data, we see that the referrer also looks legit, at first glance anyway:
http://www.googlebot.com/bot.html
If you understand what a referrer actually is, then you have to question why it’s reported this way. It’s as if the bot is trying to convince you that the request was the first one made when Googlebot clocked in this morning. Yep, boot up and head over to http://example.com/info.php.suspected_ first thing. Riiight.
So what is the dead giveaway that this search-engine bot is spoofed? After all, from the user agent and referrer, it looks like the real deal. And the IP address doesn’t really say anything without doing an actual lookup, and who is going to bother with that? No, the real key to identifying this request as bogus is the reported host name:
126.96.36.199.broad.pt.fj.dynamic.163data.com.cn
Yeah, that doesn’t look like anything Google would be using. It’s a Chinese TLD, after all. To verify the illegitimacy of this bot, we can do a forward-reverse DNS lookup (resolve the IP address to a host name, then resolve that host name back to an IP address) to get the following results:
Details of 188.8.131.52
IP Address : 184.108.40.206
Location : China (95% accuracy)
Host Name : 220.127.116.11.broad.pt.fj.dynamic.163data.com.cn
Bingo. It’s a fake bot. Nice try, but I think I will ban you using BBQ Pro. Next...
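If you would rather script the forward-reverse lookup than do it by hand, here is a minimal Python sketch. It assumes that genuine Googlebot host names end in googlebot.com or google.com, which are the suffixes Google documents. The function name is_verified_googlebot and its two resolver parameters are hypothetical, made up for this example, and gethostbyname_ex only handles IPv4 addresses.

```python
import socket

# Assumption: genuine Googlebot reverse DNS names end in one of these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP that claims to be Googlebot."""
    try:
        # Step 1: reverse lookup (IP address -> host name).
        host = reverse_lookup(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no usable reverse DNS record
    # Step 2: the host name has to belong to a Google domain.
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # Step 3: forward lookup (host name -> IP addresses).
        addresses = forward_lookup(host)[2]
    except OSError:
        return False
    # Step 4: the forward lookup has to lead back to the original IP.
    return ip in addresses
```

Pass in the remote address from the log entry. The resolvers are parameters only so that the logic can be tried without touching the network; by default the function uses the standard socket lookups. A host name like the one logged above fails at step 2, before any forward lookup is attempted.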
Spoofed Search Engine Bot #2
For the next example of a spoofed search-engine bot, we see the following report in our server’s access logs:
TIME: February 10th 2016, 10:52pm
REQUEST: http://example.com/wp-content/plugins/Login-wall-etgFB/login_wall.php?login=cmd&z3=aW5mb3MucGhw&z4=L3dwLWNvbnRlbnQvcGx1Z2lucy8%253d
SITE: http://example.com/
REFERRER: example.com
QUERY STRING: login=cmd&z3=aW5mb3MucGhw&z4=L3dwLWNvbnRlbnQvcGx1Z2lucy8%253d
REMOTE ADDRESS: 18.104.22.168
PROXY ADDRESS: 22.214.171.124
HOST: 195-154-194-111.rev.poneytelecom.eu
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This is pretty much the same sort of deal as before, only this time the spoofed request is coming from a well-known (and obnoxious) proxy/spam server:
195-154-194-111.rev.poneytelecom.eu
So in this case, the user agent reports as legit Googlebot, but there are two giant red flags:
- The requested URI is typical of an exploit scan
- And of course the host name isn’t something associated with Google
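To see why the first red flag looks like an exploit scan, it can help to decode the query string. The z3 and z4 values look like Base64, and the trailing %253d is a doubly percent-encoded "=" sign. Here is a small Python sketch, assuming the parameters really are Base64-encoded text; the function name decode_params is hypothetical.

```python
import base64
from urllib.parse import parse_qsl, unquote


def decode_params(query_string):
    """Return {name: decoded text} for parameters that hold valid Base64 text."""
    decoded = {}
    for name, value in parse_qsl(query_string):
        # parse_qsl undoes one layer of percent-encoding; the logged value
        # is encoded twice (%253d), so undo one more layer here.
        value = unquote(value)
        try:
            text = base64.b64decode(value, validate=True).decode("utf-8")
        except ValueError:
            continue  # not Base64 text (for example login=cmd), skip it
        decoded[name] = text
    return decoded
```

For the query string logged above, this returns {'z3': 'infos.php', 'z4': '/wp-content/plugins/'}. A file name and a plugins path handed to a script with login=cmd is not the sort of thing a search engine crawler sends.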
Looking up the IP address, we confirm the fakeness:
Details of 126.96.36.199
IP Address : 188.8.131.52
Location : France (95% accuracy)
Host Name : 195-154-194-111.rev.poneytelecom.eu
So the moral of the story: just because some bot claims to be Googlebot or some other legit bot doesn’t mean that it’s true. If in doubt, examine your logs and then do a forward/reverse lookup to reveal the bot’s true identity.
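For the "examine your logs" part, a first pass can be automated. The Python sketch below assumes log entries use the field labels shown in the two examples above, and that genuine Googlebot hosts end in googlebot.com or google.com. The names parse_entry and needs_lookup are hypothetical. It only compares the claimed user agent against the logged host name, so treat a hit as a reason to do the lookup, not as proof.

```python
import re

# Field labels as they appear in the log entries shown in this article.
FIELDS = ("TIME", "REQUEST", "SITE", "REFERRER", "QUERY STRING",
          "REMOTE ADDRESS", "PROXY ADDRESS", "HOST", "REMOTE IDENTITY",
          "USER AGENT")
LABEL = re.compile(r"\b(" + "|".join(FIELDS) + r"): ")

# Assumption: genuine Googlebot host names end in one of these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def parse_entry(entry):
    """Split one log entry into a {label: value} dictionary."""
    parts = LABEL.split(entry)
    # parts looks like [prefix, label1, value1, label2, value2, ...]
    return {label: value.strip()
            for label, value in zip(parts[1::2], parts[2::2])}


def needs_lookup(entry):
    """True if the entry claims to be Googlebot but the host is not Google's."""
    fields = parse_entry(entry)
    claims_google = "googlebot" in fields.get("USER AGENT", "").lower()
    host = fields.get("HOST", "").lower()
    return claims_google and not host.endswith(GOOGLE_SUFFIXES)
```

Both entries in this article get flagged: the user agent says Googlebot, but the host names end in 163data.com.cn and poneytelecom.eu.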