How to Verify the Four Major Search Engines


Keeping track of your access and error logs is a critical component of any serious security strategy. Often you will see a logged entry that looks legitimate enough to be dismissed as genuine Google fare, only to discover upon closer investigation that it came from a fraudulent agent. There are many such cloaked or disguised agents crawling around these days, mimicking various search engines to fly under the radar.

So it’s always a good idea to implement a procedure for checking select agents for authenticity. In general, the verification process involves a “reverse/forward” DNS lookup, the results of which are then cross-checked against the search engine in question. So if you want to verify the four major search engines (Google, Bing, Yahoo!, and Ask), or anything else for that matter, this quick tutorial will show you how it’s done.

Intro

The best way to verify any questionable IP address is to perform a “reverse/forward” DNS lookup. It is an excellent technique for investigating any suspicious activity happening at your domain. For many, reverse/forward DNS lookups are common practice and an important part of any serious security strategy. For those unfamiliar with the technique or otherwise interested in refreshing those critical skills, here are the steps.

Step 1: Reverse DNS Lookup

To do a reverse DNS lookup, you need the IP address of whatever it is that you want to investigate. For example, if I find some rogue bot terrorizing my site’s access logs, I copy its IP address to my clipboard for further analysis. So one way or another, you will need an actual IP address in order to do a reverse lookup of the associated DNS information.

Once equipped with an IP address, locate a decent reverse DNS lookup tool. There, enter the suspect’s IP address (e.g., 66.249.66.1) in the “IP Address” box, and then click “Find Host Name”. That will return the registered hostname along with some other information, which should look something like this:

IP Address : 66.249.66.1
Location   : United States (95% accuracy)
Host Name  : crawl-66-249-66-1.googlebot.com

That’s the money right there. Now we want to verify that the host name isn’t spoofed, such that the IP address resolves to the hostname and vice versa.
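
If you would rather script this step than use a web tool, PHP’s built-in gethostbyaddr() performs the same reverse lookup. Here is a minimal sketch of how that might look. The function name reverse_lookup() and the optional $resolver parameter are my own illustrative choices (the parameter just lets you swap in a stub for offline testing), not part of any library.

```php
<?php
/**
 * Reverse DNS lookup: IP address -> hostname.
 * reverse_lookup() and $resolver are hypothetical names used for illustration.
 * Returns the hostname, or null if the IP is malformed or the lookup fails.
 */
function reverse_lookup(string $ip, ?callable $resolver = null): ?string
{
    // Reject anything that is not a valid IPv4/IPv6 address up front.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return null;
    }

    $resolver = $resolver ?? 'gethostbyaddr';
    $host = $resolver($ip);

    // gethostbyaddr() returns false on malformed input and the unmodified
    // IP address when the lookup fails, so treat both as "no hostname".
    if ($host === false || $host === $ip) {
        return null;
    }

    // Normalize case and drop any trailing dot for easier comparison.
    return strtolower(rtrim($host, '.'));
}

// Example (requires network access):
// echo reverse_lookup('66.249.66.1');
```

A null result simply means there is no usable hostname to check, which is already a strike against any agent claiming to be a major search engine.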

Step 2: Forward DNS Lookup

To verify that the hostname is not spoofed, revisit the DNS lookup page. There, enter the hostname (e.g., crawl-66-249-66-1.googlebot.com) in the “Host Name” box, and then click “Find IP Address”. That will return the registered IP address along with some other information, which should look something like this:

Host Name  : crawl-66-249-66-1.googlebot.com
IP Address : 66.249.66.1
Location   : United States (94% accuracy)

And that’s all there is to it. The actual verification part happens when you compare the results to make sure that everything matches up.

The hostname returned by the reverse lookup should resolve back to the same IP address that you started with, and it should belong to the search engine’s own domain (googlebot.com in this example). If either check fails, the suspect bot/agent is almost certainly bogus and should be dealt with swiftly and without mercy. In the next section, I provide two easy ways to block subsequent access for any sneaky little bastards that you may happen to find. And of course, if there is any doubt, try the reverse/forward lookup again using one or two different DNS lookup tools.
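
For anyone who wants to automate the whole comparison, the two lookups and the domain cross-check can be sketched in PHP roughly as follows. Treat this as a sketch under stated assumptions rather than a finished tool: verify_crawler() and its parameters are hypothetical names, the domain suffixes in the example are assumed values that you should confirm against each search engine’s own documentation, and gethostbynamel() only returns IPv4 addresses.

```php
<?php
/**
 * Reverse/forward DNS check for a claimed search-engine crawler.
 * verify_crawler() and its parameters are hypothetical names for this sketch.
 * The suffixes you pass in are assumptions to confirm with each search engine.
 */
function verify_crawler(
    string $ip,
    array $allowedSuffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';   // IP -> hostname
    $forward = $forward ?? 'gethostbynamel';  // hostname -> list of IPv4 addresses

    // Step 1: reverse lookup. A failed lookup returns false or echoes the IP back.
    $host = $reverse($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // Cross-check: the hostname must be a subdomain of an expected domain.
    $domainOk = false;
    foreach ($allowedSuffixes as $suffix) {
        $suffix = '.' . ltrim(strtolower($suffix), '.');
        if (substr($host, -strlen($suffix)) === $suffix) {
            $domainOk = true;
            break;
        }
    }
    if (!$domainOk) {
        return false;
    }

    // Step 2: forward lookup. The original IP must be among the results.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

// Example (requires network access; the suffixes are assumed values):
// var_dump(verify_crawler('66.249.66.1', array('googlebot.com', 'google.com')));
```

The leading dot added to each suffix is deliberate: it keeps a hostname like fakegooglebot.com from passing as googlebot.com.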

Optional: Blocking the IP Address

Detecting malicious activity and blocking site access via .htaccess is one of my favorite pastimes. So if you find some nasty bot that you want to block from accessing your site, you can make it happen via .htaccess (recommended) or PHP (solid technique but not as fast).

Block via .htaccess

Once you have determined the IP address(es) that you would like to block, edit the following code to match, and then copy to the root .htaccess file of your site. Add as many or as few addresses as needed to stop bad bots and spam worms from digging through your business.

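# Apache 2.2 syntax; on Apache 2.4 it requires mod_access_compat
# A three-octet value blocks every address that begins with it
<Limit GET POST PUT>
	Order allow,deny
	Allow from all
	Deny from 111.111.111
	Deny from 222.222.222
	Deny from 123.123.123
</Limit>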

Block via PHP

If .htaccess is not an option, you can employ a quick bit of PHP to block the IP addresses of any incestuous cave-dwelling australopithecines that may be asking for it. Here is the code required to make it happen:

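<?php
// Full IP addresses to block (REMOTE_ADDR is always a complete address)
$deny = array("111.111.111.111", "222.222.222.222", "123.123.123.123");
if (in_array($_SERVER['REMOTE_ADDR'], $deny, true)) {
   // Send the visitor elsewhere and stop processing
   header("Location: http://www.google.com/");
   exit();
} ?>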

Simply edit the array of IP addresses to suit your needs and place the code at the top of any PHP file to which you would like to deny access. For WordPress users, a good place for this is your theme’s header.php file, or even better, write a quick function and hook it into an early action such as init.
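
For the WordPress route, a hooked version might look something like the following sketch. The function names pp_is_denied_ip() and pp_block_denied_ips() are hypothetical, the addresses are placeholders, and where you put the code (your theme’s functions.php, say, or a small plugin) is up to you.

```php
<?php
/**
 * Deny-list check meant to be hooked into WordPress's init action.
 * pp_is_denied_ip() and pp_block_denied_ips() are hypothetical names,
 * and the addresses below are placeholders.
 */
function pp_is_denied_ip(string $ip, array $deny): bool
{
    // Strict comparison against full IP addresses.
    return in_array($ip, $deny, true);
}

function pp_block_denied_ips(): void
{
    $deny = array('111.111.111.111', '222.222.222.222', '123.123.123.123');
    $ip   = $_SERVER['REMOTE_ADDR'] ?? '';

    if (pp_is_denied_ip($ip, $deny)) {
        // Same behavior as the standalone snippet: redirect and stop.
        header('Location: http://www.google.com/');
        exit();
    }
}

// Register the hook only when running inside WordPress.
if (function_exists('add_action')) {
    add_action('init', 'pp_block_denied_ips');
}
```

Keeping the comparison in its own small function makes it easy to test without loading WordPress, and the function_exists() guard lets the file load harmlessly outside of it.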

For more information on blocking IP addresses with PHP, check out our aptly named article, How to Block IP Addresses with PHP.