One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots are immediately denied access to your site. I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.
In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.
The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. The blackhole script combines heavily modified versions of the Kloth.net script (for the bot trap) and the Network Query Tool (for the whois lookups) (404 link removed 2012/07/08). Refined over the years and completely revamped for this tutorial, the Blackhole consists of a single plug-&-play directory that contains the following four files:
- .htaccess – basic directory protection
- blackhole.dat – server-writable log file (serves as the blacklist)
- blackhole.php – checks requests against blacklist and blocks bad bots
- index.php – generates blackhole page, performs whois lookup, sends email, and logs data
These four files are all contained in a single directory named “blackhole”.
Installation Overview
I set things up to make implementation as easy as possible. Here are the five basic steps:
- Upload the /blackhole/ directory to your site
- Ensure writable server permissions for the blackhole.dat file
- Add a single line to the top of your pages to include the blackhole.php file
- Add a hidden link to the /blackhole/ directory in the footer of your pages
- Prohibit crawling of the /blackhole/ directory by adding a line to your robots.txt file
It’s that easy to install on your own site, but there are many ways to customize functionality. For complete instructions, jump ahead to Implementation and Configuration. For now, I think a good way to understand how it works is to check out a demo.
One-time Live Demo
I have set up a working demo of the Blackhole for this tutorial. It works exactly like the download version, but it’s configured to block you only from the demo, not from the entire site. Here’s how it works:
- First visit to the Blackhole demo loads the trap page, runs the whois lookup, and adds your IP address to the blacklist data file
- Once you’re added to the blacklist, all subsequent requests for the Blackhole demo will be denied access
So you get one chance to see how it works. Once you visit, your IP will be blocked from the demo only – you will still have full access to this tutorial (and everything else). That said, here is the demo link: Blackhole Demo. Visit once to see the Blackhole trap, and then again to observe that you’ve been blocked. If I were to include the blackhole.php in the header of my theme files, you would be banned from pretty much the entire site.
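To make the first-visit step more concrete, here is a minimal sketch of how a trap page might record a visitor. It is not the code from the download: the function name blackhole_log_visit and the record layout (IP, user agent, and timestamp separated by " | ") are assumptions for illustration only.

```php
<?php
// Hypothetical helper: append one visitor record to the blacklist data file.
// The record layout (IP | user agent | UTC timestamp) is an assumption;
// the real blackhole.dat format may differ.
function blackhole_log_visit(string $dataFile, string $ip, string $userAgent): bool
{
    // Strip line breaks so a crafted user agent cannot inject extra records.
    $userAgent = str_replace(["\r", "\n"], ' ', $userAgent);
    $record = $ip . ' | ' . $userAgent . ' | ' . gmdate('Y-m-d H:i:s') . "\n";

    // FILE_APPEND adds to the end of the file; LOCK_EX keeps two
    // simultaneous requests from interleaving their writes.
    return file_put_contents($dataFile, $record, FILE_APPEND | LOCK_EX) !== false;
}

// Possible usage inside the trap page (path and fallback are assumptions):
// blackhole_log_visit(__DIR__ . '/blackhole.dat', $_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'] ?? '');
```

Appending with an exclusive lock is one simple way to keep the file consistent when several bots hit the trap at once.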
Implementation and Configuration
Here are complete instructions for implementing and configuring the Perishable Press Blackhole:
Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory. This location is not required, but it enables everything to work out of the box. To use a different location, edit the include path in Step 3.
Step 2: Change file permissions for blackhole.dat to make it writable by the server. The permission settings may vary depending on server configuration. If you are unsure about this, ask your host. Note that the blackhole script needs to be able to read and write the blackhole.dat file; because it is a plain data file, it does not need to be executable.
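If you want to confirm that the permissions took effect, a quick check along these lines may help. This is only a sketch: the helper name blackhole_data_file_ok is hypothetical and is not part of the download.

```php
<?php
// Hypothetical helper: report whether the data file exists and whether
// the user PHP runs as can both read it and write to it.
function blackhole_data_file_ok(string $dataFile): bool
{
    return is_file($dataFile)
        && is_readable($dataFile)
        && is_writable($dataFile);
}

// Possible usage from a temporary test script (path is an assumption):
// var_dump(blackhole_data_file_ok($_SERVER['DOCUMENT_ROOT'] . '/blackhole/blackhole.dat'));
```

Run it through the web server rather than the command line, since the two often run as different users.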
Step 3: Include the bot-check script by adding the following line to the top of your pages:
<?php include($_SERVER['DOCUMENT_ROOT'] . "/blackhole/blackhole.php"); ?>
The blackhole.php script checks the request IP against the blacklist data file. If a match is found, the request is blocked with a customizable message. See the source code for more information.
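For readers who want a feel for the check itself, here is a minimal sketch of the idea, not the actual source. It assumes one record per line with the IP address as the first whitespace-delimited field; the function name blackhole_is_blocked and that file layout are assumptions.

```php
<?php
// Hypothetical helper: return true if $ip appears in the blacklist file.
// Assumes one record per line, with the IP address as the first
// whitespace-delimited field; the real blackhole.dat layout may differ.
function blackhole_is_blocked(string $dataFile, string $ip): bool
{
    $handle = @fopen($dataFile, 'r');
    if ($handle === false) {
        // No readable blacklist: fail open rather than block everyone.
        return false;
    }

    $blocked = false;
    while (($line = fgets($handle)) !== false) {
        $fields = preg_split('/\s+/', trim($line));
        // Compare the whole field so "1.2.3.4" does not match "1.2.3.40".
        if ($fields[0] === $ip) {
            $blocked = true;
            break; // Stop at the first match; no need to read the rest.
        }
    }
    fclose($handle);

    return $blocked;
}

// Possible usage at the top of a page (message and status are assumptions):
// if (blackhole_is_blocked(__DIR__ . '/blackhole.dat', $_SERVER['REMOTE_ADDR'])) {
//     http_response_code(403);
//     exit('Access denied.');
// }
```

Comparing the complete first field, rather than searching the line for a substring, avoids blocking an innocent address that merely shares a prefix with a banned one.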
Step 4: Include a hidden link to the /blackhole/ directory in the footer of your pages:
<a style="display:none;" href="http://example.com/blackhole/" rel="nofollow">Do NOT follow this link or you will be banned from the site!</a>
This is the hidden link that bad bots will follow. It’s currently hidden with CSS, so 99% of visitors won’t ever see it. To hide the link from users without CSS, replace the anchor text with a transparent 1-pixel GIF image.
Step 5: Finally, add a Disallow directive to your site’s robots.txt file:
User-agent: *
Disallow: /blackhole/
This step is pretty important. Without the proper robots directives, all bots would fall into the Blackhole because they wouldn’t know any better. If a bot wants to crawl your site, it must obey the rules! The robots rule that we are using basically says, “All bots DO NOT visit the /blackhole/ directory or anything inside of it.” More on this in the next section.
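To illustrate what “obeying the rules” means from the bot’s side, here is a rough sketch of the check a well-behaved crawler performs before requesting a URL. It assumes the Blackhole sits in the site root and that the rule is a plain path prefix (/blackhole/). The function name robots_path_disallowed is hypothetical, and the sketch ignores wildcards, Allow lines, and per-agent groups.

```php
<?php
// Hypothetical helper: does any Disallow value cover the requested path?
// Plain prefix matching only (no wildcards, no Allow rules).
function robots_path_disallowed(string $path, array $disallowPrefixes): bool
{
    foreach ($disallowPrefixes as $prefix) {
        // An empty Disallow value means "nothing is disallowed".
        if ($prefix !== '' && strpos($path, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

// A compliant bot would skip the trap and crawl everything else:
// robots_path_disallowed('/blackhole/', ['/blackhole/']);   // trap: skipped
// robots_path_disallowed('/blog/a-post/', ['/blackhole/']); // normal page: crawled
```

Any bot that requests the trap anyway has either skipped this check or ignored its result, which is exactly the behavior the Blackhole punishes.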
Further customization: The previous five steps will get the Blackhole working, but the index.php requires a few modifications. Open the index.php file and make the following changes:
- Line #54: Edit the path to your site’s robots.txt file
- Line #56: Edit the path to your contact page (or email address)
- Lines #140/141: Edit email address with your own
- And in blackhole.php, edit line #53 with your contact info
These are the recommended changes, but the PHP is clean and generates valid HTML5, so feel free to modify the source code as needed. Note that beyond these items, no other edits need to be made.
Caveat Emptor
Blocking bots is serious business. Good bots obey robots.txt rules, but there may be potentially useful bots that do not. Yahoo is the perfect example: it’s a valid search engine that sends some traffic, but sadly the Yahoo Slurp bot is too stupid to follow the rules. Since setting up the Blackhole several years ago, I’ve seen Slurp disobey robots rules hundreds of times. Bottom line: the Blackhole will block any bot that disobeys the robots.txt directives. Proceed accordingly.
Update: By default, the Blackhole no longer blocks any of the popular search engines. See the next section for more information.
Whitelisting Search Bots
Initially, the Blackhole blocked any bot that disobeyed the robots.txt directives. Unfortunately, as discussed in the comments, Googlebot, Yahoo, and other major search bots do not always obey robots rules. And while blocking Yahoo! Slurp is debatable, blocking Google, MSN/Bing, et al would just be dumb. Thus, the Blackhole now “whitelists” any user agent identifying as any of the following:
- googlebot (Google)
- msnbot (MSN/Bing)
- yandex (Yandex)
- teoma (Ask)
- slurp (Yahoo)
Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “hey look, I’m teh Googlebot!” and the whitelist would grant access. It is possible to verify the true identity of each bot, but as X3M explains in the comments, doing so consumes significant resources and could overload the server. Avoiding that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines.
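For the curious, here is a rough sketch of both ideas: the cheap user-agent match, and the more expensive forward-confirmed reverse DNS check that the Blackhole deliberately skips. Neither function is taken from the download. The names blackhole_ua_whitelisted and blackhole_verify_bot are hypothetical, the host-name suffixes are supplied by the caller, and the two resolver parameters exist only so the logic can be tried without network access.

```php
<?php
// User-agent fragments from the whitelist above.
const BLACKHOLE_WHITELIST = ['googlebot', 'msnbot', 'yandex', 'teoma', 'slurp'];

// Hypothetical helper: true if the user agent claims to be a whitelisted bot.
// This is only a case-insensitive substring match, so it is easy to spoof.
function blackhole_ua_whitelisted(string $userAgent): bool
{
    foreach (BLACKHOLE_WHITELIST as $fragment) {
        if (stripos($userAgent, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: forward-confirmed reverse DNS.
// $allowedSuffixes is an assumption supplied by the caller, for example
// ['.googlebot.com']. The resolvers default to PHP's built-in lookups;
// gethostbynamel() returns IPv4 addresses only.
function blackhole_verify_bot(
    string $ip,
    array $allowedSuffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: IP -> host name. gethostbyaddr() returns the IP unchanged
    // when the lookup fails, and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // Step 2: the host name must end with one of the expected suffixes.
    $host = strtolower($host);
    $suffixOk = false;
    foreach ($allowedSuffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === strtolower($suffix)) {
            $suffixOk = true;
            break;
        }
    }
    if (!$suffixOk) {
        return false;
    }

    // Step 3: host name -> IPs, which must include the original address.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

Each verification costs two DNS lookups, which is the resource concern raised in the comments. If you experiment with it, running it once inside the trap rather than on every page request keeps that cost down.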
License and Disclaimer
The Perishable Press Blackhole is released under the GNU General Public License. Check the Creative Commons for a summary and/or see the Blackhole source code for additional information. Also note that by downloading the Blackhole, you agree to accept full responsibility for its use. In no way shall the author be held accountable for anything that happens after the file has been downloaded.
Blackhole Download
Here you can download the current version of the Blackhole:
Blackhole - version 1.2 - 8KB ZIP
178 Responses
Kate – December 15, 2010
Mark, in his info he says you will only be banned from the blackhole demo page on this site. After you visit that page twice, it should show you as banned from there, but you can still visit this page – otherwise you wouldn’t be able to get any further help here.
What type of site are you putting it on? Did you add the call for blackhole.php to your pages? If you’re using a script that uses a template, you may need to add it to the template. That’s what I had to do.
mark – December 15, 2010
Kate, the demo you are talking about is at
http://perishablepress.com/press/wp-content/online/demos/blackhole/
not
http://perishablepress.com/blackhole/
yes I added the call on the pages. I tried it on two different hostings and on static html on WP and articlems sites.
On the WP site in the end I even parroted this very one, that’s how I got to test THIS blackhole that doesn’t seem to ban me. Not sure what I am doing, lol
anyway it doesn’t matter… I have a system nowadays that doesn’t even require my own hosting to make a few pennies so I don’t care about anything.
mark – December 15, 2010
Btw about the suspension of this account, they will suspend all sites because they just don’t care about you, I have had that happen on all sorts of hosting packages over the years, what you want to do is develop a way to use FREE hostings like blogspot, wordpress etc etc using ONLY your domain and nothing else (hint hint)
Kate – December 15, 2010
Somebody put my site on a “hack this site” forum. For months I have been inundated with bots attacking my guestbook all day every day – I mean like every minute of the day. I am using the Lazarus Guestbook where “spamming is futile”. I also have a security program that prevents spamming in other areas of the site. They have never been able to hack the site or get any spam through, but I still find it stressing just reviewing my logs and looking at all their failed attempts. Also, they are using up my bandwidth.
Last night I installed the blackhole trap in two places on my site. The first one I put in the blackhole folder.
Next, I renamed my Guestbook directory to something else and added it to the robots exclusion. Then I created a new folder with the old Guestbook directory name and put the blackhole script in it.
The spammers visiting the old guestbook folder are being trapped and blocked. Today is the first day in a long time where there are zero log entries for spamming the guestbook.
Thanks! I only wish I had found it sooner.
Jennifer – December 18, 2010
I took a look at the PHP files. In the blackhole.php, unless I am missing something, wouldn’t it make more sense to check the whitelisted user agents against the current $_SERVER['HTTP_USER_AGENT'] rather than opening and scanning the .dat file?
Also, I would suggest breaking out of the while loop as soon as a match is found. No sense in reading the rest of the file.
Any recommendations on applying this solution to a static site? I maintain a 500 page static site. I really don’t want to change them all to php and adding a directive to the .htaccess file to process all html files as php does not work with this host.
The only way I can see is to add the IP addresses directly to the .htaccess file. But it’s a little scary to have a script editing .htaccess on the fly.
Skye – December 18, 2010
@ Jennifer. User agents can be spoofed very easily which is why we’re banning based on the bots behaviour.
Not sure what to do for your static site. I wouldn’t have a script editing the htaccess file. You could maybe cron it to only update htaccess once a day/week.
Jennifer – December 19, 2010
The script adds the IP address to the ban list based on them visiting the forbidden page. But then when checking the ban list, the script ignores any entry that matches a whitelisted user agent. There is no additional checking that I can see. So why bother checking the ban list at all for those with a whitelisted user agent?
The only benefit this could possibly have (vs checking the user agent directly) is if this time the bot is using a whitelisted user agent, but last time it didn’t. That seems unlikely.
I know it would be resource heavy to do reverse lookup as part of blackhole.php. But what about doing a reverse lookup as part of index.php? If the user agent is a whitelisted one, then do a reverse lookup. If the reverse lookup confirms the user agent, then don’t add it to the ban list. You could still send the email out for information.
Then blackhole.php wouldn’t look at the user agent at all. If it’s on the ban list, block it.
Vinny – January 6, 2011
Jeff thx for the amazing tool.
Can we allow a bot which has previously been blocked? I’m looking at this from a testing perspective: I want to test from my computer and, once I am blocked, simply remove the block to let me through.
Regards,
Vinny
Lazza – January 7, 2011
I don’t know why, but sometimes I get an email telling me I’ve been banned from my own site. I need to edit the .dat file then. Maybe it’s related to some kind of prefetching, so I promptly removed any mention of the blackhole from my browsing history. I hope this will stop it from doing nasty things. :)
screenmates – January 9, 2011
How about storing the previous value in session:
if (isset($_SESSION['bh']) && $_SESSION['bh'])
$badbot = 1;
else
{
$badbot = 0;
…
…
…
}
…
HTML goes here…
screenmates – January 9, 2011
John S. Britsios: You suggested placing a robots meta tag (nofollow,noindex,noarchive) in the index.php. But if the bot has already made it to the index.php to read the meta tag directive, isn’t the bot already in the blackhole folder and hence banned by then?
Lazza – January 9, 2011
I think he suggested to use an auxiliary page to check against instead of the index.php. :)
BTW LOL my blackhole keeps blocking the BingBot. :D