Perishable Press

Protect Your Site with a Blackhole for Bad Bots

One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the honeypot trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots are immediately denied access to your site.

WordPress user? Check out the free Blackhole plugin and Blackhole Pro »

I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place. So the percentage of false positives is extremely low to non-existent. It’s an ideal way to protect your site against bad bots silently, efficiently, and effectively.

With a few easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. Refined over the years and completely revamped for this tutorial, the Blackhole consists of a plug-&-play /blackhole/ directory that contains the following three files:

  • .htaccess – protects the log file
  • blackhole.dat – log file
  • index.php – blackhole script

These three files work together to create the Blackhole for Bad Bots. If you are running WordPress, the Blackhole plugin is recommended instead of this standalone PHP version.

Note: By default, .htaccess files are hidden on Windows and OS X, so to view them you need to enable “Show hidden files” on your machine, or use any FTP or code-editing app that is capable of displaying them. It’s a common feature.

Installation Overview

The Blackhole is designed to make implementation as easy as possible. Here is an overview of the steps:

  1. Upload the /blackhole/ directory to your site
  2. Edit the three variables in the “EDIT HERE” section in index.php
  3. Ensure writable server permissions for the blackhole.dat file
  4. Add a single line to the top of your pages to include the index.php file
  5. Add a hidden link to the /blackhole/ directory in the footer
  6. Forbid crawling of /blackhole/ by adding a line to your robots.txt

It’s that easy to install on your own site, but there are many ways to customize functionality. For complete instructions, jump ahead to Implementation and Configuration. For now, I think a good way to understand how it works is to check out a demo.

Update: This is the original Blackhole tutorial for the standalone PHP script. For a summary of this info, check out Blackhole for Bad Bots – PHP Version »

One-time Live Demo

I have set up a working demo of the Blackhole for this tutorial. It works exactly like the download version, but it’s configured to block you only from the demo, not from the entire site. Here’s how it works:

  1. First visit to the Blackhole demo loads the trap page, runs the WHOIS lookup, and adds your IP address to the blacklist data file
  2. Once you’re added to the blacklist, all subsequent requests for the Blackhole demo will be denied access

So you get one chance to see how it works. Once you visit, your IP will be blocked from the demo only – you will still have full access to this tutorial (and everything else). That said, here is the demo link: Blackhole Demo. Visit once to see the Blackhole trap, and then again to observe that you’ve been blocked. Again, even if you are blocked from the demo page, you will continue to have access to everything else on this domain.
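To make the first of those two steps concrete, here is a minimal sketch of how a trap page could record a visitor. This is not the code from the download: the function name blackhole_record_ip() is hypothetical, and the sketch assumes a data file that holds one IP address per line, which may not match the format the shipped script writes.

```php
<?php
// Hypothetical helper (not the shipped Blackhole code): append an IP address
// to a plain-text data file, one address per line, skipping duplicates.
function blackhole_record_ip(string $ip, string $file): bool
{
    // Reject anything that is not a valid IPv4 or IPv6 address.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }

    // Read the existing entries; a missing or unreadable file counts as empty.
    $entries = is_readable($file)
        ? file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : [];
    if ($entries === false) {
        $entries = [];
    }

    // Already listed: nothing to add.
    if (in_array($ip, $entries, true)) {
        return false;
    }

    // Append with an exclusive lock so concurrent requests do not interleave.
    return file_put_contents($file, $ip . PHP_EOL, FILE_APPEND | LOCK_EX) !== false;
}

// Example use on the trap page (the file name is illustrative):
// blackhole_record_ip($_SERVER['REMOTE_ADDR'] ?? '', __DIR__ . '/blackhole.dat');
```

The function returns true only when a new address was actually written, so a second visit from the same IP leaves the file unchanged.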

Implementation and Configuration

Here are complete instructions for implementing the PHP/standalone version of Blackhole for Bad Bots. Note that these steps are written for Apache servers running PHP. The steps are the same for other PHP-enabled servers (e.g., Nginx, IIS), but you will need to replace the .htaccess file and rules with whatever works for your particular server environment. Note: for a concise summary of these steps, check out this tutorial.

Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory. This location is not required, but it enables everything to work out of the box. To use a different location, edit the include path in Step 4.

Step 2: Edit the three variables in the “EDIT HERE” section in index.php.

Step 3: Change file permissions for blackhole.dat to make it writable by the server. The permission settings may vary depending on server configuration. If you are unsure about this, ask your host. Note that the blackhole script needs to be able to read and write the blackhole.dat file; as a plain data file, it does not need execute permission.
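If you are not sure whether the permissions took effect, a quick check from PHP itself can help. The following is only a sketch: blackhole_check_data_file() is a hypothetical name, and it reports read and write access, which is what a data file needs.

```php
<?php
// Hypothetical check: report whether PHP can see, read, and write the data file.
function blackhole_check_data_file(string $file): array
{
    return [
        'exists'   => is_file($file),
        'readable' => is_readable($file),
        'writable' => is_writable($file),
    ];
}

// Example use from a temporary script placed inside /blackhole/:
// print_r(blackhole_check_data_file(__DIR__ . '/blackhole.dat'));
```

Run it through the web server rather than the command line, because the two often run as different users with different permissions.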

Step 4: Include the Blackhole script by adding the following line to the top of your pages (e.g., header.php):

<?php include(realpath(getenv('DOCUMENT_ROOT')) . '/blackhole/index.php'); ?>

The Blackhole script checks the bot’s IP address against the blacklist data file. If a match is found, the request is blocked with a customizable message. View the source code for more information.
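For a rough idea of what that check involves, here is a simplified sketch. It is not the actual source: the function name is hypothetical, and it assumes the data file holds one entry per line with the IP address as the first whitespace-separated field.

```php
<?php
// Hypothetical helper: return true when $ip is the first field of any line
// in the data file. The file format is an assumption, not the documented one.
function blackhole_is_listed(string $ip, string $file): bool
{
    if ($ip === '' || !is_readable($file)) {
        return false;
    }

    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false;
    }

    foreach ($lines as $line) {
        // Compare only the first field so extra log details are ignored.
        $fields = preg_split('/\s+/', trim($line), 2);
        if ($fields[0] === $ip) {
            return true;
        }
    }
    return false;
}

// Example use at the top of a page (file name and message are illustrative):
// if (blackhole_is_listed($_SERVER['REMOTE_ADDR'] ?? '', __DIR__ . '/blackhole.dat')) {
//     http_response_code(403);
//     exit('Access denied.');
// }
```

An exact comparison on the first field avoids the partial matches a plain substring search would give, such as 1.2.3.4 matching 11.2.3.45.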

Step 5: Add a hidden link to the /blackhole/ directory in the footer of your site’s web pages:

<a rel="nofollow" style="display:none;" href="https://example.com/blackhole/">Do NOT follow this link or you will be banned from the site!</a>

This is the hidden link that bad bots will follow. It’s currently hidden with CSS, so 99.999% of visitors won’t ever see it. Alternatively, to hide the link from users without relying on CSS, remove the inline style and replace the anchor text with a transparent 1-pixel GIF image. For example:

<a rel="nofollow" href="https://example.com/blackhole/" title="Do NOT follow this link or you will be banned from the site!"><img src="/images/1px.gif" alt="" /></a>

Remember to edit the link href value and the image src to match the correct locations on your server.

Step 6: Finally, add a Disallow directive to your site’s robots.txt file:

User-agent: *
Disallow: /blackhole/

This step is pretty important. Without the proper robots directives, all bots would fall into the Blackhole because they wouldn’t know any better. If a bot wants to crawl your site, it must obey the rules! The robots rule that we are using basically says, “All bots DO NOT visit the /blackhole/ directory or anything inside of it.” So it is important to get your robots rules correct. Please use a robots validator to verify proper syntax.

Step 7: Done! Remember to test thoroughly before going live. Also check out Further Customization for more ideas.

Testing

You can verify that the script is working by visiting the hidden Blackhole link (added in step 5). That should take you to the Blackhole warning page, and block you from further access. To verify that you’ve been blocked, try visiting another page on your site. To restore site access, you can clear the contents of the blackhole.dat log file.
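Clearing the whole file also forgets every bot you have caught so far. As an alternative, here is a sketch that removes only your own address. The function name is hypothetical, and it assumes one entry per line with the IP address as the first whitespace-separated field, so check your own blackhole.dat before relying on it.

```php
<?php
// Hypothetical helper: drop every line whose first field equals $ip and
// return the number of lines removed. The file format is an assumption.
function blackhole_unblock(string $ip, string $file): int
{
    $lines = is_readable($file)
        ? file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : false;
    if ($lines === false) {
        return 0;
    }

    // Keep every line that does not start with the given address.
    $kept = array_filter($lines, function (string $line) use ($ip): bool {
        $fields = preg_split('/\s+/', trim($line), 2);
        return $fields[0] !== $ip;
    });

    $removed = count($lines) - count($kept);
    if ($removed > 0) {
        // Rewrite the file with the remaining entries under an exclusive lock.
        $data = $kept ? implode(PHP_EOL, $kept) . PHP_EOL : '';
        file_put_contents($file, $data, LOCK_EX);
    }
    return $removed;
}

// Example use (address and file name are illustrative):
// blackhole_unblock('203.0.113.7', __DIR__ . '/blackhole.dat');
```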

Important: Make sure that your robots rules are correct and have proper syntax. For example, you can use the robots checker in Google Webmaster Tools.

Further Customization

The previous steps will get the Blackhole working, but there are some details that you may want to customize:

  • index.php (lines 54–56): Edit the three variables
  • index.php (line 172): Check/replace path to your contact form
  • index.php (lines 159–182): Customize markup of the warning page
  • index.php (line 196): Customize the list of whitelisted bots

These are the recommended changes, but the PHP is clean and generates valid HTML, so feel free to modify the markup as needed.

File Path

If you get an error letting you know that a file cannot be found, it could be an issue with how the script specifies the absolute path, using getenv('DOCUMENT_ROOT'). That function works on a majority of servers, but if it fails on your server for whatever reason, you can simply replace it with the actual path. From Step 4, the include script looks like this:

<?php include(realpath(getenv('DOCUMENT_ROOT')) . '/blackhole/index.php'); ?>

So if you are getting not-found errors, replace the getenv() expression with the actual path:

<?php include('/var/www/httpdocs/blackhole/index.php'); ?>

Here /var/www/httpdocs/ is only an example; substitute the actual absolute path to the blackhole index.php file on your server. As long as you get the path correct, this should fix any “file can’t be found” type errors you may be experiencing.

If in doubt about the actual full absolute path, consult your web host or use a PHP function or constant such as __DIR__ to obtain the correct information. And check out my tutorial over at WP-Mix for more information about including files with PHP and WordPress.
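One way to avoid guessing is to try more than one location and include the first that exists. This is just a sketch under my own assumptions: blackhole_find_script() is a hypothetical helper, and the second candidate assumes the including file sits one level above /blackhole/.

```php
<?php
// Hypothetical helper: return the first candidate path that exists as a file,
// resolved to an absolute path, or null when none of them exist.
function blackhole_find_script(array $candidates): ?string
{
    foreach ($candidates as $path) {
        if (is_string($path) && is_file($path)) {
            return realpath($path) ?: $path;
        }
    }
    return null;
}

// Example use in place of the Step 4 include:
// $script = blackhole_find_script([
//     getenv('DOCUMENT_ROOT') . '/blackhole/index.php', // path used in Step 4
//     __DIR__ . '/blackhole/index.php',                 // relative to this file
// ]);
// if ($script !== null) {
//     include $script;
// }
```

To see the real path on your server, temporarily echo __DIR__ from a file inside /blackhole/ and copy the result.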

Caveat Emptor

Blocking bots is serious business. Good bots obey robots.txt rules, but there may be potentially useful bots that do not. Yahoo is the perfect example: it’s a valid search engine that sends some traffic, but sadly the Yahoo Slurp bot is too stupid to follow the rules. Since setting up the Blackhole several years ago, I’ve seen Slurp disobey robots rules hundreds of times.

By default, the Blackhole DOES NOT BLOCK any of the big search engines. So Google, Bing, and company always will be allowed access to your site, even if they disobey your robots.txt rules. See the next section for more information.

Whitelisting Search Bots

Blackhole whitelists all bots related to any of the following search engines:

  • AOL.com
  • Baidu
  • Bing/MSN
  • DuckDuckGo
  • Google
  • Teoma
  • Yahoo!
  • Yandex

More specifically, here is the list of regex strings that are checked for each request:

aolbuild, baidu, bingbot, bingpreview, msnbot, duckduckgo, adsbot-google, googlebot, mediapartners-google, teoma, slurp, yandex

So any bot that reports a user agent that contains any of these strings will NOT be blocked and will have full access to your site under all conditions. To customize the list of whitelisted bots, open index.php and edit line 196.
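Here is roughly what a check like that looks like, as a sketch rather than a copy of the source. The function name is hypothetical; the strings are the ones listed above, joined into a single case-insensitive pattern.

```php
<?php
// Hypothetical helper: true when the user agent contains any whitelisted string.
function blackhole_is_whitelisted(string $userAgent): bool
{
    $bots = [
        'aolbuild', 'baidu', 'bingbot', 'bingpreview', 'msnbot', 'duckduckgo',
        'adsbot-google', 'googlebot', 'mediapartners-google', 'teoma', 'slurp', 'yandex',
    ];

    // Escape each string, then join them into one alternation: /(a|b|c)/i
    $escaped = array_map(fn(string $bot): string => preg_quote($bot, '/'), $bots);
    $pattern = '/(' . implode('|', $escaped) . ')/i';

    return preg_match($pattern, $userAgent) === 1;
}

// Example use:
// if (blackhole_is_whitelisted($_SERVER['HTTP_USER_AGENT'] ?? '')) { /* never block */ }
```

To whitelist another bot in a sketch like this, you would add a distinctive lowercase fragment of its user agent to the array.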

Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “Hey look, I’m teh Googlebot!” and the whitelist would grant access.

It is possible to verify the true identity of each bot, but doing so consumes significant resources and could overload the server. To avoid that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines.
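For the curious, the usual way to verify a bot is a forward-reverse DNS lookup: resolve the IP to a hostname, check that the hostname belongs to the search engine, then resolve the hostname back and confirm it returns the same IP. The sketch below is not part of the Blackhole. The function name and parameters are hypothetical, the domain suffixes are supplied by the caller and should come from each search engine’s own documentation, and because it relies on gethostbynamel() it handles IPv4 only.

```php
<?php
// Sketch of forward-confirmed reverse DNS. The resolver callables default to
// PHP's gethostbyaddr() and gethostbynamel(); they are parameters so the logic
// can be exercised without network access.
function blackhole_verify_bot(
    string $ip,
    array $allowedSuffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Reverse lookup: IP -> hostname. gethostbyaddr() returns the IP unchanged on failure.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // The hostname must end with one of the expected domains, e.g. ".googlebot.com".
    $matches = false;
    foreach ($allowedSuffixes as $suffix) {
        $suffix = '.' . ltrim(strtolower($suffix), '.');
        if (substr($host, -strlen($suffix)) === $suffix) {
            $matches = true;
            break;
        }
    }
    if (!$matches) {
        return false;
    }

    // Forward lookup: hostname -> IPs, which must include the original IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

// Example use (the suffix list is illustrative; check the search engine's docs):
// $isGoogle = blackhole_verify_bot($_SERVER['REMOTE_ADDR'], ['googlebot.com', 'google.com']);
```

Each call can trigger two DNS queries, which is the cost described above, so anything like this would need caching before running on every request.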

License and Disclaimer

Blackhole for Bad Bots is released under the GNU General Public License. By downloading the Blackhole, you agree to accept full responsibility for its use. In no way shall the author be held accountable for anything that happens after the file has been downloaded.

Questions & Feedback

Questions? Comments? Send ’em via my contact form. Thanks!

Blackhole Download

Here you can download the latest version of Blackhole for Bad Bots.

Standalone PHP version, last updated: 2018/05/11
Blackhole for Bad Bots – Version 4.1 (5 KB zip)
Note: if you have trouble “unzipping” the file, try downloading the file again using the Iridium browser.

Jeff Starr
About the Author Jeff Starr = Web Developer. Book Author. Secretly Important.
184 responses
  1. Philippe @ HTML5 Débloque-notes November 3, 2010 @ 9:20 am

    Hi Jeff
    Here it is: http://twitpic.com/33ljzu/full

    In the screenshot you gonna see the URI: /blackhole/process.php
    since it was Skye’s solution I’ve prefered. ;-)

    I remember that some French SEO experts ran a few tests some months ago: Googlebot seemed not to obey nofollow links inside a website, nor the instructions in the robots.txt file.

    (Above in my comment, I wanted to say (instead of .htaccess): in my robots.txt Disallow: /*/blackhole/*… of course ;-))

  2. “One or more wildcard (*) characters can now be in a URL path, but may not be recognized by older robot crawlers” http://www.searchtools.com/robots/robots-txt.html

  3. Still no problems with mine and it’s not indexed with google.

    @ Philippe, well you did post the blackhole link up top so it could have been crawled that way. Change the dir name, remove the hidden link and update your robots.txt. Give it a week for the bots to update their copy of your robots.txt then put the hidden link back in. After another week see who you’ve caught. If it looks good put the filter code in your header and let it go live.

  4. Philippe @ HTML5 Débloque-notes November 3, 2010 @ 10:54 pm

    @ Skye and Jeff

    the explanation might be here ;-)

    or, using Firefox with Google Toolbar to test the blackhole I’ve sent the speed measurement to the GWT???

  5. Can anyone help me with my previous comment? :)
    Thank you very much in advance…

  6. @ Lazza, what’s your include look like in forum/? Did you change any of the code in blackhole.php where it opens the file? As long as the dat file is in the same directory it shouldn’t have a problem opening it.

  7. @ Skye, my include was copy-pasted from this article. I didn’t change anything except my email and twitter info… It says an error when opening the file but I don’t know how to tell to the php script to be more verbose.

  8. I couldn’t get this to ban me until I changed the call for blackhole.dat in both files to be like this:

    $filename = "/home/xxxxxxx/public_html/blackhole/blackhole.dat"; // scan to prevent duplicates

    Also used the advice above from Erik Rubright (#63) to prevent the ARIN WHOIS lookup errors. Everything works great now. Thanks a lot.

  9. Kate, thank you very much for the $filename tip! I would suggest using:

    $filename = getenv("DOCUMENT_ROOT")."/blackhole/blackhole.dat";

    I think this doesn’t work for Windows hosting, but who is so dumb to use Windows hosting with PHP? :D
    I’ll try again to get the thing to work now. :)

  10. Thanks, Lazza, for the more secure code.

  11. Hi there https://perishablepress.com/blackhole/
    I have just followed the above link into your trap yet I am still posting this. Am I missing something?
    I am only asking this as whatever I do I can’t get this script to work

  12. @ Jeff, with the fix from Kate (and my edit) it finally does work. :) Maybe you should add a line to your post. :)

[ Comments are closed for this post ]