Protect Your Site with a Blackhole for Bad Bots

Posted by Jeff Starr in .htaccess, PHP, Security

Updated November 6, 2024 • 244 comments

One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the honeypot trap, which then performs a WHOIS lookup and records the event in the blackhole data file. Once added to the data file, bad bots are immediately denied access to your site.

Intro
Overview
Live Demo
How to Install
Testing
Customize
Troubleshoot
Caveat Emptor
Whitelist Good Bots
License & Disclaimer
Questions & Feedback
Download

Tip: For a shorter version of this tutorial, check out the Quick Start Guide »

WordPress user? Check out the free Blackhole plugin and Blackhole Pro »

Important: The article below is for the standalone PHP version of Blackhole for Bad Bots. For information about the Blackhole WordPress plugins, visit WordPress.org (free version) and/or Plugin Planet (pro version).

Intro

I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place. So the percentage of false positives is extremely low to non-existent. It’s an ideal way to protect your site against bad bots silently, efficiently, and effectively.

With a few easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. Refined over the years and completely revamped for this tutorial, the Blackhole consists of a plug-&-play /blackhole/ directory that contains the following three files:

.htaccess – protects the log file
blackhole.dat – log file
index.php – blackhole script

These three files work together to create the Blackhole for Bad Bots. If you are running WordPress, the Blackhole plugin is recommended instead of this standalone PHP version.

Note: By default, .htaccess files are hidden on Windows and macOS, so to view them you need to enable “Show hidden files” on your machine, or use any FTP or code-editing app that is capable of displaying them. It’s a common feature.

Overview

The Blackhole is designed to make implementation as easy as possible. Here is an overview of the steps:

Upload the /blackhole/ directory to your site
Edit the four variables in the “EDIT HERE” section in index.php
Ensure writable server permissions for the blackhole.dat file
Add a single line to the top of your pages to include the index.php file
Add a hidden link to the /blackhole/ directory in the footer
Forbid crawling of /blackhole/ by adding a line to your robots.txt

So installation is straightforward, but there are many ways to customize functionality. For complete instructions, jump ahead to the installation steps. For now, I think a good way to understand how it works is to check out a demo.

Live Demo

I have set up a working demo of the Blackhole for this tutorial. It works exactly like the download version, but it’s set up as a sandbox, so when you trigger the trap, it blocks you only from the demo itself. Here’s how it works:

First visit to the Blackhole demo loads the trap page, runs the WHOIS lookup, and adds your IP address to the blacklist data file
Once your IP is added to the blacklist, all future requests for the Blackhole demo will be denied access
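As a rough illustration of the first step above, here is a minimal PHP sketch of how a trap page might record a visitor. The function name sketch_record_hit() and the tab-separated line format are assumptions made for this example; the actual Blackhole script also runs the WHOIS lookup and may use a different log format.

```php
<?php
// Hypothetical helper (not part of the Blackhole download): append one line
// per trapped visitor to a data file. The "IP <tab> time <tab> user agent"
// layout is an assumption for illustration only.
function sketch_record_hit(string $dataFile, string $ip, string $userAgent): bool
{
    // Reject anything that is not a valid IPv4 or IPv6 address.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }

    // Replace tabs and line breaks so one hit always stays on one line.
    $agent = preg_replace('/[\t\r\n]+/', ' ', $userAgent);
    $line  = implode("\t", [$ip, gmdate('c'), $agent]) . "\n";

    // FILE_APPEND adds to the end of the file; LOCK_EX avoids interleaved writes.
    return file_put_contents($dataFile, $line, FILE_APPEND | LOCK_EX) !== false;
}
```

Validating the address before writing keeps malformed values out of the file, and the exclusive lock matters because several bots can hit the trap at the same moment.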

So you get one chance (per IP address) to see how it works. Once you visit the demo, your IP address will be blocked from the demo only — you will still have full access to this tutorial (and everything else at Perishable Press). So with that in mind, here is the demo link (opens new tab):

Live Demo: Blackhole Demo »

Visit once to see the Blackhole trap, and then again to observe that you’ve been blocked. Again, even if you are blocked from the demo page, you will continue to have access to everything else here at Perishable Press.

Tip: You can visit the Blackhole Demo via any free proxy service if you want to try again and do another test.

How to Install

Here are complete instructions for implementing the PHP/standalone version of Blackhole for Bad Bots. Note that these steps are written for Apache servers running PHP. The steps are the same for other PHP-enabled servers (e.g., Nginx, IIS), but you will need to replace the .htaccess file and rules with whatever works for your particular server environment. Note: for a concise summary of these steps, check out this tutorial.

Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory. This location is not required, but it enables everything to work out of the box. To use a different location, edit the include path in Step 4.

Step 2: Edit the four variables in the “EDIT HERE” section in index.php.

Step 3: Change file permissions for blackhole.dat to make it writable by the server. The permission settings may vary depending on server configuration. If you are unsure about this, ask your host. Note that the blackhole script needs to be able to read and write to the blackhole.dat file; as a plain data file, it does not need to be executable.
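If you are not sure whether the permissions are right, a quick diagnostic along these lines can help. This is a minimal sketch; sketch_check_data_file() is a hypothetical helper, not part of the Blackhole download.

```php
<?php
// Hypothetical diagnostic: report whether PHP (running as the web server
// user) can see, read, and write the given data file.
function sketch_check_data_file(string $dataFile): array
{
    return [
        'exists'   => is_file($dataFile),
        'readable' => is_readable($dataFile),
        'writable' => is_writable($dataFile),
    ];
}
```

Assuming the default location, calling print_r(sketch_check_data_file($_SERVER['DOCUMENT_ROOT'] . '/blackhole/blackhole.dat')) from a temporary test page shows the three flags. Run it through the web server rather than the command line, because the two often run as different users.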

Step 4: Include the Blackhole script by adding the following line to the top of your pages (e.g., header.php):

<?php include(realpath(getenv('DOCUMENT_ROOT')) . '/blackhole/index.php'); ?>

The Blackhole script checks the bot’s IP address against the blacklist data file. If a match is found, the request is blocked with a customizable message. View the source code for more information.
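The lookup itself can be pictured with a short sketch. It assumes one record per line with the IP address as the first whitespace-separated field; that layout and the name sketch_is_blocked() are assumptions for illustration, so check index.php for how the real script reads blackhole.dat.

```php
<?php
// Hypothetical lookup: true when $ip is the first field of any line in the file.
function sketch_is_blocked(string $dataFile, string $ip): bool
{
    if (!is_readable($dataFile)) {
        return false; // no readable data file means nothing is blocked
    }

    $lines = file($dataFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false;
    }

    foreach ($lines as $line) {
        // Compare the whole first field so 1.2.3.4 does not match 1.2.3.40.
        $fields = preg_split('/\s+/', trim($line));
        if ($fields[0] === $ip) {
            return true;
        }
    }
    return false;
}
```

A page could then respond to a match with something like http_response_code(403) followed by exit, which is roughly the "blocked with a customizable message" behavior described above.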

Step 5: Add a hidden link to the /blackhole/ directory in the footer of your site’s web pages (replace “Your Site Name” with the name of your site):

<a rel="nofollow" style="display:none" href="https://example.com/blackhole/" title="Do NOT follow this link or you will be banned from the site!">Your Site Name</a>

This is the hidden trigger link that bad bots will follow. It’s currently hidden with CSS, so 99.999% of visitors won’t ever see it. Alternatively, to hide the link from users without relying on CSS, remove the style attribute and replace the anchor text with a transparent 1-pixel GIF image. For example:

<a rel="nofollow" href="https://example.com/blackhole/" title="Do NOT follow this link or you will be banned from the site!"><img src="/images/1px.gif" alt=""></a>

Remember to edit the link href value and the image src to match the correct locations on your server.

Step 6: Finally, add a Disallow directive to your site’s robots.txt file:

User-agent: *
Disallow: /blackhole/

This step is pretty important. Without the proper robots directives, all bots would fall into the Blackhole because they wouldn’t know any better. If a bot wants to crawl your site, it must obey the rules! The robots rule that we are using basically says, “All bots DO NOT visit the /blackhole/ directory or anything inside of it.” So it is important to get your robots rules correct.

Step 7: Done! Remember to test thoroughly before going live. Also check out the section on customizing for more ideas.

Testing

You can verify that the script is working by visiting the hidden trigger link (added in Step 5). That should take you to the Blackhole warning page for your first visit, and then block you from further access on subsequent visits. To verify that you’ve been blocked entirely, try visiting any other page on your site. To restore site access at any time, you can clear the contents of the blackhole.dat log file.

Important: Make sure that all of the rules in your robots.txt file are correct and have proper syntax. For example, you can check them with the robots.txt tools in Google Search Console, formerly Google Webmaster Tools (requires Google account).
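For a quick local sanity check of the rule from Step 6, something like the following sketch can confirm that the group for all user agents disallows the trap directory. It is deliberately simplified: sketch_robots_disallows() is a hypothetical helper that only looks for an exact path match, and it is not a full robots.txt parser.

```php
<?php
// Hypothetical check: does the "User-agent: *" group contain "Disallow: $path"?
function sketch_robots_disallows(string $robotsTxt, string $path): bool
{
    $agents  = [];    // user agents named by the group currently being read
    $inRules = false; // becomes true once the current group has a rule line

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // A User-agent line that follows rule lines starts a new group.
            if ($inRules) {
                $agents  = [];
                $inRules = false;
            }
            $agents[] = $value;
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            if ($field === 'disallow' && $value === $path && in_array('*', $agents, true)) {
                return true;
            }
        }
    }
    return false;
}
```

For example, passing the contents of your robots.txt file and '/blackhole/' should return true once Step 6 is done. A false result usually means the rule sits under a more specific user agent or the path is misspelled.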

Tip: Make sure to check the list of whitelisted user agents. For example, the Chrome browser is whitelisted. So if you want to test that Blackhole is working, either use a non-Chrome browser or remove chrome from the whitelist.

Tip: To reset the list of blocked bots at any time, simply clear the contents of the blackhole.dat file.

Customize

The previous steps will get the Blackhole set up with default configuration, but there are some details that you may want to customize:

index.php (lines 25–28): Edit the four variables as needed
index.php (lines 140–164): Customize markup of the warning page
index.php (line 180): Customize the list of whitelisted bots

These are the recommended changes, but the PHP is clean and generates valid HTML, so feel free to modify the markup or anything else as needed.

Troubleshoot

If you get an error letting you know that a file cannot be found, it could be an issue with how the script specifies the absolute path, using getenv('DOCUMENT_ROOT'). That function works on a majority of servers, but if it fails on your server for whatever reason, you can simply replace it with the actual path. From Step 4, the include script looks like this:

<?php include(realpath(getenv('DOCUMENT_ROOT')) . '/blackhole/index.php'); ?>

So if you are getting not-found or similar errors, try this instead:

<?php include('/var/www/httpdocs/blackhole/index.php'); ?>

Here, /var/www/httpdocs/blackhole/index.php stands in for the actual absolute path to the blackhole index.php file on your server. As long as you get the path correct, it should fix any “file can’t be found” type errors you may be experiencing.

If in doubt about the actual full absolute path, consult your web host or use a PHP function or constant such as __DIR__ to obtain the correct information. And check out my tutorial over at WP-Mix for more information about including files with PHP and WordPress.
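As one way to use __DIR__ for this, here is a minimal sketch that builds the include path from a known base directory and fails quietly if the file is missing. sketch_blackhole_path() is a hypothetical helper, and the usage shown in the comments assumes the calling file sits in the same directory that contains /blackhole/.

```php
<?php
// Hypothetical helper: resolve the Blackhole script relative to a base
// directory instead of relying on getenv('DOCUMENT_ROOT').
function sketch_blackhole_path(string $baseDir): ?string
{
    // realpath() returns false when the file does not exist.
    $path = realpath($baseDir . '/blackhole/index.php');
    return $path === false ? null : $path;
}

// Example usage from a file in the site root:
// $path = sketch_blackhole_path(__DIR__);
// if ($path !== null) {
//     include $path;
// } else {
//     error_log('Blackhole script not found');
// }
```

Temporarily echoing the returned value is also a handy way to discover the absolute path to hard-code in the include line.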

Tip: I wrote an in-depth guide on how to verify that Blackhole is working. It is written for users of the WordPress plugin, but the general steps show how to test the PHP/standalone version as well. Long story short: use a proxy service.

Caveat Emptor

Blocking bots is serious business. Good bots obey robots.txt rules, but there may be potentially useful bots that do not. Yahoo is the perfect example: it’s a valid search engine that sends some traffic, but sadly the Yahoo Slurp bot is too stupid to follow the rules. Since setting up the Blackhole several years ago, I’ve seen Slurp disobey robots rules hundreds of times.

By default, the Blackhole DOES NOT BLOCK any of the big search engines. So Google, Bing, and company always will be allowed access to your site, even if they disobey your robots.txt rules. See the next section for more details.

Whitelist Good Bots

In order to ensure that all of the major search engines always have access to your site, Blackhole whitelists the following bots:

AOL.com
Baidu
Bing/MSN
DuckDuckGo
Google
Teoma
Yahoo!
Yandex

Additionally, popular social media services are whitelisted, as well as some other known “good” bots. To whitelist these bots, the Blackhole script uses regular expressions to ensure that all possible name variations are allowed access. For each request made to your site, Blackhole checks the User Agent and always allows anything that contains any of the following strings:

a6-indexer, adsbot-google, ahrefsbot, aolbuild, apis-google, baidu, bingbot, bingpreview, butterfly, chrome, cloudflare, duckduckgo, embedly, facebookexternalhit, facebot, googlebot, google page speed, ia_archiver, linkedinbot, mediapartners-google, msnbot, netcraftsurvey, outbrain, pinterest, quora, rogerbot, showyoubot, slackbot, slurp, sogou, teoma, tweetmemebot, twitterbot, uptimerobot, urlresolver, vkshare, w3c_validator, wordpress, wp rocket, yandex

So any bot that reports a user agent that contains any of these strings will NOT be blocked and always will have full access to your site. To customize the list of whitelisted bots, open index.php and locate the function blackhole_whitelist(), where you will find the list of allowed bots.
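To make the matching concrete, here is a minimal sketch of a case-insensitive user-agent check. sketch_is_whitelisted() is a hypothetical name used only for this example; the real logic lives in blackhole_whitelist() and may be written differently.

```php
<?php
// Hypothetical check: true when the user agent contains any allowed string.
function sketch_is_whitelisted(string $userAgent, array $allowed): bool
{
    if ($allowed === []) {
        return false; // an empty pattern would otherwise match everything
    }

    // Escape each string so characters such as "-" or "." are matched literally.
    $parts = array_map(fn($s) => preg_quote($s, '/'), $allowed);

    // The "i" flag makes the comparison case-insensitive.
    return preg_match('/' . implode('|', $parts) . '/i', $userAgent) === 1;
}
```

For example, sketch_is_whitelisted('Mozilla/5.0 (compatible; Googlebot/2.1)', ['googlebot', 'bingbot', 'slurp']) returns true, and so would any scraper that simply copies that user-agent string.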

The upside of whitelisting these user agents is that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “Hey look, I’m teh Googlebot!” and the whitelist would grant access. It is your decision where to draw the line.

With PHP, it is possible to verify the true identity of each bot, but doing so consumes significant resources and could overload the server. Avoiding that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines and other major web services.

Tip: Check out CLI Forward-Reverse Lookup for how to verify bot identity.
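For reference, the same forward-reverse idea can be sketched in PHP. This is not something the Blackhole script does; sketch_verify_bot() is a hypothetical function, the domain list you pass in is your own choice, and the two optional callables exist only so the lookups can be swapped out or cached. By default it uses PHP's gethostbyaddr() and gethostbynamel(); the latter returns IPv4 addresses only, so this sketch does not cover IPv6.

```php
<?php
// Hypothetical verifier: reverse-resolve the IP, check the host name against
// trusted domains, then forward-resolve the host and confirm it maps back.
function sketch_verify_bot(
    string $ip,
    array $domains,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';  // IP -> host name
    $forward = $forward ?? 'gethostbynamel'; // host name -> list of IPv4 addresses

    // gethostbyaddr() returns the unmodified IP on failure, false on bad input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // The host must end with ".<trusted domain>", not merely contain it.
    $trusted = false;
    foreach ($domains as $domain) {
        $suffix = '.' . strtolower($domain);
        if (substr($host, -strlen($suffix)) === $suffix) {
            $trusted = true;
            break;
        }
    }
    if (!$trusted) {
        return false;
    }

    // The forward lookup must include the original IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

A call such as sketch_verify_bot($ip, ['googlebot.com', 'google.com']) performs two DNS lookups per request, which is exactly the overhead mentioned above, so caching the results would be essential on a busy site.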

License & Disclaimer

Terms of Use: Blackhole for Bad Bots is released under the GNU General Public License. By downloading the Blackhole, you agree to accept full responsibility for its use. In no way shall the author be held accountable for anything that happens after the file has been downloaded.

Questions & Feedback

Questions? Comments? Send ’em via my contact form. Thanks!

Download

Here you can download the latest version of Blackhole for Bad Bots. By downloading, you agree to the terms.

Standalone PHP version, last updated: 2024/11/06

Download Blackhole for Bad Bots, Version 4.7 (4.43 KB ZIP)

About the Author

Jeff Starr = Fullstack Developer. Book Author. Teacher. Human Being.

244 responses to “Protect Your Site with a Blackhole for Bad Bots”

Jay 2020/10/04 3:41 pm

Since I implemented your script over a year ago, and catched some harvesters, I noticed that all the blocked IPs are ipv4 type. Does your script recognize and block also ipv6 machines?
- Jeff Starr 2020/10/04 8:57 pm • Post Author
  
  Yes it understands both IPv4 and IPv6.
  - Jay 2020/10/05 6:20 am
    
    Ohh that’s great to know! :)
Kristian Svensson 2020/12/31 3:37 am

I’m using your blackhole wp plugin on a few of my websites since a while back and I am really impressed how well it works. It’s fun to see when someone tries to scrape my website from different IPs and I receive a couple of emails showing the attempts. I just want to say thank you :)
John 2021/02/26 3:46 pm

Ran across your site yesterday. Thank you so much for your work to make wordpress and other sites safer! And I love that I can use blackhole and 7g on my non-wordpress sites. Question: I’ve moved my wordpress login url & upload directory elsewhere, and I’m curious what you think about redirecting hacker bots to blackhole if they use those urls (and any other urls commonly used to hack). I’d love to have that as part of my wp security. Thanks in advance!
- Jeff Starr 2021/02/27 10:41 am • Post Author
  
  It’s an interesting idea, I will see what’s possible maybe for a future update. Thank you for the feedback, John.
Alex 2021/07/14 4:01 pm

Hello.

Blocked user is shown a warning page:

“You have been banned from this domain. If you think there has been a mistake, contact the administrator via proxy server.”

Link directs to contact page. But the contact page is also blocked! What can be done so that the visitor can contact the administrator with a request to unblock?
- Jeff Starr 2021/07/14 4:57 pm • Post Author
  
  “What can be done so that the visitor can contact the administrator with a request to unblock?”
  
  That is an excellent question. Even after developing Blackhole for years now, I still am not sure of an ideal way to enable a blocked user to contact the site admin easily.
  
  One thing that helps savvy visitors is the default “you’re banned” message:
  
  “If you think there has been a mistake, contact the administrator via proxy server.”
  
  That explains one possible way of contacting the site admin. The idea there is to visit the contact URL via proxy service. Not all users understand that or are able, etc.
  
  What some other users do is add a link to the contact page at an alternate site, where they are not blocked.
  
  Some other users I’ve seen just link to a social-media profile. Like on Twitter, Facebook, etc. It’s trivial to contact someone on most social media sites.
  
  As a last resort, you can use a free online tool to obscure your email address and then just include it with the banned message. Alternately instead of obfuscation you could display your email address as an image, which is harder (and less likely) for bots to read.
Alex 2021/07/14 5:52 pm

Thanks Jeff for such an extensive answer.

I was thinking about html-page or php-page without “… php include (realpath ….”. With submit form and without any links.

But the solutions proposed by you are easier and simpler.

And thanks for the ‘Blackhole for Bad Bots’
- Jeff Starr 2021/07/14 5:58 pm • Post Author
  
  I like the idea of an exclude path, will be adding to the Pro version of Blackhole for WP. That way you can just enter the URL of the page that you want to exclude from bot protection. Of course, the excluded page would be a big target for bad bots. Something to think about, thank you for the idea :)
oagroot 2022/04/04 2:32 am

Every now and then the trap traps my own local network’s reverse proxy server. When this happens, no one can enter the site and I have to edit the blackhole.dat to eliminate the record (the only record, BTW). The curious thing is that the registered address is not the public DNS name, nor the public ip, but the internal proxy server address. I would like to know whose is the bot that is triggering the trap and how I can whitelist my reverseproxy’s IP. Also is worth to mention that this does not happen every day, week or month, rather seldom. I am suspicious about there can be a hidden bot in my site that is doing such sanning and getting caught in the honeypot, in which case, whitelisting my site would be dangerous.
¿any ideas?
Allen Ford 2022/09/03 11:03 am

I have created a trap with a different technique , but will be injecting some of this as alternative options. I have added mysql and Firewalld support along with ipqualityscore API.

Please Review and let me know what you think
https://www.cyfordtechnologies.com/article.php/securing-your-opencart-website
- Jeff Starr 2022/09/03 11:07 am • Post Author
  
  Thanks for sharing, Allen. The technique looks pretty intense, has it been tested much?
Viorel-Cosmin Miron 2023/02/04 4:45 am

Hi,

For some reason, the plugin got the IP in the .dat. file, yet it only block users on the /blackhole page, not on other pages, any ideas why?
- Jeff Starr 2023/02/04 9:41 am • Post Author
  
  Because the script is not included on the other pages. Check the steps for more information.
Rach 2023/02/08 3:16 am

I have implemented the script on my site, so far so good! I am finding that my firewall doesn’t always catch fraudulent IPs, and some of those IPs don’t always fall into the trap as they only seem to scan a few pages at a time and then change IPs. If I wanted to manually add known Ips to the blacklist via an IP range can this be done?

Thanks.
Lubo 2023/04/05 10:32 am

Hi, Jeff,

If it’s an HTML-only site, can we skip step 4?

Thanks.
- Jeff Starr 2023/04/05 11:33 am • Post Author
  
  Hey Lubo,
  
  Sure, as long as there is some other way of including the script.
yor 2023/04/16 6:18 am

Demo not working, and script also!
- Jeff Starr 2023/04/16 10:55 am • Post Author
  
  Thanks for reporting. It was the whitelist (not blocking Chrome et al). I’ve disabled whitelist on the demo. For the script, you’ll need to remove/edit whitelist as explained in the above instructions.
Guest 2023/09/17 10:27 pm

I’m using WordPress + WP Rocket, but since I prefer to do things manually as much as possible, I’ve opted for the standalone version instead of the plugin.

Despite using page caching, this was actually somewhat working fine for a few years, but not anymore.

I would think since I’m manually calling the blackhole file at the top of the theme’s header.php file that the WP Init firing wouldn’t be an issue?

Anyway, since I’m comfortable manually editing files and code, any ideas on a good way to implement this to override any WP or WP Rocket issues?