Protect Your Site with a Blackhole for Bad Bots
One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the honeypot trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots immediately are denied access to your site.
Contents
- Intro
- Overview
- Live Demo
- How to Install
- Testing
- Customize
- Troubleshoot
- Caveat Emptor
- Whitelist Good Bots
- License & Disclaimer
- Questions & Feedback
- Download
Intro
I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place. So the percentage of false positives is extremely low to non-existent. It’s an ideal way to protect your site against bad bots silently, efficiently, and effectively.
With a few easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.
The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. Refined over the years and completely revamped for this tutorial, the Blackhole consists of a plug-&-play /blackhole/ directory that contains the following three files:
- .htaccess – protects the log file
- blackhole.dat – log file
- index.php – blackhole script
These three files work together to create the Blackhole for Bad Bots. If you are running WordPress, the Blackhole plugin is recommended instead of this standalone PHP version.
Overview
The Blackhole is developed to make implementation as easy as possible. Here is an overview of the steps:
- Upload the /blackhole/ directory to your site
- Edit the four variables in the “EDIT HERE” section in index.php
- Ensure writable server permissions for the blackhole.dat file
- Add a single line to the top of your pages to include the index.php file
- Add a hidden link to the /blackhole/ directory in the footer
- Forbid crawling of /blackhole/ by adding a line to your robots.txt
So installation is straightforward, but there are many ways to customize functionality. For complete instructions, jump ahead to the installation steps. For now, I think a good way to understand how it works is to check out a demo.
Live Demo
I have set up a working demo of the Blackhole for this tutorial. It works exactly like the download version, but it’s set up as a sandbox, so when you trigger the trap, it blocks you only from the demo itself. Here’s how it works:
- First visit to the Blackhole demo loads the trap page, runs the whois lookup, and adds your IP address to the blacklist data file
- Once your IP is added to the blacklist, all future requests for the Blackhole demo will be denied access
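The two-step behavior above can be sketched in a few lines of PHP. This is only an illustration, not the actual Blackhole source: the function name, the return values, and the one-record-per-line format (IP address, a space, then a timestamp) are all assumptions, and the WHOIS lookup is left out.

```php
<?php
// Hypothetical sketch of the "one-strike" flow; not the real Blackhole code.
// Assumed record format: "<ip> <UTC timestamp>" on each line of the data file.

/**
 * Handle a request for the trap URL.
 * Returns 'trapped' on the first visit from an IP (and records it),
 * or 'denied' when the IP is already in the data file.
 */
function sketch_trap_visit(string $ip, string $dataFile): string
{
    $known = is_readable($dataFile)
        ? file($dataFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : [];
    if ($known === false) {
        $known = [];
    }

    foreach ($known as $line) {
        // Compare the whole first field, so 1.2.3.4 never matches 1.2.3.40
        $fields = explode(' ', $line, 2);
        if ($fields[0] === $ip) {
            return 'denied';
        }
    }

    // First strike: append a record; LOCK_EX guards against concurrent writes
    $record = $ip . ' ' . gmdate('Y-m-d\TH:i:s\Z') . PHP_EOL;
    file_put_contents($dataFile, $record, FILE_APPEND | LOCK_EX);

    return 'trapped';
}
```

Calling the function twice with the same IP and data file returns 'trapped' the first time and 'denied' the second, which mirrors what the demo shows.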
So you get one chance (per IP address) to see how it works. Once you visit the demo, your IP address will be blocked from the demo only — you will still have full access to this tutorial (and everything else at Perishable Press). So with that in mind, here is the demo link (opens new tab):
Visit once to see the Blackhole trap, and then again to observe that you’ve been blocked. Again, even if you are blocked from the demo page, you will continue to have access to everything else here at Perishable Press.
How to Install
Here are complete instructions for implementing the PHP/standalone version of Blackhole for Bad Bots. Note that these steps are written for Apache servers running PHP. The steps are the same for other PHP-enabled servers (e.g., Nginx, IIS), but you will need to replace the .htaccess file and rules with whatever works for your particular server environment. Note: for a concise summary of these steps, check out this tutorial.
Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory. This location is not required, but it enables everything to work out of the box. To use a different location, edit the include path in Step 4.
Step 2: Edit the four variables in the “EDIT HERE” section in index.php.
Step 3: Change file permissions for blackhole.dat to make it writable by the server. The permission settings may vary depending on server configuration. If you are unsure about this, ask your host. Note that the blackhole script needs to be able to read, write, and execute the blackhole.dat file.
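If you are not sure whether the permissions took effect, a quick check from PHP itself can help, because it runs as the same user the web server uses. The helper below is a hypothetical diagnostic, not part of the Blackhole download, and the path in the usage comment is only an example.

```php
<?php
// Hypothetical diagnostic helper; not part of the Blackhole package.

/**
 * Report whether the data file exists and can be read and written
 * by the user that PHP runs as.
 */
function sketch_check_data_file(string $path): array
{
    clearstatcache(true, $path); // avoid stale results after a chmod
    return [
        'exists'   => file_exists($path),
        'readable' => is_readable($path),
        'writable' => is_writable($path),
    ];
}

// Example usage (path is an assumption, adjust to your install):
// var_dump(sketch_check_data_file(__DIR__ . '/blackhole/blackhole.dat'));
```

Run it from a temporary page on the server rather than from the command line, since the command-line user often has different permissions than the web server user.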
Step 4: Include the Blackhole script by adding the following line to the top of your pages (e.g., header.php):
<?php include(realpath(getenv('DOCUMENT_ROOT')) . '/blackhole/index.php'); ?>
The Blackhole script checks the bot’s IP address against the blacklist data file. If a match is found, the request is blocked with a customizable message. View the source code for more information.
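To make the idea concrete, here is a minimal sketch of that kind of check. It is not the actual Blackhole source: the function names are made up, and it assumes each line of the data file starts with the IP address followed by a space, which may not match the real blackhole.dat format.

```php
<?php
// Hypothetical sketch; the real script and data format may differ.

/**
 * Return true if $ip is the first field on any line of $dataFile.
 */
function sketch_ip_is_blacklisted(string $ip, string $dataFile): bool
{
    if (!is_readable($dataFile)) {
        return false; // no data file means nothing is blocked
    }
    $lines = file($dataFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false;
    }
    foreach ($lines as $line) {
        // Match the complete first field, not a prefix of it
        $fields = explode(' ', trim($line), 2);
        if ($fields[0] === $ip) {
            return true;
        }
    }
    return false;
}

/**
 * Send a 403 response with a custom message and stop the request.
 */
function sketch_deny_request(string $message): void
{
    http_response_code(403);
    header('Content-Type: text/plain; charset=utf-8');
    echo $message;
    exit;
}

// Example usage at the top of a page (paths and message are assumptions):
// if (sketch_ip_is_blacklisted($_SERVER['REMOTE_ADDR'], __DIR__ . '/blackhole.dat')) {
//     sketch_deny_request('You have been banned from this domain.');
// }
```

Reading the whole file on every request is fine for a small log; a very large blacklist might call for a different storage approach.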
Step 5: Add a hidden link to the /blackhole/ directory in the footer of your site’s web pages (replace “Your Site Name” with the name of your site):
<a rel="nofollow" style="display:none" href="https://example.com/blackhole/" title="Do NOT follow this link or you will be banned from the site!">Your Site Name</a>
This is the hidden trigger link that bad bots will follow. It’s currently hidden with CSS, so 99.999% of visitors won’t ever see it. Alternately, to hide the link from users without relying on CSS, replace the anchor text with a transparent 1-pixel GIF image. For example:
<a rel="nofollow" style="display:none" href="https://example.com/blackhole/" title="Do NOT follow this link or you will be banned from the site!"><img src="/images/1px.gif" alt=""></a>
Remember to edit the link href value and the image src to match the correct locations on your server.
Step 6: Finally, add a Disallow directive to your site’s robots.txt file:
User-agent: *
Disallow: /blackhole/
This step is pretty important. Without the proper robots directives, all bots would fall into the Blackhole because they wouldn’t know any better. If a bot wants to crawl your site, it must obey the rules! The robots rule that we are using basically says, “All bots DO NOT visit the /blackhole/ directory or anything inside of it.” So it is important to get your robots rules correct.
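To see why that one rule covers the directory and everything inside it, it helps to know that a Disallow value is matched as a prefix of the requested path. The sketch below shows that matching from a well-behaved crawler's point of view. It is deliberately simplified and hypothetical: it ignores Allow rules, wildcards, and per-agent groups.

```php
<?php
// Simplified, hypothetical sketch of Disallow prefix matching.
// Real robots.txt parsers also handle Allow, wildcards, and user-agent groups.

/**
 * Return true if $path starts with any non-empty Disallow value.
 */
function sketch_path_is_disallowed(string $path, array $disallowRules): bool
{
    foreach ($disallowRules as $rule) {
        if ($rule === '') {
            continue; // an empty Disallow value forbids nothing
        }
        if (strncmp($path, $rule, strlen($rule)) === 0) {
            return true;
        }
    }
    return false;
}
```

With the rule /blackhole/, both /blackhole/ and /blackhole/index.php are disallowed, while /blog/ is not. Note that /blackhole without the trailing slash is not covered by the prefix, which is one reason to link to the directory with the slash included.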
Step 7: Done! Remember to test thoroughly before going live. Also check out the section on customizing for more ideas.
Testing
You can verify that the script is working by visiting the hidden trigger link (added in step 5). That should take you to the Blackhole warning page for your first visit, and then block you from further access on subsequent visits. To verify that you’ve been blocked entirely, try visiting any other page on your site. To restore site access at any time, you can clear the contents of the blackhole.dat log file.
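If you lock yourself out while testing and would rather not edit the file by hand, the log can also be emptied from PHP. The function below is a hypothetical helper, not something included in the Blackhole download, and the path in the usage comment is an assumption.

```php
<?php
// Hypothetical helper for test environments; not part of the Blackhole package.

/**
 * Empty the data file without deleting it, so its permissions are preserved.
 * Returns true on success.
 */
function sketch_reset_blacklist(string $dataFile): bool
{
    // LOCK_EX avoids truncating while another request is appending
    return file_put_contents($dataFile, '', LOCK_EX) !== false;
}

// Example usage (run once from the command line or a protected page):
// sketch_reset_blacklist('/path/to/blackhole/blackhole.dat');
```

Keep in mind that this clears every entry, including any genuinely bad bots that were caught.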
Important: Make sure that all of the rules in your robots.txt file are correct and have proper syntax. For example, you can use the free robots.txt validator in Google Webmaster Tools (requires Google account).
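Note: the default whitelist includes chrome, so if you test with the Chrome browser you will not be blocked unless you first remove chrome from the whitelist.
Customize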
The previous steps will get the Blackhole set up with default configuration, but there are some details that you may want to customize:
- index.php (lines 25–28): Edit the four variables as needed
- index.php (lines 140–164): Customize markup of the warning page
- index.php (line 180): Customize the list of whitelisted bots
These are the recommended changes, but the PHP is clean and generates valid HTML, so feel free to modify the markup or anything else as needed.
Troubleshoot
If you get an error letting you know that a file cannot be found, it could be an issue with how the script specifies the absolute path, using getenv('DOCUMENT_ROOT'). That function works on a majority of servers, but if it fails on your server for whatever reason, you can simply replace it with the actual path. From Step 4, the include script looks like this:
<?php include(realpath(getenv('DOCUMENT_ROOT')) . '/blackhole/index.php'); ?>
So if you are getting not-found or similar errors, try this instead:
<?php include('/var/www/httpdocs/blackhole/index.php'); ?>
Here, /var/www/httpdocs/blackhole/index.php stands in for the actual absolute path to the blackhole index.php file on your server. As long as you get the path correct, this should fix any “file can’t be found” type errors you may be experiencing.
If in doubt about the actual full absolute path, consult your web host or use a PHP function or constant such as __DIR__ to obtain the correct information. And check out my tutorial over at WP-Mix for more information about including files with PHP and WordPress.
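One way to take the guesswork out of the path is to try a couple of candidate locations and include the first one that actually exists. This is a hedged sketch rather than anything shipped with the Blackhole: the function name is made up, and the candidate paths in the usage comment assume the /blackhole/ directory sits either in the document root or next to the including file.

```php
<?php
// Hypothetical helper; not part of the Blackhole package.

/**
 * Return the resolved absolute path of the first candidate that is a real
 * file, or null when none of them exist.
 */
function sketch_find_blackhole(array $candidates): ?string
{
    foreach ($candidates as $candidate) {
        $resolved = realpath($candidate); // false when the path does not exist
        if ($resolved !== false && is_file($resolved)) {
            return $resolved;
        }
    }
    return null;
}

// Example usage (candidate locations are assumptions):
// $path = sketch_find_blackhole([
//     getenv('DOCUMENT_ROOT') . '/blackhole/index.php',
//     __DIR__ . '/blackhole/index.php',
// ]);
// if ($path !== null) {
//     include $path;
// }
```

Echoing the returned value once is also a quick way to learn the real absolute path on your server.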
Caveat Emptor
Blocking bots is serious business. Good bots obey robots.txt rules, but there may be potentially useful bots that do not. Yahoo is the perfect example: it’s a valid search engine that sends some traffic, but sadly the Yahoo Slurp bot is too stupid to follow the rules. Since setting up the Blackhole several years ago, I’ve seen Slurp disobey robots rules hundreds of times.
By default, the Blackhole DOES NOT BLOCK any of the big search engines. So Google, Bing, and company always will be allowed access to your site, even if they disobey your robots.txt rules. See the next section for more details.
Whitelist Good Bots
In order to ensure that all of the major search engines always have access to your site, Blackhole whitelists the following bots:
- AOL.com
- Baidu
- Bing/MSN
- DuckDuckGo
- Google
- Teoma
- Yahoo!
- Yandex
Additionally, popular social media services are whitelisted, as well as some other known “good” bots. To whitelist these bots, the Blackhole script uses regular expressions to ensure that all possible name variations are allowed access. For each request made to your site, Blackhole checks the User Agent and always allows anything that contains any of the following strings:
a6-indexer, adsbot-google, ahrefsbot, aolbuild, apis-google, baidu, bingbot, bingpreview, butterfly, chrome, cloudflare, duckduckgo, embedly, facebookexternalhit, facebot, googlebot, google page speed, ia_archiver, linkedinbot, mediapartners-google, msnbot, netcraftsurvey, outbrain, pinterest, quora, rogerbot, showyoubot, slackbot, slurp, sogou, teoma, tweetmemebot, twitterbot, uptimerobot, urlresolver, vkshare, w3c_validator, wordpress, wp rocket, yandex
So any bot that reports a user agent that contains any of these strings will NOT be blocked and always will have full access to your site. To customize the list of whitelisted bots, open index.php and locate the function blackhole_whitelist(), where you will find the list of allowed bots.
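For a rough idea of how this kind of user-agent matching can work, here is a small sketch. It is not the code inside blackhole_whitelist(): the function name and signature are hypothetical, and it simply builds one case-insensitive regular expression from whatever strings you pass in.

```php
<?php
// Hypothetical sketch; not the actual blackhole_whitelist() implementation.

/**
 * Return true if $userAgent contains any of the $needles, ignoring case.
 */
function sketch_is_whitelisted(string $userAgent, array $needles): bool
{
    if ($needles === []) {
        return false; // an empty alternation would match everything
    }
    // Escape each string so characters such as "." are matched literally
    $quoted = array_map(function ($needle) {
        return preg_quote($needle, '/');
    }, $needles);

    $pattern = '/(' . implode('|', $quoted) . ')/i';
    return preg_match($pattern, $userAgent) === 1;
}

// Example usage with a few strings from the list above:
// $allowed = sketch_is_whitelisted($_SERVER['HTTP_USER_AGENT'] ?? '', ['googlebot', 'bingbot', 'slurp']);
```

Because the match is a plain substring test, a short string such as chrome matches a very large share of real browsers, which is worth keeping in mind when trimming or extending the list.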
The upside of whitelisting these user agents is that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “Hey look, I’m teh Googlebot!” and the whitelist would grant access. It is your decision where to draw the line.
With PHP, it is possible to verify the true identity of each bot, but doing so consumes significant resources and could overload the server. Avoiding that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines and other major web services.
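For anyone curious what such verification looks like, the usual technique is a reverse DNS lookup on the IP, a check that the hostname belongs to the search engine's domain, and then a forward lookup to confirm the hostname resolves back to the same IP. The sketch below is not part of the Blackhole, which skips this step for the reasons given above. The function name is hypothetical, the resolver parameters exist only so the logic can be exercised without network access, and the domain suffixes you pass in (for example googlebot.com and google.com for Googlebot) should be checked against each search engine's own documentation.

```php
<?php
// Hypothetical sketch of reverse-then-forward DNS verification.
// The Blackhole itself does not do this; each call costs DNS lookups.

/**
 * Return true if $ip reverse-resolves to a host under one of
 * $allowedSuffixes and that host forward-resolves back to $ip.
 * $reverse and $forward default to PHP's own DNS functions.
 */
function sketch_verify_bot(
    string $ip,
    array $allowedSuffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel'; // IPv4 only

    $host = $reverse($ip);
    // gethostbyaddr() returns the unmodified IP on failure, false on bad input
    if (!is_string($host) || $host === '' || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    $suffixMatches = false;
    foreach ($allowedSuffixes as $suffix) {
        // Require a leading dot so "fakegooglebot.com" cannot pass
        $suffix = '.' . strtolower(ltrim($suffix, '.'));
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffixMatches = true;
            break;
        }
    }
    if (!$suffixMatches) {
        return false;
    }

    // Forward-confirm: the claimed host must resolve back to the same IP
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

If you experiment with this, caching the result per IP address is one way to limit the extra lookups.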
License & Disclaimer
Terms of Use: Blackhole for Bad Bots is released under GNU General Public License. By downloading the Blackhole, you agree to accept full responsibility for its use. In no way shall the author be held accountable for anything that happens after the file has been downloaded.
Questions & Feedback
Questions? Comments? Send ’em via my contact form. Thanks!
Download
Here you can download the latest version of Blackhole for Bad Bots. By downloading, you agree to the terms.
240 responses to “Protect Your Site with a Blackhole for Bad Bots”
Please can you send me the htaccess for the blackhole tool as I can’t see it in the download. I would like to block the user if on the blocked list from all pages of my site until they contact me !!
Thanks in advance.
.htaccess files are hidden by default on Windows and Mac, so to view them you need to enable “Show hidden files” on your machine. More info: https://wp-mix.com/create-htaccess-files-osx-windows/
Hello Jeff,
A few days ago I sent you an e-mail about the bad bot blackhole. It works like a charm although the whois lookup didn’t work for some reason. I figured out why it didn’t work in my case. Quite simple actually, I’m on a vps server and it turned out that port 43 was blocked by my server’s firewall. After opening port 43 everything was okay and now I receive the whois lookup as intended.
Thanks for this wonderful black hole script.
your download does not work. it would be nice if you still have Blackhole for Bad Bots.
Works for me on Mac and Windows using Chrome or IE. If your browser is not working, try a different one.
hey, thanks for your message.
on the mac (chrome + safari) I am not offered a file to download.
on windows (ie + chrome) I can download a file, but it is 0kb?
Could you send me this as an email?
What do you mean, you are “not offered a file to download”, what does that mean? Do you mean the download link is not displayed? Or is there some error message? I want to resolve the download zip issue, but need more information. Thank you.
Download link doesn’t work in Firefox or Chrome. Firefox says “source file could not be read” and Chrome says “Failed – Network error”.
Fixed. Please try again, thank you.
Hi Jeff,
thank You for Your efforts and kind permission to use this gem of php-simplicity: it’s running perfectly!
Cheers!
This is a great little app – how many bad bots have you caught with it? I wonder if I can apply the htaccess rules to my virtual hosts configuration as I am trying to avoid using the .htaccess on my site? Something like this:
Actually I just realised that I could nest the scope so a better one might be this:
Thanks for sharing the code on pastebin, I went ahead replaced the code in your second comment. Not sure about your first comment though, whether or not the code is correct. Either way, it should work fine in virtual hosts/config. Definitely want to test it though just to be sure.
Hi Jeff, thanks for all your work on this neat functionality. I’m trying hard to install it on my beta site but looks like I need to provide my gmail password somewhere in order to use the smtp.gmail.com server. I don’t like the idea of posting a password in clear text within the index.php file, and my php.ini file does not seem to have a parameter for an smtp server password. This is obviously pretty basic stuff and I feel dumb for having to ask but I’m stuck. The blackhole “you have fallen into a trap” screen does display, but with a generic fail-to-connect warning regarding my smtp.gmail.com server. I’m assuming I need to provide a password somewhere…
Hi Jeff, no you do *not* need any usernames or passwords to set up Blackhole (standalone/PHP version). You do need a server that runs PHP. Then you can just follow the installation steps provided in the above article. Should be very straightforward.
Thanks for replying. I agree it should be very straightforward. As I said, it does look like the install was successful inasmuch as everything seems to be working except the email functionality. At the top of the “You have fallen into a trap” page I get the following warning: Warning: mail(): Failed to connect to mailserver at “smtp.gmail.com” port 465, verify your “SMTP” and “smtp_port” setting in php.ini or use ini_set() in C:\Apache24\htdocs\blackhole\index.php on line 135.
It sounds like something is misconfigured with the PHP mail() function. Something apparently due to SMTP settings. Best advice would be to ask your web host, they will best be able to handle any server-related issues.
I’m running the latest versions of PHP (7.4.7) and Apache web server (2.4) and I am trying to use Google’s SMTP server, which is a pretty common setup. So yeah, even though this is obviously the first you are hearing about it I do not think it will be the last time. I’ll post the fix if I find one. Thanks again for all your help.
Thanks and good luck. Let me know if I can provide any information specifically for Blackhole for Bad Bots. Glad to help however possible.
I’m running a locally installed Apache webserver ver 2.4 (the newest). I feel like I will not be the only person having this issue and you may get more people reporting it to you. Anyway, I just now changed the port from 465 (SSL) to 587 (TLS) and I got a different error message. Could it be progress? At least I got a response from the server this time. “Warning: mail(): SMTP server response: 530 5.7.0 Must issue a STARTTLS command first. q32sm10626845qte.31 – gsmtp in C:\Apache24\htdocs\blackhole\index.php on line 135”
“I feel like I will not be the only person having this issue and you may get more people reporting it to you.”
Yeah maybe after 10 years of the script being available, maybe suddenly there will be more than just this one report.
I’m not sure about the error message, you may want to dig around online to see if there is any related information, clues, etc. Hopefully there’s something out there.
Hi Jeff,
thanks a lot for all your information.
You mention two methods to get set up: 1) a plugin, and 2) a PHP script and robots.txt.
In another post you talk about using as few plugins as possible to get a safer website. Is this plugin safer than your PHP script?
And does each method have any impact on website performance?
It would be great to hear from you.
Greetings from Germany.
If you are using WordPress, the plugin is recommended because it provides more options and features. Both the PHP standalone script and plugin are equally safe/secure. Not sure about the performance difference, really depends on other factors and how you’ve got the site set up. I’ve never heard any complaints about performance issues, both plugin and PHP script are very lightweight and fast.
Hi everybody,
I just installed the free Blackhole plugin, but it looks like it’s not working.
I looked up the secret link for the bad bot crawlers, opened it and boom… nothing else happened. I could still surf my website. I got no message like “you are blocked”, etc.
I took a look at the comments here, but could not figure out the error. What could be the issue? The robots.txt exists as well… I also deleted it and just used the automatically generated virtual file… but it’s still not working. Do I have to set up something else after installing the free plugin? It’s up to date and active.
Does your site use any cache plugin? If so, check out this post, which may be useful for you.
Hi Jeff,
I am not using any cache plugin.
I just added the htaccess file from this guy.
https://seoagentur-hamburg.com/blog/4183/
He uses some caching code in his htaccess file.
Maybe you can take a closer look at the code, to see if something could be the reason for the issue.
Would be nice.
It works Jeff. I fixed the issue :-) Your plugin does a great job.
In general seems to be a good idea, but…
Although I have not tested the script yet, nor have I read this page in detail, I see some incongruence in the instructions, like // DO NOT EDIT BELOW THIS LINE despite the fact that there are then other lines to edit:
index.php (line 107): Check/replace path to your contact form
index.php (lines 130–152): Customize markup of the warning page
index.php (line 167): Customize the list of whitelisted bots
But the indicated lines are wrong. Maybe because lots of blank lines were added?
Anyway, I shall return in a couple of weeks just to see how this evolves.
Thanks for reporting. It should not affect the performance of the script, but I will take a look and try to make the code/comments more clear with the next update.