Protect Your Site with a Blackhole for Bad Bots
Posted on July 14, 2010 in Websites by Jeff Starr
One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots immediately are denied access to your site. I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.
In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.
The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. The blackhole script combines heavily modified versions of the Kloth.net script (for the bot trap) and the Network Query Tool (for the whois lookups). Refined over the years and completely revamped for this tutorial, the Blackhole consists of a single plug-&-play directory that contains the following four files:
- .htaccess – basic directory protection
- blackhole.dat – server-writable log file (serves as the blacklist)
- blackhole.php – checks requests against blacklist and blocks bad bots
- index.php – generates blackhole page, performs whois lookup, sends email, and logs data
These four files are all contained in a single directory named “blackhole”.
Installation Overview
I set things up to make implementation as easy as possible. Here are the five basic steps:
- Upload the /blackhole/ directory to your site
- Ensure writable server permissions for the blackhole.dat file
- Add a single line to the top of your pages to include the blackhole.php file
- Add a hidden link to the /blackhole/ directory in the footer of your pages
- Prohibit crawling of the /blackhole/ directory by adding a line to your robots.txt file
It’s that easy to install on your own site, but there are many ways to customize functionality. For complete instructions, jump ahead to Implementation and Configuration. For now, I think a good way to understand how it works is to check out a demo..
One-time Live Demo
I have set up a working demo of the Blackhole for this tutorial. It works exactly like the download version, but it’s configured to block you only from the demo, not from the entire site. Here’s how it works:
- First visit to the Blackhole demo loads the trap page, runs the whois lookup, and adds your IP address to the blacklist data file
- Once you’re added to the blacklist, all subsequent requests for the Blackhole demo will be denied access
So you get one chance to see how it works. Once you visit, your IP will be blocked from the demo only – you will still have full access to this tutorial (and everything else). That said, here is the demo link: Blackhole Demo. Visit once to see the Blackhole trap, and then again to observe that you’ve been blocked. If I were to include the blackhole.php in the header of my theme files, you would be banned from pretty much the entire site.
Implementation and Configuration
Here are complete instructions for implementing and configuring the Perishable Press Blackhole:
Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory. This location is not required, but it enables everything to work out of the box. To use a different location, edit the include path in Step 3.
Step 2: Change file permissions for blackhole.dat to make it writable by the server. The permission settings may vary depending on server configuration. If you are unsure about this, ask your host. Note that the blackhole script needs to be able to read and write the blackhole.dat file; as a plain data file, it does not need execute permission.
Step 3: Include the bot-check script by adding the following line to the top of your pages:
<?php include($_SERVER['DOCUMENT_ROOT'] . "/blackhole/blackhole.php"); ?>
The blackhole.php script checks the request IP against the blacklist data file. If a match is found, the request is blocked with a customizable message. See the source code for more information.
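The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual blackhole.php source: the function name blackhole_is_listed() is hypothetical, and the sketch assumes each line of blackhole.dat begins with the IP address followed by a space.

```php
<?php
// Hypothetical helper (not part of the download): returns true when $ip is
// the first space-separated field on any line of the blacklist file.
function blackhole_is_listed(string $ip, string $dataFile): bool
{
    $lines = @file($dataFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false; // unreadable file: fail open rather than block everyone
    }
    foreach ($lines as $line) {
        $fields = explode(' ', $line);
        if ($fields[0] === $ip) { // exact string match, not a pattern match
            return true;
        }
    }
    return false;
}

// Possible use at the top of a page (assumed layout, adjust the path):
// if (blackhole_is_listed($_SERVER['REMOTE_ADDR'], __DIR__ . '/blackhole.dat')) {
//     header('HTTP/1.1 403 Forbidden');
//     exit('You have been banned from this site.');
// }
```

Failing open when the file cannot be read is a design choice in this sketch; a stricter setup might log the problem instead.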
Step 4: Include a hidden link to the /blackhole/ directory in the footer of your pages:
<a style="display:none;" href="http://example.com/blackhole/" rel="nofollow">Do NOT follow this link or you will be banned from the site!</a>
This is the hidden link that bad bots will follow. It’s currently hidden with CSS, so 99% of visitors won’t ever see it. To hide the link from users without CSS, replace the anchor text with a transparent 1-pixel GIF image.
Step 5: Finally, add a Disallow directive to your site’s robots.txt file:
User-agent: *
Disallow: /blackhole/
This step is pretty important. Without the proper robots directives, all bots would fall into the Blackhole because they wouldn’t know any better. If a bot wants to crawl your site, it must obey the rules! The robots rule that we are using basically says, “All bots DO NOT visit the /blackhole/ directory or anything inside of it.” More on this in the next section..
Further customization: The previous five steps will get the Blackhole working, but the index.php requires a few modifications. Open the index.php file and make the following changes:
- Line #54: Edit the path to your site’s robots.txt file
- Line #56: Edit the path to your contact page (or email address)
- Lines #140/141: Edit email address with your own
- And in blackhole.php, edit line #53 with your contact info
These are the recommended changes, but the PHP is clean and generates valid HTML5, so feel free to modify the source code as needed. Note that beyond these four items, no other edits need to be made.
Caveat Emptor
Blocking bots is serious business. Good bots obey robots.txt rules, but there may be potentially useful bots that do not. Yahoo is the perfect example: it’s a valid search engine that sends some traffic, but sadly the Yahoo Slurp bot is too stupid to follow the rules. Since setting up the Blackhole several years ago, I’ve seen Slurp disobey robots rules hundreds of times. Bottom line: the Blackhole will block any bot that disobeys the robots.txt directives. Proceed accordingly.
Update: By default, the Blackhole no longer blocks any of the popular search engines. See the next section for more information.
Whitelisting Search Bots
Initially, the Blackhole blocked any bot that disobeyed the robots.txt directives. Unfortunately, as discussed in the comments, Googlebot, Yahoo, and other major search bots do not always obey robots rules. And while blocking Yahoo! Slurp is debatable, blocking Google, MSN/Bing, et al would just be dumb. Thus, the Blackhole now “whitelists” any user agent identifying as any of the following:
- googlebot (Google)
- msnbot (MSN/Bing)
- yandex (Yandex)
- teoma (Ask)
- slurp (Yahoo)
Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “hey look, I’m teh Googlebot!” and the whitelist would grant access. It is possible to verify the true identity of each bot, but as X3M explains in the comments, doing so consumes significant resources and could overload the server. Avoiding that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines.
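As a rough sketch of what such a user-agent whitelist might look like (the function name is hypothetical and the actual pattern in the download may differ), a case-insensitive match against the five names listed above is enough:

```php
<?php
// Hypothetical helper: true when the user agent claims to be one of the
// whitelisted search bots. The names come from the list in this section.
function blackhole_is_whitelisted(string $userAgent): bool
{
    return preg_match('/googlebot|msnbot|yandex|teoma|slurp/i', $userAgent) === 1;
}
```

Because this only inspects a string the client controls, it inherits the spoofing weakness described above. Also note that a newer name such as bingbot would not match this particular pattern.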
License and Disclaimer
The Perishable Press Blackhole is released under GNU General Public License. Check the Creative Commons for a summary and/or see the Blackhole source code for additional information. Also note that by downloading the Blackhole, you agree to accept full responsibility for its use. In no way shall the author be held accountable for anything that happens after the file has been downloaded.
Blackhole Download
Here you can download the current version of the Blackhole:
Perishable Press Blackhole for Bad Bots
[ version 1.2 | .zip format | 5K | 2567 downloads ]
Previous Versions
- Blackhole v1.0 [ version 1.0 | .zip format | 5K | 2309 downloads ]
- Blackhole v1.1 [ version 1.1 | .zip format | 5K | 829 downloads ]
According to reverse DNS lookup, Google bots seem to ignore robots.txt sometimes. By using this solution you are at risk to ban the Google bot from your site.
I’m running a WordPress site and when I tried to install right out of the box I got a PHP failed to open file error/warning. I had to modify the path in the blackhole.php file at line 37 to the absolute path of the .dat file. Not sure if I did something wrong, or if this is due to the WP installation, or what, but it worked after I did that. Also, once an IP is banned, it gives the notice and says to contact @perishable to work things out. So I had to modify line 56 of the blackhole.php file.
Well, I just had a quick look at the code…
Consider the following situation:
$_SERVER['REMOTE_ADDR'] is 121.1.10.15, the IP in blackhole.dat is 1.1.1.1.
What will ereg('1.1.1.1', '121.1.10.15') return? Yes, it will return 1. Just because the dot matches one single character, “1.1” will match 121, but this is not what you want. Likewise, if REMOTE_ADDR is 1.1.1.10, it will still be blocked by 1.1.1.1.
Not to mention that ereg() is deprecated. If you compare IP addresses, why not just use string comparison? E.g.,
if ($u[0] == $_SERVER['REMOTE_ADDR']) ++$badbot;
Hope this makes sense.
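The pitfall is easy to reproduce. Since ereg() was removed in PHP 7, this sketch uses preg_match() to show the same unescaped-dot behavior next to a plain string comparison; both function names are made up for the demonstration.

```php
<?php
// Treats the listed IP as a regular expression, so each '.' matches any
// single character and the pattern may match anywhere inside $ip.
function matches_as_pattern(string $listed, string $ip): bool
{
    return preg_match('/' . $listed . '/', $ip) === 1;
}

// Plain string comparison: only an identical address matches.
function matches_exactly(string $listed, string $ip): bool
{
    return $listed === $ip;
}

var_dump(matches_as_pattern('1.1.1.1', '121.1.10.15')); // bool(true): false positive
var_dump(matches_as_pattern('1.1.1.1', '1.1.1.10'));    // bool(true): false positive
var_dump(matches_exactly('1.1.1.1', '121.1.10.15'));    // bool(false)
var_dump(matches_exactly('1.1.1.1', '1.1.1.10'));       // bool(false)
```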
WOW that awesome! Another great tutorial/system that i need to install on my site.
THANK YOU!
@X3M: That’s definitely a better way to do it. The code has been updated with the string comparison code. Thank you!
@Gabe: Some servers may require absolute paths. Note that there is also an instance of the blackhole.dat path/name in the index.php file. If one is changed to absolute path, both should. And thanks for the reminder about the @perishable contact thing.. I’ll be adding that step to the article promptly.
@B. Moore: My pleasure :)
@Jeff - Cool, thanks!
My own bad-bot-blocker script catches some GoogleBots. It’s very annoying that they ignore the robots.txt and rel="nofollow" commands. But be careful. Some bad bots out there spoof their user-agent, so you can’t simply automatically allow any user-agent of GoogleBot access to your site.
I also have the referer sent to me in my alerts.
@Michael – you can.
$ptr = gethostbyaddr($_SERVER['REMOTE_ADDR']);
if ('.googlebot.com' == substr($ptr, -14)) {
    // This is GoogleBot
}
100% accuracy is not guaranteed though – use at your own risk :-)
@X3M: I am thinking that checking for googlebot would be another improvement for the script. Maybe as a condition for
if ($u[0] == $_SERVER['REMOTE_ADDR']) ++$badbot;
to execute?
@Jeff - yes, it could be.
You can also detect Yahoo! Slurp, MSNbot (or how they call it now) etc.
PTR record for Yahoo! ends with .crawl.yahoo.net, for MSN/Bing - with .search.msn.com, Yandex - with yandex.ru etc.
However, if you have a high traffic website you must be aware that calls to gethostbyaddr() can be rather expensive - you can easily overload your ISP’s DNS server. You will probably need to have a local caching DNS server installed. If DNS server gets overloaded/goes down, page loading will freeze until connection timeout occurs in gethostbyaddr().
So if you want to include this feature, I would suggest a configuration option to turn it on or off.
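A slightly stricter variant of the PTR check adds a forward lookup to confirm that the hostname resolves back to the same IP. The sketch below is only an illustration under stated assumptions: verify_crawler_host() is a hypothetical name, the resolver arguments exist so the DNS calls can be swapped out or cached, and the default forward lookup (gethostbynamel) covers IPv4 only.

```php
<?php
// Hypothetical helper: reverse lookup, suffix check, then forward confirmation.
// $reverse and $forward default to PHP's own DNS functions but can be replaced
// (for caching, or to disable live lookups on busy sites).
function verify_crawler_host(
    string $ip,
    array $suffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    if ($host === false || $host === $ip) {
        return false; // malformed input or no PTR record
    }
    $host = strtolower($host);

    $suffixOk = false;
    foreach ($suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffixOk = true;
            break;
        }
    }
    if (!$suffixOk) {
        return false;
    }

    // Forward-confirm: the claimed hostname must resolve back to this IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

// Example call using suffixes mentioned in the comments above:
// verify_crawler_host($_SERVER['REMOTE_ADDR'], ['.googlebot.com', '.crawl.yahoo.net']);
```

As noted above, each live lookup costs DNS round trips, so a real deployment would probably want to cache results or make the check optional.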
Thanks Jeff. Great idea. I definitely want to test this out.
I’ve used a trap like this (the one from kloth.net I think) for years now on my nsfw wallpaper site and it has worked wonders - Not only does it stop strange bots but also site rippers and all kinds of mass downloaders! :D
Never had any trouble with google or yahoo falling into the trap either!
I will definitely try this version of the trap soon. Thanks Jeff!
Update: Thanks to help from X3M, the Blackhole now whitelists the major search engines: Googlebot, Slurp, msnbot, Teoma, Yandex. Please see this section in the article for more information. If you are using a version less than 1.2, it is recommended to update.
I’m not an expert like you and I can’t contribute anything, just I can say thank you. I will test it.
I’m using a rather similar system to block access to some parts of my site, esp. to the download section for my project history. That specifically is this way because I do not want to get my complete contact data indexed by some spam bot, Google or anything else.
A short suggestion to improve your system: Rename the DAT-Logfile to .dat.php to avoid getting it read from the outside, because there are lots of scenarios where you simply CANNOT put this someplace under the web root and/or not being able to properly set the access rights.
cu, w0lf.
Hi Jeff, that’s an interesting idea, but I do foresee 2 problems:
- Some browsers/addons prefetch links on the page (f.e. the Firefox: fasterfox addon)
- Competitors could make their visitors visit your blackbox; for example by including an image pointed to ‘http://site/blackbox/’ in their HTML, thereby banning them from your site.
I think the best solution would be to split the trap into two pages. The /blackbox/ page doesn’t ban the user but links to another page that does. The URL of that page could depend on the IP of the user, for example “http://site/blackbox/?key=”+MD5(ip+”secret”). That way, there would be no way of hotlinking your ban page and prefetching is allowed 1-page in advance. :)
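The per-visitor key idea might be sketched like this. Note the swap: instead of the MD5(ip+secret) suggested above, this version uses an HMAC-SHA256 via PHP's hash_hmac(), which serves the same purpose; the function names and the secret value are placeholders.

```php
<?php
// Hypothetical helpers for a two-step trap: the first page links to the ban
// page with a key derived from the visitor's IP and a server-side secret.
function trap_key(string $ip, string $secret): string
{
    return hash_hmac('sha256', $ip, $secret);
}

// The ban page only acts when the key matches the requesting IP, so a link
// hotlinked from another site (with someone else's key) does nothing.
function trap_key_is_valid(string $ip, string $secret, string $key): bool
{
    return hash_equals(trap_key($ip, $secret), $key);
}

// Example (placeholder secret and URL layout):
// $link = '/blackbox/?key=' . trap_key($_SERVER['REMOTE_ADDR'], 'change-me');
```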
Another interesting solution might be to automatically unban anyone with Javascript enabled, since most bots don’t use Javascript.
Thanks for the script!
Very useful ! I was thinking about something to block bad robots and stumble on it ! Wunderbar !
Thanks
Thanks Jeff for this awesome trap. I used to block bad bots manually (via .htaccess) and when I hooked up Blackhole on a test server it worked like nuts. I got to implement it on several websites under my belt, soon.
In version 1.2 why are there already 55 lines of IP addresses and other details in the blackhole.dat file?
I recommend renaming the blackhole.dat file to be .htblackhole.dat. Most Apache servers will not allow anyone to download any file that starts with .ht. After renaming the file, you need to edit line 37 in blackhole.php and line 127 in index.php.
You should also add the exclusion command to your robots.txt file, then activate the actual blackhole several days later. Search engines do not check the robots.txt file every time they visit your site. Most cache the instructions for anywhere from 1 to 7 days.
Final recommendation is to change the name of the directory to something innocuous. Bad bots may try to avoid directory names like “blackhole.”
And you also need to tweak the footer of index.php to report the current version number, it still says 1.1.
I don’t know much about this, but I’m curious: does publishing a technique like this make it more vulnerable to being beaten/worked around by spammers? I was wondering recently about honeypots on contact forms, which I guess can now be beaten by spammers. Was too much published about honeypots, or was it too simple a technique to keep spammers at bay for long?
I do not think that it is safe to hide text as you recommend using this code: Do NOT follow this link or you will be banned from the site!
The risk is that “display:none” for text can trigger spam filters especially with Google. Even if you did not have any evil intentions.
I would rather use javascript in combination with a “noscript” tag, or instead of using an anchor text, I would use a linked transparent image.
Another thing I would like to mention is, that using the “nofollow” attribute for internal links is not advisable. With such practices you can dilute PageRank.
So for this case, I would recommend implementing in the index.php file the robots meta tag directives “noindex,nofollow,noarchive”.
In that case the major search engines will still access the page, crawl it but will not index it.
But! The PageRank will still flow to the pages which are linked from that page. If they are external site links, there you can block passing PageRank with the rel=”nofollow” attribute. But you must make sure that you have at least one link that the PR must pass to, i.e to the homepage of your site, otherwise you will have again a PageRank dilution, because you have created a so known as dangling or dead end page.
I hope you will update above and if not, I will take care and modify all that before I implement.
By the way great job Jeff!
Hi Jeff, thank you very much for this great little plugin!
I’ve just installed it on my site and found out that my site was being crawled by “Baiduspider”. It appears to be a search engine from China, and it’s disregarding the nofollow rule and the robots.txt.
Does anyone have any experience with this spider?
Do you think this might be a harmless bot?
I just do not like the fact that it’s disregarding the rules.
Here is some info I’ve got:
IP Address: 123.125.66.22
User Agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)
I came across this topic, years ago right here:
http://board.protecus.de/t21590.htm
Several hints:
# A lot of bots put the googlebot string into their user agent - that’s why people think that your script blocks the googlebot. Usually it doesn’t need to be whitelisted.
# Checking only the user agent is bad in this case. Google explains how to verify their bot correctly over there: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
# The best way to install this script is to add it to the robots.txt + wait a day to add the link somewhere in a hidden div.
Laserpointa from Protecus Forum
This is awesome! I’m experimenting around with it, turning it into an automated WP plugin, and I’ve managed to ban myself about 5 times from my site! lol
Doesn’t seem to be working for me. After installation I’m getting no error, but in attempting to get myself banned by repeatedly visiting the banned page, nothing happens, and the dat file never gets written to. I’ve changed permissions to allow full access to everything (just to test), and still nothing gets written.
I receive the “Bad Bot” email, the “bad bot” page displays fine, and there are no errors reporting in my error log.
Any ideas?
Figured it out.. nm.
Love it.
Now, can you make a black hole for all the spam posters to my blog. The ones that post…really love the way you write and your content is so good, never though along these lines before but you enlightened me….on my art blog with only pictures of my art or on my about page.
I would love to develop a list and block them from ever even trying to post.
Thanks again…
is there any “dynamic” for this instead of manually listing and typing into the file?
Is it ok that the .dat file has got some IPs inside? And what about the concerns of prefetching and search engines antispam features?
Hi,
found your Site via stumble upon. very nice and informative.
Keep up the good work. It has really touched me.
Greetings
Philippe
Maybe take a look at http://www.spider-trap.de/en_index.html
It’s pretty amusing to read all the discussion about how to avoid banning GoogleBot. If they’re crawling a page that you’ve explicitly requested they ignore, then you should treat them like every other crawler. I don’t see the point of breaking your own rule in this case.
Eric, that’s the 2,000-pound gorilla in the room. Why bother whitelisting any search-engine? If they break the rules, ban them. Right? Unfortunately, Google owns the Web, so they pretty much decide what it is exactly that they will and won’t do. Sucks, but true.
If you read the url Tom has posted, you will notice that Google simply wants to be called explicitly and not via the wildcard. That way it obeys the robots.txt, according to the author of that site. :)
So, seems to work, but seeing a couple errors in my apache logs.
[Thu Jul 22 13:23:01 2010] [error] [client 119.63.198.97] PHP Notice: Undefined variable: buffer in /var/www/htdocs/blackhole/index.php on line 78
[Thu Jul 22 13:23:01 2010] [error] [client 119.63.198.97] PHP Notice: Undefined variable: extra in /var/www/htdocs/blackhole/index.php on line 98
Also seeing occasionally:
[Thu Jul 22 13:56:21 2010] [error] [client 64.40.121.187] PHP Notice: Undefined variable: nextServer in /var/www/htdocs/blackhole/index.php on line 91
The first one looks like it’s just because you’re trying to append to a variable that isn’t defined in the first place…
The second… is just because extra isn’t being defined all the time.
The last one… just needs to be something more like if(isset($nextServer)) { because if returns an error on unset variables. if($variable) is not kosher for a while now.
Getting a weird return from the ARIN WHOIS lookup.
I’ve been through it several times now, and tried several fixes, but ARIN continues to find an ‘n’ character (presumably from a ‘\n’) on the front of the IP. Can’t pinpoint where it’s getting that, or how to fix it.
Frank, I’m also trying to resolve that issue, but so far without success. I’m not sure if there is anything that can be done from within the script.
If anyone has further info on this it would be appreciated! :)
ARIN Changed its protocol for directory lookups which is why you’re seeing that weird message. They recommend changing to their new RESTful protocol.
You can read more about it here:
https://www.arin.net/resources/whoisrws/index.html
In my WP plugin version of this script (in process) I’m looking into switching to the RESTful query protocol they recommend and styling the returned XML.
@RS: The two PHP Notices you’re getting are an easy fix. Make a new line after
global $msg, $target;
and add the following 2 lines:
$buffer = '';
$nextServer = false;
@ Jeff and darrinb: Thanks for the backup. Based on my tests I was pretty certain the problem existed outside the Blackhole code.
Okay this might sound stupid to most of you, but why should I ban bad-bots? Obviously they access content I don’t want them to access, but banning them from the site would have pretty bad consequences (google).
I know nothing about this topic, but wouldn’t it be a good idea to blacklist the bots who went into the trap, not from the domain, instead update the .htaccess file to block them only from the pages I don’t want those bots to crawl?
Well I really don’t know a thing about .htaccess or blocking bots, so this might be a rather silly idea.
One reason: bandwidth.
@ノートパソコン比較:
Great questions.
In addition to bandwidth, blocking bad bots helps conserve server resources, which are a commodity on non-shared environments. Also, many bad bots are malicious, so blocking them also improves the security of your site, which benefits everyone.
Custom blocking via htaccess is also a good idea. There are many (many) articles here at Perishable Press on the topic of using htaccess to protect your site (including blocking bad bots).
You are writing that this script should be on top of the pages
So if I put this script on my header.php that will be correct?
Or do I need to put it on “single.php, page.php, archive.php etc etc”?
Just a little confused here…
Thanks
Soren
@Soren: Yeah, the include snippet just needs to be placed at the top of your header.php file. Then, because that file is included with each page view, the Blackhole script covers your entire site.
I have had success putting a small form in an HTML comment with the action attribute pointing to the honeypot. Fields are named with attractive words like email, post, blog and message. Since Google actually parses the HTML, it will not see the HTML comment and will not follow it.
Another is to use an if statement to check whether a request accepts GZIP in Accept-Encoding. Since bots almost never have access to the compression library, they will not accept GZIP. Then put in the form if GZIP is not accepted, and use the POST method. Google and other search engines never follow POST, since if they did, they would themselves be spambots.
I use a mix of these two methods; it works great and catches almost all spambots as well as harvesters. My blacklist is constantly around 30-50 malicious bots. When blocking, use 404 or (eventually) 410 status errors; do not indicate you have accepted the request.
The blacklist the honeypot creates is dynamic in my case, since I do not want an IP banned for life or the blacklist to grow too long. Generally, unban an IP if it has not made a request for some time and keep the list to a maximum of around 70 IPs. My experience is that this is enough even for large attacks.
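A dynamic blacklist along these lines could be pruned with something like the sketch below. It is only an illustration: prune_blacklist() is a hypothetical name, and it assumes the list is held as an array mapping each IP to the Unix timestamp of its last request, which is not how blackhole.dat is described in the article.

```php
<?php
// Hypothetical helper: drops IPs that have been quiet for longer than
// $maxAge seconds, then keeps at most $maxEntries of the most recent ones.
function prune_blacklist(array $entries, int $now, int $maxAge, int $maxEntries): array
{
    // Remove entries whose last request is too old.
    $fresh = array_filter($entries, function ($lastSeen) use ($now, $maxAge) {
        return ($now - $lastSeen) <= $maxAge;
    });

    // Sort newest first (keys are preserved) and cap the list length.
    arsort($fresh);
    return array_slice($fresh, 0, $maxEntries, true);
}

// Example: unban after a week of silence, keep at most 70 IPs.
// $list = prune_blacklist($list, time(), 7 * 24 * 3600, 70);
```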
Some additional tips that might or might not be useful. These are more advanced and focused on blog spambots more than the blackhole, but they are quite effective, well-known techniques:
You might want to use HTTP header information to make a “fingerprint” (e.g., an MD5 checksum) of the request. The reason is that a lot of spambots only post to the page without getting it first. They get their information about your page from harvester bots, which are scraping your site. And a lot of the time, the headers are not the same between bots.
I use user agent, accept-encoding, accept-language, accept-charset, connection and protocol for the value. To make it page-dependent as well, I put in the URL.
Use of a stardate (look it up on Google) is effective to simulate a session without using cookies, so that you have to post within (say, for example) 4 hours of getting a page. Combine it with the fingerprint, put it in a hidden HTML input field in the form, and recalculate at post time to verify that the fingerprints match and that stardatePOST-stardateGET<timelimit.
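One way the fingerprint-plus-timestamp idea could look in PHP is sketched below. The function names are hypothetical, a plain Unix timestamp stands in for the stardate, and the HMAC with a server-side secret is an added assumption so the hidden field cannot simply be forged by the client.

```php
<?php
// Hypothetical helper: MD5 fingerprint of selected request headers plus the URL.
function request_fingerprint(array $server, string $url): string
{
    $keys = ['HTTP_USER_AGENT', 'HTTP_ACCEPT_ENCODING', 'HTTP_ACCEPT_LANGUAGE',
             'HTTP_ACCEPT_CHARSET', 'HTTP_CONNECTION', 'SERVER_PROTOCOL'];
    $parts = [];
    foreach ($keys as $key) {
        $parts[] = $server[$key] ?? '';
    }
    $parts[] = $url;
    return md5(implode('|', $parts));
}

// Token for the hidden form field: "<issuedAt>:<signature>".
function make_form_token(array $server, string $url, int $issuedAt, string $secret): string
{
    $fingerprint = request_fingerprint($server, $url);
    return $issuedAt . ':' . hash_hmac('sha256', $fingerprint . '|' . $issuedAt, $secret);
}

// At POST time: recompute from the current headers and check the time limit.
function form_token_is_valid(
    string $token, array $server, string $url, int $now, int $limit, string $secret
): bool {
    $pieces = explode(':', $token, 2);
    if (count($pieces) !== 2 || !ctype_digit($pieces[0])) {
        return false;
    }
    $issuedAt = (int) $pieces[0];
    if ($now < $issuedAt || ($now - $issuedAt) >= $limit) {
        return false; // outside the allowed window
    }
    return hash_equals(make_form_token($server, $url, $issuedAt, $secret), $token);
}
```

A bot that posts with different headers than the harvester that fetched the page would produce a different fingerprint and fail the check.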
Thank you! Here is how I implemented it into my MODx site: http://modxcms.com/forums/index.php/topic,40576.msg307614.html#msg307614
I’ve been getting a lot of spam at my wordpress blog at http://stilen.net
So I installed the SI CAPTCHA Anti-Spam plugin. Fine, no more spam comments, but still, the spam bots ruin my web statistics!
Are there no plugins available for a more easy install of the black hole?
@Boyd: I’m actually working on a plugin version for WP, and hope to be done soon. I just started a new project, so I’ve been slammed, but I’m hoping to wrap up the plugin within the next week or so.
darrinb: fantastic. I’m looking forward to it.
Have you ever considered blocking IPs temporarily instead of a permanent block via .htaccess?
I ask this based on an assumption that malicious robots are more likely spoofing the IPs which are scanning through a website. If this is the case, an immediate but temporary ban is most appropriate so that the likelihood of blocking a once malicious but eventually legitimate IP doesn’t stop an honest visitor from seeing your site.
I do see some spoofed IPs roll through every now and then, but in my experience most IPs are not spoofed. I agree that there’s no reason to block most IPs permanently – my own policy is around a year or so. Also, I haven’t done the math, but even with a blacklist of 100,000 IPs, you’ve got a better chance of winning the lottery than actually blocking a legitimate user.
Hi Jeff,
I have just implemented it on my website, but I am still able to access my website after I clicked on the link (www.gudipudi.com/blackhole/) for testing purposes.
Could you please help me?
Thank you
gudipudi
I’ve always liked using Spam Poison; it effectively blackholes bots for you, without having to set it up on your own webserver: http://www.spampoison.com/
Hi Jeff,
I’ve tried to implement this into my site but it doesn’t seem to write to the dat file when I click through to the trap page. Thus I’m still able to access the site even though a bad bot has been detected and the email sent.
The write permissions have been set to 766 for the blackhole.dat.
Where have I gone wrong?
So I read through all the comments and finally got around to implementing this. I made a few modifications:
- the hidden footer link for the trap uses left: -999 instead of display:none (not sure if this will make a difference Google-indexing-wise)
- the link is rel nofollow and goes to a page which is noindex nofollow noarchive in the meta robots tag
- this page has a link to the trap
- there is a hash on the link of the IP and a secret word, which is verified when the bot goes through to the trap. This is to thwart the scenario someone suggested, where a competitor links to your bot trap with a hidden image and essentially bans their visitors from going to your site
The trap went live yesterday and already caught three, one of them being msnbot which looks legit according to the whois info. Surprised that after a robots file disallowing indexing of this directory, a link saying not to follow, a page saying not to index or follow, and major search engine bots still fail to play by the rules.
Skye, that sounds great… Can you post some code of your implementation?
Posted here: http://www.damnsemicolon.com/demos/robot-trap/robotsfunhouse.zip
Please let me know if you have any feedback.
Hi,
Thanks for your script! Just change the image http://perishablepress.com/press/wp-content/images/2010/blackhole/blackhole.gif
The file index.php doesn’t appear in /blackhole/
index.php
/blackhole/
blackhole.php
blackhole.dat
.htaccess
Thanks again
I’ve just installed it on my site,this is awesome! I’m experimenting around with it.
RE: #38/#39:
In index.php, changing line 76 from:
fputs($sock, "$target\n");
to:
fputs($sock, "n $target\n");
This change removed the “Query terms are ambiguous. The query is assumed to be: “n xxx.xxx.xxx.xxx” Use “?” to get help.” message.
@Erik Rubright: Thank you! I’m looking forward to trying your solution. I’ll be updating the code next opportunity.
I think version 1.2 wouldn’t go properly in the site root as stated in the article… Because there are the index.php and the dat file outside the folder. I’ll try a previous version then. :)
@ Skye, I don’t know why but it can’t open the file…
Maybe I got it… calling the include *from* a subdirectory causes a problem with the fopen of the dat file because its path is not properly referred, so PHP assumes the wrong working directory. I haven’t got time to try again now, but probably will try tomorrow.
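If the problem really is a relative path being resolved against the including page's working directory, one common workaround is to build the path from the script's own location with __DIR__. The helper below is a hypothetical sketch, not code from the download:

```php
<?php
// Hypothetical helper: builds an absolute path to the data file from the
// directory of the script that owns it, independent of the including page.
function blackhole_data_path(string $scriptDir, string $fileName = 'blackhole.dat'): string
{
    return rtrim($scriptDir, '/\\') . DIRECTORY_SEPARATOR . $fileName;
}

// Inside blackhole.php (or index.php), assuming the .dat file sits beside it:
// $fp = fopen(blackhole_data_path(__DIR__), 'r');
```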
Ok so I’ve been running this for about a month now and almost all the bots caught are agent “Java/1.6.0_21” with different version numbers. Most are from Europe as well.
Anyone know what these are?
Is there a Java scraping bot that everyone uses?
Java bots hyped up on coffee trying to take over the internets??
@ Jeff and Skye, please can you help me to fix the bug? I’m not very familiar with calling scripts via server root & so on… Thank you very much.
@ Lazza, what does your file structure look like?
Are you talking about when you include the blackhole.php in your header?
If the executing script is inside a subdirectory use ../ to go up a directory.
i.e.
include_once('../blackhole.php');
or
include_once('../blackhole/blackhole.php');
This all depends on your file structure. Please provide more details if you still need help.
Yo!
Excellent Jeff.
I’m gonna try it as soon as I understand where to place the index file… ;-)
but I have a question:
msnbot became bingbot as of the 1st of October (am I wrong?)
But I don’t see this name in the preg_match line of the code.
So should we add it?
@ Skye, yes I know, I call the script with document root so it includes correctly. The problem is that the blackhole does a “fopen” but can’t open the dat file. This file has the correct path and permissions, so I assume it depends on the relative path the script uses to open the file, but I’m not sure about that. :)
JAVA in the first part of the user agent means harvester bots.
You can check both the user agent and the Accept-Encoding header, which in all cases does NOT include GZIP (since harvester bots apparently have to be cheap, nobody bothers to program them properly, which means they lack the compression library).
libwww/perl are injectors (cross-site scripting, SQL injections and the like). They follow the same scheme and do not accept GZIP either (most of them, but be aware that a client does not have to accept GZIP just to send a URL).
Most spambots use IE6.0 in the user agent; they do not accept GZIP either.
The cleverest spambots I have seen use a non-IE user agent, accept GZIP, and are hosted not on infected computers (like the others mentioned above) but on dedicated servers, which means they have much more power. There are still very few of them, but I expect a rise in the future. Those bots come mainly from Ukraine and Israel.
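For anyone who wants to try the check Rune describes, here is a minimal sketch, assuming PHP 7 or later. The function name `looks_like_harvester` and the exact patterns are made up for illustration and are not part of the Blackhole script; the pattern accepts both "libwww/perl" as written above and "libwww-perl" as the agent usually identifies itself.

```php
<?php
// Hypothetical helper (not part of the Blackhole package): flags a request
// whose user agent starts with "Java" or "libwww-perl" AND which does not
// advertise gzip support, following the heuristic described in the comment.
function looks_like_harvester(string $userAgent, string $acceptEncoding): bool
{
    // Matches e.g. "Java/1.6.0_21", "libwww-perl/5.805" or "libwww/perl".
    $suspectAgent = preg_match('/^(Java|libwww[-\/]perl)\b/i', $userAgent) === 1;

    // Case-insensitive search for "gzip" anywhere in the Accept-Encoding value.
    $acceptsGzip = stripos($acceptEncoding, 'gzip') !== false;

    return $suspectAgent && !$acceptsGzip;
}
```

In a live page the two arguments would come from `$_SERVER['HTTP_USER_AGENT']` and `$_SERVER['HTTP_ACCEPT_ENCODING']`, both of which may be missing. Since either header is trivial to forge, this is probably better used for logging than for outright banning.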
@ Rune. Good to know. Thanks.
@Philippe: Great question. I need to look into this and update my scripts. For now, if you do add it, keep a close eye on things to make sure everything is working as expected. I think it’s definitely a good idea to allow access to bingbot/msnbot/live/whatever-it’s-called-these-days.
And its user agent has changed too:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
I will install the BH pretty soon and will tell if everything is okay.
Need a little help for installation since there are a few things I don’t dig.
1/ Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory
————
There’s the BH index file in it.
As (I suppose) we all already have an index.php file at the root, a sub-directory is needed.
So let’s go for it. I’ve called it “howl” :-D and I’ve the whole BH in it.
And it gives:
root/howl/index.php
2/ Step 4: Include a hidden link to the /blackhole/ directory in the footer of your pages:
<a style="display:none;" href="http://example.com/blackhole/" ...
————
There’s an .htaccess denying access to this directory…
So shouldn’t the link point (in my case) to:
http://mysite.com/howl/index.php???
I should be an idiot since I don’t get it at all.
Finally installed but,
I have access to the warning page “You have fallen into a trap!”
but I can still access the website… no “You have been banned from this domain” warning!
A question: how does it work to block the IP?
@Philippe: The files are already in a directory of their own. Take another look at the article. There is a screenshot of the files contained within their directory. That is exactly how the files are zipped. Unzip and that is exactly what you should get.
@ Jeff, sorry if I take part in the conversation: no, they’re not in a folder! :) That’s why I downloaded a previous version…
@Lazza: Thanks for letting me know about this - all files should definitely be in the blackhole directory. I have no idea how two of them ended up outside the directory during the zip process, but it’s fixed now and should be unzipping properly.
@Philippe: Apologies for my previous comment - I now understand where the confusion was coming from and it is my fault entirely. Everything should now download and unzip as advertised.
@ Philippe, you need to include the blackhole.php in your header for it to start blocking entries in the dat file.
The zip file unzipped fine for me, creating the blackhole folder.
There’s a post on my blog on the set up of my implementation.
@ Jeff, you’re welcome. Anyway it still doesn’t work. Here is my detailed situation: Wordpress site on root directory, BBpress on /forum. I have the /blackhole/ in the server root, the folder has 755 permissions and the .dat file too. If I visit the trap I receive an email and my entry is added to the .dat file. If I include the blackhole.php in the header of my BBpress theme I get “error opening file…”. That means the include is correct but it can’t find the .dat. :-(
Oh oh… I’ve sent a previous comment but it wasn’t sent.
Hello dudes
@ Jeff
No problem. We all make mistakes, me first of all.
Now I’ve changed the location. Blackhole directory is now at the root of the site.
I’ve followed Skye instructions and modified the script regarding the code in his package. Thanks Mr Skye!
But it still doesn’t work.
And I don’t get why.
See the 1st link in this page :
http://normandie-web.hiseo.fr/blackhole/
The IP is well recorded in the dat file.
I’m receiving the email.
But I can still log in to my website.
I’m wondering if this issue is caused by the host (1and1 is the provider).
???
Uh! I realise now I should not have posted the link… :-(
Just to add another reason to @44 - most email harvesting bots don’t adhere to the Robot Guidelines. There are numerous other projects, including the awesome Project Honeynet, that deal with spam harvesters, but your plugin will probably detect them as a positive side effect.
Only thing I’m wary about: What about performance as soon as your blacklist grows to a couple 100K entries? I’m definitely going to check that out on one of our bigger domains… ;)
Ok maybe I am missing something but instead of
User-agent: *
Disallow: /*/blackhole/*
shouldn’t it be:
User-agent: *
Disallow: /blackhole/
or???
Also.. by using akismet you are losing lots of legit comments to your blog, but that’s another story.
What??? My Akismet has an accuracy rate of 99.85%…
what I mean is that it takes a few people that don’t like you, they go spamming a few blogs using *your domain* and your domain will get banned on the akismet network.
This has been happening more and more.
that’s what they did to me on a totally white hat site I have, I just realized I can’t even mention it on here else I can’t post!!
But I don’t care anyway as the site is well positioned on the SE, still it sucks that they can do this to you.
Your robots.txt is malformed. Robots.txt specification doesn’t allow wildcard characters (“*”), so actually even good bots will go to your blackhole, since you aren’t properly ordering them not to.
Google for “check robots.txt” and/or read robotstxt.org specifications.
Your own robots.txt is also malformed and what’s even worse - uses pseudo-regular-expressions in mix with wildcard characters. None of those are in robots.txt specs.
Pretty much the ONLY proper Allow/Disallow lines in your robots.txt are:
Disallow: /transfer/
and
Disallow: /
(adding to previous post #91) oh and the /mint/ and /labs/ lines are correct too, the rest is malformed and ignored by the robots.txt standard.
@ Chris. Good point. It would be faster in a db at that point.
I can’t be up to more than 30 entries right now for a site that gets 50-100k visitors a month. My average seems to be around 5 a week.
I’d also like to test that with a bigger file vs a db. That loop would get pretty slow being O(n) run time.
@mark: Akismet blocks comment spam; the Blackhole blocks bad bots.
@Slava: Yes, you can change your robots rules to whatever you want. You can’t change mine though. That’s for me to worry about.
Oumpf!
Can someone check in Google Webmaster Tool - lab - performance… ?
Despite the rel nofollow and the .htaccess deny the URI is present!!!
@Philippe: Do you have a screenshot?
Hi Jeff
Here it is: http://twitpic.com/33ljzu/full
In the screenshot you gonna see the URI: /blackhole/process.php
since it was Skye’s solution I’ve prefered. ;-)
I remember some French SEO experts ran a few tests a few months ago: Googlebot seemed not to obey nofollow links inside a website, nor the instructions in the robots.txt file.
(Above in my comment, I wanted to say (instead of .htaccess): in my robots.txt Disallow: /*/blackhole/*… of course ;-))
Still no problems with mine and it’s not indexed with google.
@ Philippe, well you did post the blackhole link up top so it could have been crawled that way. Change the dir name, remove the hidden link and update your robots.txt. Give it a week for the bots to update their copy of your robots.txt then put the hidden link back in. After another week see who you’ve caught. If it looks good put the filter code in your header and let it go live.
“One or more wildcard (*) characters can now be in a URL path, but may not be recognized by older robot crawlers” http://www.searchtools.com/robots/robots-txt.html
@ Skye and Jeff
the explanation might be there:
http://perishablepress.com/press/2010/07/14/blackhole-bad-bots/#comment-80814
;-)
or,
using Firefox with Google Toolbar to test the blackhole I’ve sent the speed measurement to the GWT???
Can anyone help me with my previous comment? :)
Thank you very much in advance…
@ Lazza, what’s your include look like in forum/? Did you change any of the code in blackhole.php where it opens the file? As long as the dat file is in the same directory it shouldn’t have a problem opening it.
@ Skye, my include was copy-pasted from this article. I didn’t change anything except my email and twitter info… It says an error when opening the file but I don’t know how to tell to the php script to be more verbose.
I couldn’t get this to ban me until I changed the call for blackhole.dat in both files to be like this:
$filename = "/home/xxxxxxx/public_html/blackhole/blackhole.dat"; // scan to prevent duplicates
Also used the advice above from Erik Rubright (#63) to prevent the arin WHOIS lookup errors. Everything works great now. Thanks a lot.
Kate, thank you very much for the $filename tip! I would suggest using:
$filename = getenv("DOCUMENT_ROOT")."/blackhole/blackhole.dat";
I think this doesn’t work for Windows hosting, but who is so dumb to use Windows hosting with PHP? :D
I’ll try again to get the thing to work now. :)
Thanks, Lazza, for the more secure code.
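A note for anyone hitting the same "Error opening file" message: a relative path passed to fopen() is resolved against the current working directory, which for a web request is typically the directory of the page being served, not the directory of the included file. Besides the two fixes above, another option is to anchor the path to the script's own location with `__DIR__` (available since PHP 5.3). The sketch below assumes PHP 7 or later and that blackhole.dat sits next to blackhole.php; the helper name `blackhole_data_path` is made up for illustration.

```php
<?php
// Hypothetical helper: joins a directory and a file name into one path,
// tolerating a trailing slash (or backslash) on the directory.
function blackhole_data_path(string $scriptDir, string $file = 'blackhole.dat'): string
{
    return rtrim($scriptDir, '/\\') . DIRECTORY_SEPARATOR . $file;
}

// Inside blackhole.php, __DIR__ is the directory that contains blackhole.php,
// regardless of which page (or subdirectory) performed the include.
$filename = blackhole_data_path(__DIR__);
```

Unlike a hard-coded /home/... path, this keeps working if the directory is renamed or moved, and it does not depend on DOCUMENT_ROOT being set by the server.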
Hi there http://perishablepress.com/blackhole/
I have just followed the above link into your trap yet I am still posting this. Am I missing something?
I am only asking this as whatever I do I can’t get this script to work
@ Jeff, with the fix from Kate (and my edit) it finally does work. :) Maybe you should add a line to your post. :)
Mark, in his info he says you will only be banned from the blackhole demo page on this site. After you visit that page twice, it should show you as banned from there, but you can still visit this page - otherwise you wouldn’t be able to get any further help here.
What type of site are you putting it on? Did you add the call for blackhole.php to your pages? If you’re using a script that uses a template, you may need to add it to the template. That’s what I had to do.
Kate, the demo you are talking about is at
http://perishablepress.com/press/wp-content/online/demos/blackhole/
not
http://perishablepress.com/blackhole/
yes I added the call on the pages. I tried it on two different hostings and on static html on WP and articlems sites.
On the WP site in the end I even parroted this very one, that’s how I got to test THIS blackhole that doesn’t seem to ban me. Not sure what I am doing, lol
anyway it doesn’t matter… I have a system nowadays that doesn’t even require my own hosting to make a few pennies so I don’t care about anything.
Btw about the suspension of this account, they will suspend all sites because they just don’t care about you, I have had that happen on all sorts of hosting packages over the years, what you want to do is develop a way to use FREE hostings like blogspot, wordpress etc etc using ONLY your domain and nothing else (hint hint)
Somebody put my site on a “hack this site” forum. For months I have been inundated with bots attacking my guestbook all day every day - I mean like every minute of the day. I am using the Lazarus Guestbook where “spamming is futile”. I also have a security program that prevents spamming in other areas of the site. They have never been able to hack the site or get any spam through, but I still find it stressing just reviewing my logs and looking at all their failed attempts. Also, they are using up my bandwidth.
Last night I installed the blackhole trap in two places on my site. The first one I put in the blackhole folder.
Next, I renamed my Guestbook directory to something else and added it to the robots exclusion. Then I created a new folder with the old Guestbook directory name and put the blackhole script in it.
The spamers visiting the old guestbook folder are being trapped and blocked. Today is the first day in a long time where there are zero log entries for spamming the guestbook.
Thanks! I only wish I had found it sooner.
I took a look at the PHP files. In the blackhole.php, unless I am missing something, wouldn’t it make more sense to check the whitelisted user agents against the current $_SERVER['HTTP_USER_AGENT'] rather than opening and scanning the .dat file?
Also, I would suggest breaking out of the while loop as soon as a match is found. No sense in reading the rest of the file.
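The early-exit idea could look roughly like the sketch below. It is not the code from blackhole.php: the function name `ip_is_listed` is made up, and it assumes (without having confirmed the real file layout) that each line of the .dat file starts with the IP address followed by whitespace. PHP 7 or later is assumed.

```php
<?php
// Hypothetical sketch: scan a blacklist file line by line and stop at the
// first match. ASSUMPTION: one entry per line, IP address as the first token.
function ip_is_listed(string $file, string $ip): bool
{
    $fp = @fopen($file, 'r');
    if ($fp === false) {
        return false; // unreadable or missing file: treat as "not listed"
    }

    $found = false;
    while (($line = fgets($fp)) !== false) {
        // Compare the whole first token so 192.0.2.1 does not match 192.0.2.10.
        $parts = preg_split('/\s+/', trim($line));
        if ($parts[0] === $ip) {
            $found = true;
            break; // no need to read the rest of the file
        }
    }

    fclose($fp);
    return $found;
}
```

The scan is still linear in the worst case (an IP that is not listed reads the whole file), so this helps banned visitors more than regular ones.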
Any recommendations on applying this solution to a static site? I maintain a 500 page static site. I really don’t want to change them all to php and adding a directive to the .htaccess file to process all html files as php does not work with this host.
The only way I can see is to add the IP addresses directly to the .htaccess file. But it’s a little scary to have a script editing .htacess on the fly.
@ Jennifer. User agents can be spoofed very easily which is why we’re banning based on the bots behaviour.
Not sure what to do for your static site. I wouldn’t have a script editing the htaccess file. You could maybe cron it to only update htaccess once a day/week.
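If you did go the cron route, the periodic job would only need to turn a list of banned IPs into deny directives. Here is a minimal sketch of that one step, assuming PHP 7 or later and the Apache 2.2-style Order/Deny syntax; the function name `build_deny_block` is made up, and reading the IPs out of the .dat file and writing the result safely into .htaccess are left out.

```php
<?php
// Hypothetical sketch: build an Apache 2.2-style deny block from a list of
// IP addresses. Invalid entries are skipped and duplicates are collapsed,
// so garbage in the input cannot end up in the generated configuration.
function build_deny_block(array $ips): string
{
    $lines = ['Order Allow,Deny', 'Allow from all'];

    foreach (array_unique($ips) as $ip) {
        if (filter_var($ip, FILTER_VALIDATE_IP) !== false) {
            $lines[] = 'Deny from ' . $ip;
        }
    }

    return implode("\n", $lines) . "\n";
}

// Example with documentation-range addresses (not real offenders).
echo build_deny_block(['192.0.2.10', '198.51.100.7']);
```

Validating every entry matters here because a malformed line in .htaccess can take the whole site down with a 500 error.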
The script adds the IP address to the ban list based on them visiting the forbidden page. But then when checking the ban list, the script ignores any entry that matches a whitelisted user agent. There is no additional checking that I can see. So why bother checking the ban list at all for those with a whitelisted user agent?
The only benefit this could possibly have (vs checking the user agent directly) is if this time the bot is using a whitelisted user agent, but last time it didn’t. That seems unlikely.
I know it would be resource heavy to do reverse lookup as part of blackhole.php. But what about doing a reverse lookup as part of index.php? If the user agent is a whitelisted one, then do a reverse lookup. If the reverse lookup confirms the user agent, then don’t add it to the ban list. You could still send the email out for information.
Then blackhole.php wouldn’t look at the user agent at all. If it’s on the ban list, block it.
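The reverse-lookup check could be sketched as follows: look up the host name for the IP, make sure it ends in a domain the crawler is known to use, then resolve that host name forward again and confirm it maps back to the same IP. This is only an illustration, assuming PHP 7.1 or later; `verify_crawler` and its parameters are made-up names, the allowed suffixes are whatever you choose to trust, and the two resolver arguments exist only so the logic can be tried without real DNS.

```php
<?php
// Hypothetical sketch of a forward-confirmed reverse DNS check.
// By default it uses PHP's gethostbyaddr() and gethostbynamel(); both can
// be replaced by callables with the same behavior for testing.
function verify_crawler(
    string $ip,
    array $allowedSuffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?: 'gethostbyaddr';
    $forward = $forward ?: 'gethostbynamel';

    // gethostbyaddr() returns the unmodified IP when no host name is found.
    $host = $reverse($ip);
    if ($host === false || $host === $ip) {
        return false;
    }

    // The host name must end in one of the trusted suffixes,
    // e.g. ".googlebot.com" (ASSUMPTION: supplied by the caller).
    $matches = false;
    foreach ($allowedSuffixes as $suffix) {
        if (substr(strtolower($host), -strlen($suffix)) === strtolower($suffix)) {
            $matches = true;
            break;
        }
    }
    if (!$matches) {
        return false;
    }

    // Forward lookup must lead back to the original IP, otherwise anyone
    // controlling their own reverse DNS could claim a trusted host name.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

Because this runs only in the trap page (index.php) and only for requests claiming a whitelisted user agent, the DNS cost would be paid rarely.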
Jeff thx for the amazing tool.
Can we allow a bot which has previously been blocked. I looking at this from a testing perspective to test from my computer and once I am blocked, I simply remove the block to let me through.
Regards,
Vinny
I don’t know why, but sometimes I get an email telling me I’ve been banned from my own site. I need to edit the .dat file then. Maybe it’s related to some kind of prefetching, so I promptly removed any mention of the blackhole from my browsing history. I hope this stops the nasty behavior. :)
How about storing the previous value in session:
if (isset($_SESSION['bh']) && $_SESSION['bh'])
$badbot = 1;
else
{
$badbot = 0;
…
…
…
}
…
HTML goes here…
John S. Britsios: You suggested placing a robots meta tag nofollow,noindex,noarchive in the index.php. But if the bot has already made it to the index.php to read the meta tag directive, isn’t the bot already in the blackhole folder and hence banned by then?
I think he suggested to use an auxiliary page to check against instead of the index.php. :)
BTW LOL my blackhole keeps blocking the BingBot. :D
Hi,
We run Majestic-12 Distributed Search engine project and our attention was brought to our alleged bot’s bad behavior - we were directed to this tool.
Our quick investigation has shown that the root of the problem is non-standard robots.txt that is recommended on this page, please read quote from the standard:
“Note the ‘*’ is a special token, meaning “any other User-agent”; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines. Two common errors: Wildcards are _not_ supported: instead of ‘Disallow: /tmp/*’ just say ‘Disallow: /tmp/’.”
Source: http://www.robotstxt.org/faq/robotstxt.html
This is the error that will wrongly catch our bot and others who perfectly well comply with standard robots.txt.
Please change robots.txt to be:
Disallow: /blackhole
There is NO need to add * or / at the end of the URL as prefix matching is used - this is a simple change that will give a fair chance to bots and you won’t be banning Yahoo for no reason - a white list is not a solution, supporting standards is.
I am aware that Google and some other bots support non-standard extensions, however if you just plan to allow white listed bots then there is no need to create this trap that any standard-compliant bot will fall into.
Best regards,
Alex Chudnovsky
Managing Director
Email: alexc@majestic12.co.uk
Web: http://www.majestic12.co.uk
Majestic-12 Ltd
Faraday Wharf, Holt Street
Aston Science Park, Birmingham
B7 4BB
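To make the prefix-matching point concrete, here is a minimal sketch of how a crawler that follows only the original robots.txt rules would evaluate a Disallow value, assuming PHP 7 or later. The function name `is_disallowed` is made up, and the sketch deliberately ignores the wildcard extensions that some crawlers support.

```php
<?php
// Hypothetical sketch of original-standard matching: a Disallow value blocks
// every path that begins with it. Characters such as "*" are treated
// literally, which is why a wildcard rule protects nothing for such crawlers.
function is_disallowed(string $path, string $rule): bool
{
    if ($rule === '') {
        return false; // an empty Disallow value blocks nothing
    }
    return strncmp($path, $rule, strlen($rule)) === 0;
}

var_dump(is_disallowed('/blackhole/', '/blackhole'));      // prefix rule
var_dump(is_disallowed('/blackhole/', '/*/blackhole/*'));  // wildcard rule, read literally
```

One side effect of the shorter rule: `Disallow: /blackhole` also covers any other path that begins with those characters, such as /blackhole-notes.html, whereas `Disallow: /blackhole/` covers only the directory.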
I have followed all the steps above, and I get the trap page, the Whois works, and it logs to the .dat file. I have the include in the header of my site - but I am not getting banned from the site.
Any idea where I need to look to try and make this work?
Nevermind, I found it. I also had to use the filename path the way Lazza suggested in #105 to get it to ban me. Thanks for that!
I’m glad you solved it. :)
@ Jeff, will you consider updating the path thing and the robots.txt one? IMHO those two fixes will lead to less hassle for simple users. ;)
@Jeff, thanks for the great tool !
I arrived the final step- Which index.php to edit?
I have a nearly empty index.php in root, another empty index.php in wp-content/ , the next is the index.php in my theme’s directory. Which one?
Thanks!
Nevermind, I sorted it out, it’s blackhole/index.php
@Jeff,
I got it set up and running. However, the .dat file wasn’t writable till I changed it to 777. So, now, the whole world can edit it?
Another question - from reading the above comments, it seems the Whois lookup is resource-consuming. So, why not just ban them without the whois process?
Jenny, maybe it’s the type of server you are on. Setting permissions at 666 works fine for me.
@ Jenny, you can disable it by deleting the piece of code in the script. By the way since my hosting has troubles with whois but supports remote file opening, I edited the script to “rip” the whois data from an external website (you know those free dns/whois services online). :P
For some reason I can’t seem to ban myself!!
I’m getting a 500 internal server error each time I type http://[my domain]/blackhole/index.php into the address bar of a browser to test it all out.
Question - when you have caching enabled (WP Super Cache for me), it can cache the rejected page. How would you avoid that happening?
Problem with phpBB2
I’m trying to use blackhole with Extreme-Styles Mod which allows includes to be used inside TPL templates.
In overall_header.tpl I can include a /blackhole/test.txt file.
I can not include /blackhole/blackhole.php
Any thoughts greatly appreciated.
PS
I have blackhole installed on 90 dynamic storefronts with no issues.
Thanks for a great script!
To enable PHP in PhpBB templates (without mods):
http://pixopoint.com/forum/index.php?topic=1825.msg8964#msg8964
BTW I’m not sure it can handle php includes, I guess it can.
Lazza: The link you used is for a conversation regarding phpBB3, not phpBB2.
Completely different animals.
phpBB2 does require the mod.
I do appreciate the reply.
Apologies, I was in a hurry when reading your question. :) I suggest you upgrade to phpBB3 anyway, I had very heavy spam and security issues with phpBB2. :(
This site relies heavily on the m2f mod (mail2forum).
bluequartz.us/phpBB2/
Just looking to be kept in the loop on the evolution of this method of dealing with the dark side of the Net.
I tried six different ways to Sunday to share modifications I made to my index.php file in my bot trap directory using the tags, but no luck.
In the interest of sharing my modifications with those interested in using or improving them, I created a simple page with Google to do so:
https://sites.google.com/site/phpblackholemods/
The only thing I haven’t quite figured out is the last link for user-agent-string . info
Hi,
your Blackhole script is a great idea, I like it very very much :)
One thing you could change about checking remote IP address.
Your script does this by simply reading one value ($target = $REMOTE_ADDR;). But when someone uses a proxy, you are blackholing the proxy’s IP address, not the user’s IP.
In some cases I also want to know who’s knocking on my door. This is how I check the IP address:
function getIP() {
    if (getenv("HTTP_CLIENT_IP") && strcasecmp(getenv("HTTP_CLIENT_IP"), "unknown"))
        $ip = getenv("HTTP_CLIENT_IP");
    else if (getenv("HTTP_X_FORWARDED_FOR") && strcasecmp(getenv("HTTP_X_FORWARDED_FOR"), "unknown"))
        $ip = getenv("HTTP_X_FORWARDED_FOR");
    else if (getenv("REMOTE_ADDR") && strcasecmp(getenv("REMOTE_ADDR"), "unknown"))
        $ip = getenv("REMOTE_ADDR");
    else if (isset($_SERVER['REMOTE_ADDR']) && $_SERVER['REMOTE_ADDR'] && strcasecmp($_SERVER['REMOTE_ADDR'], "unknown"))
        $ip = $_SERVER['REMOTE_ADDR'];
    else
        $ip = "unknown";
    return($ip);
}
So, I just use more than REMOTE_ADDR. Proxy servers should pass clients’ IP addresses in the ‘HTTP_X_FORWARDED_FOR’ variable. Of course, some of them don’t… but it’s still worth checking. If a proxy hides the client’s IP, then it should be banned ;)
@ascetix: Looks awesome. I’ll be experimenting with this method and try to integrate into the next Blackhole update.
Thanks for sharing :)
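One caution before adopting getIP() as-is: the Client-IP and X-Forwarded-For headers are supplied by the client, so a bad bot can put any value in them, including someone else's address or text that is not an IP at all. A more defensive variant, sketched below under the assumption of PHP 7 or later, trusts the forwarded header only when the direct connection comes from a proxy you have listed yourself, and validates whatever it reads. The name `client_ip` and the `$trustedProxies` parameter are made up for illustration.

```php
<?php
// Hypothetical sketch: determine the client IP without trusting forged headers.
// $server is normally $_SERVER; $trustedProxies lists proxy IPs you control.
function client_ip(array $server, array $trustedProxies = []): string
{
    $remote = isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';

    // Only look at X-Forwarded-For when the request came through a known proxy.
    if (in_array($remote, $trustedProxies, true) && !empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header may hold a comma-separated chain; take the first entry,
        // which is the original client as reported by the proxy.
        $parts = explode(',', $server['HTTP_X_FORWARDED_FOR']);
        $candidate = trim($parts[0]);
        if (filter_var($candidate, FILTER_VALIDATE_IP) !== false) {
            return $candidate;
        }
    }

    // Fall back to the address of the direct connection.
    return filter_var($remote, FILTER_VALIDATE_IP) !== false ? $remote : 'unknown';
}
```

With an empty `$trustedProxies` list this behaves like the original REMOTE_ADDR check, which also guarantees that only validated addresses are ever written to the .dat file.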
For some reason I can’t get it working.
This is what I get:
Warning: fopen(blackhole.dat) [function.fopen]: failed to open stream: No such file or directory in /home/mysite/public_html/blackhole/blackhole.php on line 38
Error opening file...
You can find the solution in Kate’s comment -> http://perishablepress.com/press/2010/07/14/blackhole-bad-bots/#comment-81106 and the next one which is mine. :)
@Lazza: thanks a bunch! I’m a freaking noob, LOL!
It should also have the option of using an open MySQL connection.
Well it seems that the msn bot gets trapped in the blackhole as well: User Agent: msnbot/2.0b
Probably it’s time to update the whitelisting?
Should “Dow Jones Searchbot” be whitelisted?
Hi Jeff,
I just found this great script of yours and have it installed on my site. I disguised the footer script by using the same colour for the text as I had for the background, and indented the text -999px. Works great. I tested it a couple of times and got caught. Easiest install ever!
PS. Have Digging into Wordpress - absolutely love it!
Why not white list all of the major search engine IP addresses instead of their user agents?
There are good lists available, for example http://www.iplists.com/
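Whitelisting by address range mostly comes down to a CIDR match. Here is a minimal IPv4-only sketch, assuming 64-bit PHP 7 or later; the function names are made up and the ranges in the test are documentation addresses, not real search engine ranges, which you would have to take from a list you trust and keep up to date.

```php
<?php
// Hypothetical sketch: test whether an IPv4 address falls inside a CIDR range
// such as "192.0.2.0/24". A range without "/bits" is treated as a single host.
function ip_in_cidr(string $ip, string $cidr): bool
{
    $parts  = explode('/', $cidr, 2);
    $subnet = $parts[0];
    $bits   = isset($parts[1]) ? (int) $parts[1] : 32;

    $ipLong     = ip2long($ip);
    $subnetLong = ip2long($subnet);
    if ($ipLong === false || $subnetLong === false || $bits < 0 || $bits > 32) {
        return false; // malformed input never matches
    }

    // Build a 32-bit mask with the top $bits bits set (assumes 64-bit integers).
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// True if the address matches any range in the list.
function ip_is_whitelisted(string $ip, array $ranges): bool
{
    foreach ($ranges as $range) {
        if (ip_in_cidr($ip, $range)) {
            return true;
        }
    }
    return false;
}
```

The trade-off against user-agent whitelisting is maintenance: addresses cannot be spoofed the way a user agent can, but published ranges change and the list goes stale unless someone updates it.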
I installed Blackhole on my site yesterday, but realized that it might cause some unwanted side-effects and deleted it the same night. I removed the php call in the header and footer.
However, today while I was in my WordPress Admin I used the browser’s back navigation button, and I was immediately blacklisted from my WordPress admin and received the Blackhole page saying my IP was blocked. I and my host’s techsupport spent an hour looking for any remnant of blackhole and could find none. Now, it blacklists any IP that tries to login to my WP admin. Please tell me how to undo this.
Hey Martin, make sure the blackhole.php file is deleted from the server. That’s what actually does the blocking, so if it’s gone, it can’t block anything. If you’ve removed the entire blackhole directory and are still getting blocked, something else is causing it. None of the blackhole files make any changes or add any new code or files anywhere, so their removal should rule them out of the equation.
Hello Jeff, and thank you for all your hard work.
Couple things scare me a bit: Having a .htaccess file in the directory on top of the already exsisting one in my root. (had issues with this in the past).
Also, the hidden text. Would it be better to do a 1px image?
Cheers!
Jeff,
Using a hidden property inside a link is a dead give-away for bots to not follow that link.
Try to keep the link friendly.
A better practice would be to remove the hidden property from the link. Replace it with a class property instead. Then use an external css to call the hidden value.
Also, use a simple text name for the link.
example:
<a href="/happy/" rel="nofollow">great place!</a>
I would also change the name of blackhole directory to something less descriptive/known.
Jeff,
The .htaccess file inside the Blackhole directory is extremely important to keep anyone that knows the Blackhole directory structure from viewing your dat file.
An .htaccess file can be very useful in any directory to handle access to that directory tree.
Oops..
<a href="/happy/" rel="nofollow">great place!</a>
would be nice if this list editor had a preview
hey jeff awesome stuff here, a question … how do i re-allow a blocked bot
Hey Ayoosh, you should be able to find the IP in the .dat file and then just remove that line from the blacklist.
ok so thats how it works?? (im a beginner 15 yrs old sry). whoever falls into the trap, his IP is written into the blackhole.dat file. whenever the bot tries to visit again, blackhole.php checks it against the .dat file and does it stuff? right?
one more question. ive been caching my files with .htaccess using the expires headers technique… can it affect something? like caching the banned thing and showing it even if the person is not banned. ive not cached .php files
Ayoosh,
The data used to determine whether to block a site is located in the .dat file, which is accessed at the server level by the blackhole script and thus shouldn’t be affected by htaccess caching rules directed at the client/browser.
I like the concept though I feel that blocking IP addresses should be handled by the webserver (apache) or a firewall. It’s just a gut feeling.
How can I manually add IP addresses to block into my blackhole.dat file? I have a few known problems that I just have the IP address for, not the additional information that Blackhole logs.
Do I just list these IP addresses without the accompanying “Get” date, user agent, etc.?
Barbara, add the IP’s you want to block to your .htaccess file like so:
Order Allow,Deny
Allow from all
Deny from 65.55.3.211
Deny from 72.229.57.27
Deny from 77.93.2.81
Ok, forgot to use code tags, let’s see if this one comes out right:
<Limit GET POST PUT>
Order Allow,Deny
Allow from all
Deny from 65.55.3.211
Deny from 72.229.57.27
Deny from 77.93.2.81
Deny from 77.221.130.18
Deny from 91.205.96.13
Deny from 94.75.229.132
Deny from 95.108.157.252
Deny from 99.22.93.95
Deny from 173.193.219.168
Deny from 174.133.177.66
Deny from 178.234.154.230
Deny from 178.33.3.23
Deny from 190.174.198.86
Deny from 203.89.212.187
Deny from 207.241.228.166
Deny from 213.55.76.224
Deny from 216.171.98.77
</Limit>
Thanks paperboy. I wanted to still get the nice banning message screen that I did with the blackhole, so I found that entering the IP address with - GET on the end into the blackhole.dat file also works for the banning.
Very nice work. I discovered this after Slurp took 3.8Gigs of data from my site and I had enough! Also nice to sprinkle into the index pages of directories no one should know about.
- Kris
I don’t understand how the entries in the .dat file prevent the bot from visiting the site. Do these entries become “Deny from” lines in the .htaccess file?
@Jack A: the .dat file is read by the script, which then allows or blocks accordingly. No changes are made to your .htaccess file.
Ok, so now I’ve been using blackhole for a week and have some questions.
If I put the hidden link to my forbidden folder at the bottom of my page, doesn’t that mean that the crawler has already visited many of my site’s pages?
Do I need to include blackhole.php on all of my pages?
The next time the bot visits my site is when I catch him, right?
@Jack A: Great question. The goal of this script is to catch bad bots. I.e., the ones that don’t obey your robots.txt directives, such as the one that disallows access to the blackhole directory.
Yes you should include the blackhole.php script on any page that you want to protect against bad bots. An easy place for this using WordPress is the header.php file.
Here’s basically how it works:
I hope this helps, and refer back to the article and the comment thread for more info.
Hi, Jeff
Firstly, thank you for sharing this great little tool.
I had some problems setting it up and had to read through the whole thread to fix them. So thanks, also, to everyone who has contributed to making the script even better.
The particular issue I had was that before my server would stop throwing include and fopen errors, I had to alter the filepaths in blackhole.php and index.php to:
/home/xxxxxx/public_html/renamed-blackhole/blackhole.dat
…and, change /renamed-blackhole/ in the header include, of course.
————-
Thank you to people who posted a reminder that GoogleBot will use “display:none” as a red flag. It will also do so for certain security scan plugins for WP.
So now I’ve just set a 1px transparent gif before the closing body tag.
Is this good enough?
A couple of people mentioned using the footer include to link to yet another page where the link to the trap resides.
Maybe I didn’t read clearly enough, but I don’t really understand the need for this.
Can someone tell me the advantage of doing this, please?
————
Jeff, a couple of lingering questions/concerns:
1. Two people (Slava and Alex) have called attention to the wildcard asterisk in the robots.txt file.
Are they right or wrong?
Should I put THIS in my robots.txt file:
Disallow: /*/renamed-blackhole/*
Or THIS:
Disallow: /renamed-blackhole/
Or THIS:
Disallow: /renamed-blackhole
2. Setting writable permissions to the dat file and having it “in the open” for anyone who knows (or guesses) the directory name.
Jeff, what are the minimum permissions I can set?
At the moment, I have 755 but that makes me nervous. Perhaps I’m just paranoid. If so, feel free to tell me so ;)
Also on this note, fWolf suggested renaming the dat file to avoid having it read from outside (and someone else further down referred to this, too).
Is this a legitimate concern? And if so, how might we mitigate against it?
Is there a way (like with htaccess files) of “locking down” this file and not making it readable?
Thanks again to you and everyone else contributing to this post!
Oh, and one more:
3. There haven’t been any comments on the mods posted here:
https://sites.google.com/site/phpblackholemods/
@Dave: What do these changes do? Are they only for use in certain circumstances? What potential conflicts are there? Et cetera.
The question is how I can manually add IP addresses to block into my blackhole.dat file?
Works like a charm!! Many thanks. Every day more than 4 bad bots are captured ;)
Keep going
Seymour
how do i get this to work on a vbulletin forum?
everyone works fine except for the banning part. i cant seem to get that to work
i get a syntax error no matter where i put this
i put it everywhere possible. error, error error. ive been at this for hours
i tried my header, skin, html, include.php, everything. it doesnt seem to go well with vbulletin its there a different way that i could do that that might work?
doesnt show for some reason but im referring to the step about what is suppose to be put at the top of each page. the step that was reccomended for the header
cant get that to work on vbulletin. plz work. i really wanna ban these bots automatically
nevermind i got it to work after hours after testing
Hey, I just downloaded this, and after a couple touch up’s it works great. Couple things I did were move blackhole.dat and blackhole.php out of my www directory. Then I made a php file called footer.php that just inserts the link for bad bots to follow.
Then I edited my php.ini and added blackhole.php to auto-prepend and footer to auto-append so it covers my whole site. Then you can put index.php and the .htaccess file wherever browsing shouldn’t happen.
I’ve had a problem with people getting a link to parts of my site that don’t exist anymore, and it gets a hit every minute. So I slapped the index.php in there and also a mod_rewrite so if they try to access anything, it puts them at index.php.
Just some ideas for other people.
I also think that renaming the blackhole directory and blackhole.php to something random is a good idea, so that bots can’t filter for it. Also, in blackhole.php there were a couple of errors that were propagating through to my apache error log (a couple of uninitialized variables).
Anyways, great idea and thanks.
i just wanna say thanks a ton for this jeff. this has really helped. my site was getting killed by bots and the first day of installing this i caught over 40 of them