
Protect Your Site with a Blackhole for Bad Bots

One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS lookup and records the event in the blackhole data file. That data file doubles as the blacklist, so once a bad bot is recorded there, it is immediately denied access to your site. I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.

In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. The blackhole script combines heavily modified versions of the Kloth.net script (for the bot trap) and the Network Query Tool (for the whois lookups). Refined over the years and completely revamped for this tutorial, the Blackhole consists of a single plug-&-play directory that contains the following four files:

  • .htaccess – basic directory protection
  • blackhole.dat – server-writable log file (serves as the blacklist)
  • blackhole.php – checks requests against blacklist and blocks bad bots
  • index.php – generates blackhole page, performs whois lookup, sends email, and logs data

These four files are all contained in a single directory named “blackhole”.
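
To make the logging part of index.php more concrete, here is a minimal sketch of appending one trap hit to the data file. It is not the Blackhole's actual code: the function name `blackhole_log_hit` and the record layout (IP first, then a timestamp and the user agent) are assumptions for illustration only.

```php
<?php
// Hypothetical helper: append one trap hit to the blacklist data file.
// Assumed record layout (not confirmed by the article): "IP timestamp user-agent".
function blackhole_log_hit(string $file, string $ip, string $userAgent, int $time): bool
{
    // Collapse whitespace (including newlines) so one hit stays on one line.
    $ua = trim(preg_replace('/\s+/', ' ', $userAgent));
    $line = sprintf("%s %d %s\n", $ip, $time, $ua);

    // FILE_APPEND adds to the end of the file; LOCK_EX avoids interleaved writes.
    return file_put_contents($file, $line, FILE_APPEND | LOCK_EX) !== false;
}
```

Keeping the IP as the first field of each line makes the later lookup a simple per-line comparison.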

Installation Overview

I set things up to make implementation as easy as possible. Here are the five basic steps:

  1. Upload the /blackhole/ directory to your site
  2. Ensure writable server permissions for the blackhole.dat file
  3. Add a single line to the top of your pages to include the blackhole.php file
  4. Add a hidden link to the /blackhole/ directory in the footer of your pages
  5. Prohibit crawling of the /blackhole/ by adding a line to your robots.txt file

It’s that easy to install on your own site, but there are many ways to customize functionality. For complete instructions, jump ahead to Implementation and Configuration. For now, I think a good way to understand how it works is to check out a demo.

One-time Live Demo

I have set up a working demo of the Blackhole for this tutorial. It works exactly like the download version, but it’s configured to block you only from the demo, not from the entire site. Here’s how it works:

  1. First visit to the Blackhole demo loads the trap page, runs the whois lookup, and adds your IP address to the blacklist data file
  2. Once you’re added to the blacklist, all subsequent requests for the Blackhole demo will be denied access

So you get one chance to see how it works. Once you visit, your IP will be blocked from the demo only – you will still have full access to this tutorial (and everything else). That said, here is the demo link: Blackhole Demo. Visit once to see the Blackhole trap, and then again to observe that you’ve been blocked. If I were to include the blackhole.php in the header of my theme files, you would be banned from pretty much the entire site.

Implementation and Configuration

Here are complete instructions for implementing and configuring the Perishable Press Blackhole:

Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory. This location is not required, but it enables everything to work out of the box. To use a different location, edit the include path in Step 3.

Step 2: Change file permissions for blackhole.dat to make it writable by the server. The permission settings may vary depending on server configuration. If you are unsure about this, ask your host. Note that the blackhole script needs to be able to read and write the blackhole.dat file; as a plain data file, it should not need execute permission.

Step 3: Include the bot-check script by adding the following line to the top of your pages:

<?php include($_SERVER['DOCUMENT_ROOT'] . "/blackhole/blackhole.php"); ?>

The blackhole.php script checks the request IP against the blacklist data file. If a match is found, the request is blocked with a customizable message. See the source code for more information.
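
As a rough illustration of that check, the sketch below tests whether an IP appears in the data file. It is a simplified stand-in, not the shipped blackhole.php: the function name `blackhole_is_listed` is hypothetical, and it assumes each line of blackhole.dat starts with the IP followed by a space.

```php
<?php
// Hypothetical helper: true when $ip is the first field of any line in $file.
function blackhole_is_listed(string $file, string $ip): bool
{
    if (!is_readable($file)) {
        return false; // no readable list means nobody is banned yet
    }
    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $fields = explode(' ', trim($line));
        // Exact string comparison, so 1.1.1.1 never matches 121.1.10.15.
        if ($fields[0] === $ip) {
            return true;
        }
    }
    return false;
}

// Possible usage at the top of a page (the message text is a placeholder):
// if (blackhole_is_listed(__DIR__ . '/blackhole.dat', $_SERVER['REMOTE_ADDR'])) {
//     header('HTTP/1.1 403 Forbidden');
//     exit('You have been banned from this site.');
// }
```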

Step 4: Include a hidden link to the /blackhole/ directory in the footer of your pages:

<a style="display:none;" href="http://example.com/blackhole/" rel="nofollow">Do NOT follow this link or you will be banned from the site!</a>

This is the hidden link that bad bots will follow. It’s currently hidden with CSS, so 99% of visitors won’t ever see it. To hide the link from users without CSS, replace the anchor text with a transparent 1-pixel GIF image.

Step 5: Finally, add a Disallow directive to your site’s robots.txt file:

User-agent: *
Disallow: /blackhole/

This step is pretty important. Without the proper robots directives, all bots would fall into the Blackhole because they wouldn’t know any better. If a bot wants to crawl your site, it must obey the rules! The robots rule that we are using basically says, “All bots DO NOT visit the /blackhole/ directory or anything inside of it.” More on this in the next section.

Further customization: The previous five steps will get the Blackhole working, but the index.php requires a few modifications. Open the index.php file and make the following changes:

  • Line #54: Edit the path to your site’s robots.txt file
  • Line #56: Edit the path to your contact page (or email address)
  • Lines #140/141: Edit email address with your own
  • And in blackhole.php, edit line #53 with your contact info

These are the recommended changes, but the PHP is clean and generates valid HTML5, so feel free to modify the source code as needed. Note that beyond these four items, no other edits are required.

Caveat Emptor

Blocking bots is serious business. Good bots obey robots.txt rules, but there may be potentially useful bots that do not. Yahoo is the perfect example: it’s a valid search engine that sends some traffic, but sadly the Yahoo Slurp bot is too stupid to follow the rules. Since setting up the Blackhole several years ago, I’ve seen Slurp disobey robots rules hundreds of times. Bottom line: the Blackhole will block any bot that disobeys the robots.txt directives. Proceed accordingly. Update: By default, the Blackhole no longer blocks any of the popular search engines. See the next section for more information.

Whitelisting Search Bots

Initially, the Blackhole blocked any bot that disobeyed the robots.txt directives. Unfortunately, as discussed in the comments, Googlebot, Yahoo, and other major search bots do not always obey robots rules. And while blocking Yahoo! Slurp is debatable, blocking Google, MSN/Bing, et al would just be dumb. Thus, the Blackhole now “whitelists” any user agent identifying as any of the following:

  • googlebot (Google)
  • msnbot (MSN/Bing)
  • yandex (Yandex)
  • teoma (Ask)
  • slurp (Yahoo)

Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “hey look, I’m teh Googlebot!” and the whitelist would grant access. It is possible to verify the true identity of each bot, but as X3M explains in the comments, doing so consumes significant resources and could overload the server. Avoiding that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines.
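
A user-agent whitelist of this kind can be sketched in a few lines. The substrings come from the list above; the constant and function names are hypothetical, and the shipped script may implement the check differently.

```php
<?php
// Substrings from the article's whitelist; the names below are hypothetical.
const BLACKHOLE_WHITELIST = ['googlebot', 'msnbot', 'yandex', 'teoma', 'slurp'];

function blackhole_is_whitelisted(string $userAgent): bool
{
    foreach (BLACKHOLE_WHITELIST as $needle) {
        // stripos() is case-insensitive, so "Googlebot/2.1" matches "googlebot".
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

Because this only inspects a string the client controls, it carries exactly the spoofing weakness described above.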

License and Disclaimer

The Perishable Press Blackhole is released under the GNU General Public License. Check the Creative Commons for a summary and/or see the Blackhole source code for additional information. Also note that by downloading the Blackhole, you agree to accept full responsibility for its use. In no way shall the author be held accountable for anything that happens after the file has been downloaded.

Blackhole Download

Here you can download the current version of the Blackhole:

Perishable Press Blackhole for Bad Bots
    [ version 1.2 | .zip format | 5K | 568 downloads ]

About this article

This is article #759, posted by Jeff Starr on Wednesday, July 14, 2010 @ 09:30am. Categorized as Websites, and tagged with blacklist, htaccess, php, robots, security, tutorials. Updated on July 14, 2010. Visited 157767 times. 56 Responses »



56 Responses

1 • July 14, 2010 at 10:10 am — X3M says:

According to reverse DNS lookup, Google bots seem to ignore robots.txt sometimes. By using this solution you are at risk to ban the Google bot from your site.

2 • July 14, 2010 at 10:19 am — Gabe says:

I’m running a WordPress site and when I tried to install right out of the box I got a PHP failed to open file error/warning. I had to modify the path in the blackhole.php file at line 37 to the absolute path of the .dat file. Not sure if I did something wrong, or if this is due to the WP installation, or what, but it worked after I did that. Also, once an IP is banned - it gives the notice and says to contact @perishable to work things out. So I had to modify line 56 of the blackhole.php file.

3 • July 14, 2010 at 10:24 am — X3M says:

Well, I just had a quick look at the code…

Consider the following situation: $_SERVER['REMOTE_ADDR'] is 121.1.10.15, the IP in blackhole.dat is 1.1.1.1.

What will ereg('1.1.1.1', '121.1.10.15') return? Yes, it will return 1. Just because the dot matches one single character, “1.1” will match 121 but this is not what you want. Likewise, if REMOTE_ADDR is 1.1.1.10, it will still be blocked by 1.1.1.1.

Not to mention that ereg() is deprecated. If you compare IP addresses, why not just use string comparison? E.g.,

if ($u[0] == $_SERVER['REMOTE_ADDR']) ++$badbot;

Hope this makes sense.
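
The difference is easy to reproduce. In the sketch below, `preg_match()` stands in for `ereg()`, which was removed in PHP 7; both helper names are hypothetical.

```php
<?php
// Stand-in for the old ereg() call: the listed IP is used as a raw pattern,
// so each "." matches any single character.
function ip_matches_loose(string $listed, string $remote): bool
{
    return preg_match('/' . $listed . '/', $remote) === 1;
}

// Plain string comparison: the two addresses must be identical.
function ip_matches_exact(string $listed, string $remote): bool
{
    return $listed === $remote;
}

var_dump(ip_matches_loose('1.1.1.1', '121.1.10.15')); // bool(true): "1.1" matches "121"
var_dump(ip_matches_loose('1.1.1.1', '1.1.1.10'));    // bool(true): substring match
var_dump(ip_matches_exact('1.1.1.1', '121.1.10.15')); // bool(false)
var_dump(ip_matches_exact('1.1.1.1', '1.1.1.1'));     // bool(true)
```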

4 • July 14, 2010 at 10:42 am — B. Moore says:

WOW, that’s awesome! Another great tutorial/system that I need to install on my site.

THANK YOU!

5 • July 14, 2010 at 11:00 am — Jeff Starr says:

@X3M: That’s definitely a better way to do it. The code has been updated with the string comparison code. Thank you!

@Gabe: Some servers may require absolute paths. Note that there is also an instance of the blackhole.dat path/name in the index.php file. If one is changed to absolute path, both should. And thanks for the reminder about the @perishable contact thing.. I’ll be adding that step to the article promptly.

@B. Moore: My pleasure :)

6 • July 14, 2010 at 11:16 am — Gabe says:

@Jeff - Cool, thanks!

7 • July 14, 2010 at 11:22 am — Michael Clark says:

My own bad-bot-blocker script catches some GoogleBots. It’s very annoying that they ignore the robots.txt and rel="nofollow" commands. But be careful. Some bad bots out there spoof their user-agent, so you can’t simply automatically allow any user-agent of GoogleBot access to your site.

I also have the referer sent to me in my alerts.

8 • July 14, 2010 at 11:28 am — X3M says:

@Michael – you can.

$ptr = gethostbyaddr($_SERVER['REMOTE_ADDR']);
if ('.googlebot.com' == substr($ptr, -14)) {
// This is GoogleBot
}

100% accuracy is not guaranteed though – use at your own risk :-)
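
One way to tighten the check above is to forward-confirm the PTR name, so that a faked reverse record alone is not enough. The sketch below is an illustration under assumptions: the function name is hypothetical, the resolver callables are injectable only so the logic can be tested or cached, and `gethostbynamel()` covers IPv4 only.

```php
<?php
// Hypothetical helper. $reverse and $forward default to PHP's resolver
// functions and can be swapped for stubs or a caching layer.
function is_verified_googlebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    // gethostbyaddr() returns the unmodified IP on failure, false on malformed input.
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    // Same suffix test as in the comment above.
    if (substr($host, -14) !== '.googlebot.com') {
        return false;
    }
    // Forward-confirm: the PTR name must resolve back to the same IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

Each call can trigger two DNS lookups, so on a busy site the results would need caching.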

9 • July 14, 2010 at 11:45 am — Jeff Starr says:

@X3M: I am thinking that checking for googlebot would be another improvement for the script. Maybe as a condition for if ($u[0] == $_SERVER['REMOTE_ADDR']) ++$badbot; to execute?

10 • July 14, 2010 at 12:03 pm — X3M says:

@Jeff - yes, it could be.

You can also detect Yahoo! Slurp, MSNbot (or whatever they call it now) etc.

The PTR record for Yahoo! ends with .crawl.yahoo.net, for MSN/Bing with .search.msn.com, for Yandex with yandex.ru, etc.

However, if you have a high traffic website you must be aware that calls to gethostbyaddr() can be rather expensive - you can easily overload your ISP’s DNS server. You will probably need to have a local caching DNS server installed. If DNS server gets overloaded/goes down, page loading will freeze until connection timeout occurs in gethostbyaddr().

So if you want to include this feature, I would suggest a configuration option to turn it on or off.

11 • July 14, 2010 at 12:14 pm — Skye says:

Thanks Jeff. Great idea. I definitely want to test this out.

12 • July 14, 2010 at 3:30 pm — Paperboy says:

I’ve used a trap like this (the one from kloth.net I think) for years now on my nsfw wallpaper site and it has worked wonders - Not only does it stop strange bots but also site rippers and all kinds of mass downloaders! :D

Never had any trouble with google or yahoo falling into the trap either!

I will definitely try this version of the trap soon. Thanks Jeff!

13 • July 14, 2010 at 3:48 pm — Jeff Starr says:

Update: Thanks to help from X3M, the Blackhole now whitelists the major search engines: Googlebot, Slurp, msnbot, Teoma, Yandex. Please see this section in the article for more information. If you are using a version less than 1.2, it is recommended to update.

14 • July 14, 2010 at 6:14 pm — Carlos Vazquez says:

I’m not an expert like you and I can’t contribute anything, just I can say thank you. I will test it.

15 • July 15, 2010 at 12:54 am — fwolf says:

I’m using a rather similar system to block access to some parts of my site, esp. to the download section for my project history. That specifically is this way because I do not want to get my complete contact data indexed by some spam bot, Google or anything else.

A short suggestion to improve your system: Rename the DAT-Logfile to .dat.php to avoid getting it read from the outside, because there are lots of scenarios where you simply CANNOT put this someplace under the web root and/or not being able to properly set the access rights.

cu, w0lf.

16 • July 15, 2010 at 3:14 am — Daan says:

Hi Jeff, that’s an interesting idea, but I do foresee 2 problems:

- Some browsers/addons prefetch links on the page (e.g. the Firefox fasterfox addon)
- Competitors could make their visitors visit your blackhole, for example by including an image pointed to ‘http://site/blackhole/’ in their HTML, thereby banning them from your site.

I think the best solution would be to split the trap into two pages. The /blackhole/ page doesn’t ban the user but links to another page that does. The URL of that page could depend on the IP of the user, for example “http://site/blackhole/?key=”+MD5(ip+”secret”). That way, there would be no way of hotlinking your ban page, and prefetching is allowed one page in advance. :)
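
The keyed-URL idea could be sketched like this. It is an illustration only: the helper names and the secret are hypothetical, and HMAC-SHA256 stands in for the MD5(ip + secret) construction mentioned above.

```php
<?php
// Hypothetical helpers; the secret is a placeholder you would replace.
function trap_key(string $ip, string $secret): string
{
    // Keyed hash of the visitor's IP, so the ban URL differs per visitor.
    return hash_hmac('sha256', $ip, $secret);
}

function trap_key_is_valid(string $ip, string $key, string $secret): bool
{
    // hash_equals() compares in constant time.
    return hash_equals(trap_key($ip, $secret), $key);
}

// First page: only emits the second-stage link and never bans.
function trap_link(string $ip, string $secret): string
{
    return '/blackhole/?key=' . trap_key($ip, $secret);
}
```

The ban page would then blacklist the visitor only when `trap_key_is_valid()` succeeds for the requesting IP, so a hotlinked URL built for someone else's address does nothing.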

Another interesting solution might be to automatically unban anyone with Javascript enabled, since most bots don’t use Javascript.

Thanks for the script!

17 • July 15, 2010 at 4:29 am — Guyaume B. Parenteau says:

Very useful ! I was thinking about something to block bad robots and stumble on it ! Wunderbar !

Thanks

18 • July 15, 2010 at 7:31 am — Jamal Mohamed says:

Thanks Jeff for this awesome trap. I used to block bad bots manually (via .htaccess) and when I hooked up Blackhole on a test server it worked like nuts. I got to implement it on several websites under my belt, soon.

19 • July 15, 2010 at 8:10 am — Michael Clark says:

In version 1.2 why are there already 55 lines of IP addresses and other details in the blackhole.dat file?

I recommend renaming the blackhole.dat file to be .htblackhole.dat. Most Apache servers will not allow anyone to download any file that starts with .ht. After renaming the file, you need to edit line 37 in blackhole.php and line 127 in index.php.

You should also add the exclusion command to your robots.txt file. Then activate the actual blackhole several days later. Search engines do not check the robots.txt file every time they visit your site. Most cache the instructions for anywhere from 1 to 7 days.

Final recommendation is to change the name of the directory to something innocuous. Bad bots may try to avoid directory names like “blackhole.”

20 • July 15, 2010 at 8:15 am — Michael Clark says:

And you also need to tweak the footer of index.php to report the current version number, it still says 1.1.

21 • July 15, 2010 at 8:16 am — Michael Clark says:

And you also need to tweak the footer of index.php to report the current version number, it still says 1.1.

22 • July 15, 2010 at 9:04 am — Nathan B says:

I don’t know much about this, but I’m curious: does publishing a technique like this make it more vulnerable to being beaten/worked around by spammers? I was wondering recently about honeypots on contact forms, which I guess can now be beaten by spammers. Was too much published about honeypots, or was it too simple a technique to keep spammers at bay for long?

23 • July 15, 2010 at 9:08 am — John S. Britsios (Webnauts) says:

I do not think that it is safe to hide text as you recommend using this code: Do NOT follow this link or you will be banned from the site!

The risk is that “display:none” for text can trigger spam filters especially with Google. Even if you did not have any evil intentions.

I would rather use javascript in combination with a “noscript” tag, or instead of using an anchor text, I would use a linked transparent image.

Another thing I would like to mention is, that using the “nofollow” attribute for internal links is not advisable. With such practices you can dilute PageRank.

So for this case, I would recommend implementing in the index.php file the robots meta tag directives “noindex,nofollow,noarchive”.

In that case the major search engines will still access the page, crawl it but will not index it.

But! The PageRank will still flow to the pages which are linked from that page. If they are external site links, there you can block passing PageRank with the rel=”nofollow” attribute. But you must make sure that you have at least one link that the PR must pass to, i.e to the homepage of your site, otherwise you will have again a PageRank dilution, because you have created a so known as dangling or dead end page.

I hope you will update above and if not, I will take care and modify all that before I implement.

By the way great job Jeff!

24 • July 15, 2010 at 10:48 am — Jan says:

HI Jeff, thank you very much for this great little plugin!

I’ve just installed it on my site and found out that my site was being crawled by “Baiduspider”. It appears to be a search engine from China, and it’s disregarding the nofollow rule and the robots.txt.

Does anyone have any experience with this spider?
Do you think this might be a harmless bot?

I just do not like the fact that it’s disregarding the rules.

Here are some info I’ve got:

IP Address: 123.125.66.22
User Agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)

25 • July 15, 2010 at 11:29 am — Laserpointa says:

I came across this topic, years ago right here:
http://board.protecus.de/t21590.htm

Several hints:
# a lot of bots put the googlebot string in their user agent - that’s why people think that your script blocks the googlebot. Usually it doesn’t need to be whitelisted.
# only checking the user agent is bad in this case. Google explains how to verify their bot correctly over there: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
# the best way to install this script is to add it to the robots.txt and wait a day before adding the link somewhere in a hidden div.

Laserpointa from Protecus Forum

26 • July 15, 2010 at 12:16 pm — darrinb says:

This is awesome! I’m experimenting around with it, turning it into an automated WP plugin, and I’ve managed to ban myself about 5 times from my site! lol

27 • July 15, 2010 at 1:45 pm — Bob Dole says:

Doesn’t seem to be working for me. After installation I’m getting no error, but in attempting to get myself banned by repeatedly visiting the banned page, nothing happens, and the dat file never gets written to. I’ve changed permissions to allow full access to everything (just to test), and still nothing gets written.

I receive the “Bad Bot” email, the “bad bot” page displays fine, and there are no errors reporting in my error log.

Any ideas?

28 • July 15, 2010 at 1:55 pm — Bob Dole says:

Figured it out.. nm.

29 • July 15, 2010 at 5:53 pm — Jonathan Steele says:

Love it.

Now, can you make a black hole for all the spam posters to my blog. The ones that post…really love the way you write and your content is so good, never though along these lines before but you enlightened me….on my art blog with only pictures of my art or on my about page.

I would love to develop a list and block them from ever even trying to post.

Thanks again…

30 • July 15, 2010 at 6:10 pm — Julian says:

is there any “dynamic” for this instead of manually listing and typing into the file?

31 • July 16, 2010 at 4:32 am — Lazza says:

Is it ok that the .dat file has got some IPs inside? And what about the concerns of prefetching and search engines antispam features?

32 • July 17, 2010 at 4:08 pm — Philippe Butzmann says:

Hi,

found your Site via stumble upon. very nice and informative.
Keep up the good work. It has really touched me.

Greetings

Philippe

33 • July 20, 2010 at 2:33 pm — Tom says:

Maybe take a look at http://www.spider-trap.de/en_index.html

34 • July 20, 2010 at 3:31 pm — Eric says:

It’s pretty amusing to read all the discussion about how to avoid banning GoogleBot. If they’re crawling a page that you’ve explicitly requested they ignore, then you should treat them like every other crawler. I don’t see the point of breaking your own rule in this case.

35 • July 20, 2010 at 3:50 pm — Jeff Starr says:

Eric, that’s the 2,000-pound gorilla in the room. Why bother whitelisting any search-engine? If they break the rules, ban them. Right? Unfortunately, Google owns the Web, so they pretty much decide what it is exactly that they will and won’t do. Sucks, but true.

36 • July 21, 2010 at 12:26 am — Lazza says:

If you read the url Tom has posted, you will notice that Google simply wants to be called explicitly and not via the wildcard. That way it obeys the robots.txt, according to the author of that site. :)

37 • July 22, 2010 at 3:07 pm — RS says:

So, seems to work, but seeing a couple errors in my apache logs.

[Thu Jul 22 13:23:01 2010] [error] [client 119.63.198.97] PHP Notice: Undefined variable: buffer in /var/www/htdocs/blackhole/index.php on line 78
[Thu Jul 22 13:23:01 2010] [error] [client 119.63.198.97] PHP Notice: Undefined variable: extra in /var/www/htdocs/blackhole/index.php on line 98

Also seeing occasionally:

[Thu Jul 22 13:56:21 2010] [error] [client 64.40.121.187] PHP Notice: Undefined variable: nextServer in /var/www/htdocs/blackhole/index.php on line 91

The first one looks like it’s just because you’re trying to append to a variable that isn’t defined in the first place…

The second… is just because extra isn’t being defined all the time.

The last one… just needs to be something more like if(isset($nextServer)) { because if returns an error on unset variables. if($variable) is not kosher for a while now.

38 • August 3, 2010 at 7:48 pm — Frank says:

Getting a weird return from the ARIN WHOIS lookup.

Query terms are ambiguous. The query is assumed to be:
“n XXX.XXX.XXX.XXX”

Use “?” to get help.

The following results may also be obtained via:
http://whois.arin.net/rest/nets;q=XXX.XXX.XXX.XXX?showDetails=true&showARIN=false

I’ve been through it several times now, and tried several fixes, but ARIN continues to find an ‘n’ character (presumably from a ‘\n’) on the front of the IP. Can’t pinpoint where it’s getting that, or how to fix it.

39 • August 3, 2010 at 9:45 pm — Jeff Starr says:

Frank, I’m also trying to resolve that issue, but so far without success. I’m not sure if there is anything that can be done from within the script.

If anyone has further info on this it would be appreciated! :)

40 • August 4, 2010 at 5:57 am — darrinb says:

ARIN Changed its protocol for directory lookups which is why you’re seeing that weird message. They recommend changing to their new RESTful protocol.

You can read more about it here:
https://www.arin.net/resources/whoisrws/index.html

In my WP plugin version of this script (in process) I’m looking into switching to the RESTful query protocol they recommend and styling the returned XML.

41 • August 4, 2010 at 6:40 am — Frank says:

@RS: The two PHP Notices you’re getting are an easy fix. Make a new line after global $msg, $target; and add the following 2 lines:

$buffer = '';
$nextServer = false;

@ Jeff and darrinb: Thanks for the backup. Based on my tests I was pretty certain the problem existed outside the Blackhole code.

42 • August 6, 2010 at 1:42 am — ノートパソコン比較 says:

Okay this might sound stupid to most of you, but why should I ban bad-bots? Obviously they access content I don’t want them to access, but banning them from the site would have pretty bad consequences (google).

I know nothing about this topic, but wouldn’t it be a good idea to blacklist the bots who went into the trap, not from the domain, instead update the .htaccess file to block them only from the pages I don’t want those bots to crawl?

Well I really don’t know a thing about .htaccess or blocking bots, so this might be a rather silly idea.

43 • August 6, 2010 at 1:45 am — Lazza says:

One reason: bandwidth.

44 • August 7, 2010 at 6:54 pm — Jeff Starr says:

@ノートパソコン比較:

Great questions.

In addition to bandwidth, blocking bad bots helps conserve server resources, which are a commodity on non-shared environments. Also, many bad bots are malicious, so blocking them also improves the security of your site, which benefits everyone.

Custom blocking via htaccess is also a good idea. There are many (many) articles here at Perishable Press on the topic of using htaccess to protect your site (including blocking bad bots).

45 • August 9, 2010 at 4:13 am — Soren says:

You are writing that this script should be on top of the pages

So if I put this script on my header.php that will be correct?
Or do I need to put it on “single.php, page.php, archive.php etc etc”?

Just a little confused here…

Thanks
Soren

46 • August 9, 2010 at 9:24 am — Jeff Starr says:

@Soren: Yeah, the include snippet just needs to be placed at the top of your header.php file. Then, because that file is included with each page view, the Blackhole script covers your entire site.

47 • August 16, 2010 at 2:22 pm — Rune Jensen says:

I have had success putting a small form inside an HTML comment, with its action attribute pointing to the honeypot. The fields have attractive names like email, post, blog, and message. Since Google actually parses the HTML, it does not see the commented-out form and does not follow it.

Another method is to use an if statement to check whether a request accepts GZIP in Accept-Encoding. Since bots almost never have access to the compression library, they will not accept GZIP. Output the form only if GZIP is not accepted, and use the POST method. Google and other search engines never follow POST, since if they did, they would themselves be spambots.

I use a mix of these two methods. It works great and catches almost all spambots as well as harvesters. My blacklist is constantly around 30-50 malicious bots. When blocking, use 404 or (possibly) 410 status errors; do not indicate that you have accepted the request.

The blacklist that the honeypot creates is dynamic in my case, since I do not want an IP banned for life or the blacklist to grow too long. Generally, unban an IP if it has not made a request for some time, and keep the list to a maximum of around 70 IPs. In my experience this is enough even for large attacks.
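
A self-pruning blacklist along those lines might look like the sketch below. The function name, the in-memory array of IP to last-request timestamps, and the one-day default are assumptions; only the cap of roughly 70 entries comes from the comment.

```php
<?php
// Hypothetical helper: $lastSeen maps IP => Unix time of its last request.
function prune_blacklist(array $lastSeen, int $now, int $maxAge = 86400, int $maxEntries = 70): array
{
    // Drop IPs that have been quiet for longer than $maxAge seconds.
    $fresh = array_filter($lastSeen, function ($seen) use ($now, $maxAge) {
        return ($now - $seen) <= $maxAge;
    });

    // Sort by timestamp, most recent first, keeping the IP keys.
    arsort($fresh);

    // Cap the list size, again preserving keys.
    return array_slice($fresh, 0, $maxEntries, true);
}
```

Running this on every write keeps the list short, at the cost of letting a patient bot back in once it has been idle long enough.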

48 • August 16, 2010 at 4:35 pm — Rune Jensen says:

Some additional tips that might or might not be useful. These are more advanced and focused on blog spambots more than the blackhole, but they are quite effective, well-known techniques:

You might want to use HTTP header information to make a “fingerprint” (e.g. an MD5 checksum) of the request. The reason is that a lot of spambots only post to the page without ever getting it. They get their information about your page from harvester bots, which scrape your site. And a lot of the time, the headers are not the same between bots.

I use user agent, accept-encoding, accept-language, accept-charset, connection, and protocol for the value. To make it page-dependent as well, I put in the URL.

Use of a stardate (look it up on Google) is effective for simulating a session without using cookies, so that you have to post within (say, for example) 4 hours of getting a page. Combine it with the fingerprint, put it in a hidden HTML input field in the form, and recalculate at post time to verify that the fingerprint matches and that stardatePOST-stardateGET<timelimit.
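
A rough sketch of the fingerprint-plus-time-limit idea follows. Several details are assumptions rather than the commenter's implementation: a plain Unix timestamp stands in for the stardate, the token is signed with an HMAC and a placeholder secret so the hidden field cannot simply be rewritten, and all function names are hypothetical.

```php
<?php
// MD5 fingerprint over the headers named in the comment, plus the URL.
function request_fingerprint(array $server, string $url): string
{
    $keys = ['HTTP_USER_AGENT', 'HTTP_ACCEPT_ENCODING', 'HTTP_ACCEPT_LANGUAGE',
             'HTTP_ACCEPT_CHARSET', 'HTTP_CONNECTION', 'SERVER_PROTOCOL'];
    $parts = [];
    foreach ($keys as $k) {
        $parts[] = $server[$k] ?? ''; // missing headers count as empty
    }
    $parts[] = $url;
    return md5(implode('|', $parts));
}

// Value for the hidden form field: "issuedAt:signature".
function make_form_token(string $fingerprint, int $issuedAt, string $secret): string
{
    return $issuedAt . ':' . hash_hmac('sha256', $fingerprint . '|' . $issuedAt, $secret);
}

// At POST time: recompute and require (now - issuedAt) < limit.
function check_form_token(string $token, string $fingerprint, int $now, int $limit, string $secret): bool
{
    $pieces = explode(':', $token, 2);
    if (count($pieces) !== 2 || !preg_match('/^\d+$/', $pieces[0])) {
        return false;
    }
    $issuedAt = (int) $pieces[0];
    if ($now < $issuedAt || ($now - $issuedAt) >= $limit) {
        return false;
    }
    return hash_equals(make_form_token($fingerprint, $issuedAt, $secret), $token);
}
```

On GET, the page would embed `make_form_token(request_fingerprint($_SERVER, $url), time(), $secret)`; on POST, it would recompute the fingerprint from the posting request and call `check_form_token()` with a limit such as 4 * 3600.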

49 • August 17, 2010 at 7:26 pm — Mr. HAW says:

Thank you! Here is how I implemented it into my MODx site: http://modxcms.com/forums/index.php/topic,40576.msg307614.html#msg307614

50 • September 1, 2010 at 12:36 am — Boyd says:

I’ve been getting a lot of spam at my wordpress blog at http://stilen.net

So I installed the SI CAPTCHA Anti-Spam plugin. Fine, no more spam comments, but still, the spam bots ruin my web statistics!

Are there no plugins available for a more easy install of the black hole?

51 • September 1, 2010 at 6:21 am — darrinb says:

@Boyd: I’m actually working on a plugin version for WP, and hope to be done soon. I just started a new project, so I’ve been slammed, but I’m hoping to wrap up the plugin within the next week or so.

52 • September 1, 2010 at 10:51 am — Boyd says:

darrinb: fantastic. I’m looking forward to it.



Trackbacks / Pingbacks

  1. Twitter Trackbacks for Easy PHP Blackhole Trap with WHOIS Lookup for Bad Bots • Perishable Press [perishablepress.com] on Topsy.com
  2. Easy PHP Blackhole Trap with WHOIS Lookup for Bad Bots • Perishable Press « Netcrema – creme de la social news via digg + delicious + stumpleupon + reddit
  3. WordPress Plugins for Security and Spam Prevention | BloggingBlogging
  4. Pimp your wp-config.php | Digging into WordPress
