The Perishable Press 4G Blacklist
At last! After many months of collecting data, crafting directives, and testing results, I am thrilled to announce the release of the 4G Blacklist! The 4G Blacklist is a next-generation protective firewall that secures your site against a wide range of automated attacks and other malicious activity.
Like its 3G predecessor, the 4G Blacklist is designed for use on Apache servers and is easily implemented via HTAccess or the httpd.conf configuration file. In order to function properly, the 4G Blacklist requires two specific Apache modules, mod_rewrite and mod_alias. As with the third generation of the blacklist, the 4G Blacklist consists of multiple parts:
- HTAccess Essentials
- Request-Method Filtering
- IP Address Blacklist
- Query-String Blacklist
- URL Blacklist
Each of these methods is designed to protect different aspects of your site. They may be used independently, mixed and matched, or combined to create the complete 4G Blacklist. This modularity provides flexibility for different implementations while facilitating the testing and updating process. The core of the 4G Blacklist consists of the last two methods, the Query-String and URL Blacklists. These two sections provide an enormous amount of protection against many potentially devastating attacks. Everything else is just icing on the cake. Speaking of which, there are also two more completely optional sections of the 4G Blacklist, namely:
- The Ultimate User-Agent Blacklist
- The Ultimate Referrer Blacklist
These two sections have been removed from the 4G Blacklist and relegated to “optional” status because they are no longer necessary. Put simply, the 4G Blacklist provides better protection with fewer lines of code. Even so, each of these blacklists has been updated with hundreds of new directives and will be made available here at Perishable Press in the near future. But for now, let’s return to the business at hand…
Presenting the Perishable Press 4G Blacklist
As is custom here at Perishable Press, I present the complete code first, and then walk through the usage instructions and code explanations. So, without further ado, here is the much-anticipated 4G Blacklist [for personal use only – may not be posted elsewhere without proper link attribution]:
### PERISHABLE PRESS 4G BLACKLIST ###
# ESSENTIALS
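# enable the rewrite engine, hide the server signature, disable directory listings, and allow symlinks (required for mod_rewrite)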
RewriteEngine on
ServerSignature Off
Options All -Indexes
Options +FollowSymLinks
# FILTER REQUEST METHODS
<IfModule mod_rewrite.c>
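# return 403 Forbidden for TRACE, DELETE, and TRACK request methods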
RewriteCond %{REQUEST_METHOD} ^(TRACE|DELETE|TRACK) [NC]
RewriteRule ^(.*)$ - [F,L]
</IfModule>
# BLACKLIST CANDIDATES
<Limit GET POST PUT>
Order Allow,Deny
Allow from all
Deny from 75.126.85.215 "# blacklist candidate 2008-01-02 = admin-ajax.php attack "
Deny from 128.111.48.138 "# blacklist candidate 2008-02-10 = cryptic character strings "
Deny from 87.248.163.54 "# blacklist candidate 2008-03-09 = block administrative attacks "
Deny from 84.122.143.99 "# blacklist candidate 2008-04-27 = block clam store loser "
Deny from 210.210.119.145 "# blacklist candidate 2008-05-31 = block _vpi.xml attacks "
Deny from 66.74.199.125 "# blacklist candidate 2008-10-19 = block mindless spider running "
Deny from 203.55.231.100 "# 1048 attacks in 60 minutes"
Deny from 24.19.202.10 "# 1629 attacks in 90 minutes"
</Limit>
# QUERY STRING EXPLOITS
<IfModule mod_rewrite.c>
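# return 403 Forbidden when the query string matches any of the following patterns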
RewriteCond %{QUERY_STRING} \.\.\/ [NC,OR]
RewriteCond %{QUERY_STRING} boot\.ini [NC,OR]
RewriteCond %{QUERY_STRING} tag\= [NC,OR]
RewriteCond %{QUERY_STRING} ftp\: [NC,OR]
RewriteCond %{QUERY_STRING} http\: [NC,OR]
RewriteCond %{QUERY_STRING} https\: [NC,OR]
RewriteCond %{QUERY_STRING} mosConfig [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(\[|\]|\(|\)|<|>|'|"|;|\?|\*).* [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(%22|%27|%3C|%3E|%5C|%7B|%7C).* [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(%0|%A|%B|%C|%D|%E|%F|127\.0).* [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(globals|encode|config|localhost|loopback).* [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(request|select|insert|union|declare|drop).* [NC]
RewriteRule ^(.*)$ - [F,L]
</IfModule>
# CHARACTER STRINGS
<IfModule mod_alias.c>
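# each RedirectMatch below returns 403 Forbidden when the requested URL contains the pattern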
# BASIC CHARACTERS
RedirectMatch 403 \,
RedirectMatch 403 \:
RedirectMatch 403 \;
RedirectMatch 403 \=
RedirectMatch 403 \@
RedirectMatch 403 \[
RedirectMatch 403 \]
RedirectMatch 403 \^
RedirectMatch 403 \`
RedirectMatch 403 \{
RedirectMatch 403 \}
RedirectMatch 403 \~
RedirectMatch 403 \"
RedirectMatch 403 \$
RedirectMatch 403 \<
RedirectMatch 403 \>
RedirectMatch 403 \|
RedirectMatch 403 \.\.
RedirectMatch 403 \/\/
RedirectMatch 403 \%0
RedirectMatch 403 \%A
RedirectMatch 403 \%B
RedirectMatch 403 \%C
RedirectMatch 403 \%D
RedirectMatch 403 \%E
RedirectMatch 403 \%F
RedirectMatch 403 \%22
RedirectMatch 403 \%27
RedirectMatch 403 \%28
RedirectMatch 403 \%29
RedirectMatch 403 \%3C
RedirectMatch 403 \%3E
RedirectMatch 403 \%3F
RedirectMatch 403 \%5B
RedirectMatch 403 \%5C
RedirectMatch 403 \%5D
RedirectMatch 403 \%7B
RedirectMatch 403 \%7C
RedirectMatch 403 \%7D
# COMMON PATTERNS
RedirectMatch 403 \_vpi
RedirectMatch 403 \.inc
RedirectMatch 403 xAou6
RedirectMatch 403 db\_name
RedirectMatch 403 select\(
RedirectMatch 403 convert\(
RedirectMatch 403 \/query\/
RedirectMatch 403 ImpEvData
RedirectMatch 403 \.XMLHTTP
RedirectMatch 403 proxydeny
RedirectMatch 403 function\.
RedirectMatch 403 remoteFile
RedirectMatch 403 servername
RedirectMatch 403 \&rptmode\=
RedirectMatch 403 sys\_cpanel
RedirectMatch 403 db\_connect
RedirectMatch 403 doeditconfig
RedirectMatch 403 check\_proxy
RedirectMatch 403 system\_user
RedirectMatch 403 \/\(null\)\/
RedirectMatch 403 clientrequest
RedirectMatch 403 option\_value
RedirectMatch 403 ref\.outcontrol
# SPECIFIC EXPLOITS
RedirectMatch 403 errors\.
RedirectMatch 403 config\.
RedirectMatch 403 include\.
RedirectMatch 403 display\.
RedirectMatch 403 register\.
RedirectMatch 403 password\.
RedirectMatch 403 maincore\.
RedirectMatch 403 authorize\.
RedirectMatch 403 macromates\.
RedirectMatch 403 head\_auth\.
RedirectMatch 403 submit\_links\.
RedirectMatch 403 change\_action\.
RedirectMatch 403 com\_facileforms\/
RedirectMatch 403 admin\_db\_utilities\.
RedirectMatch 403 admin\.webring\.docs\.
RedirectMatch 403 Table\/Latest\/index\.
</IfModule>
That’s the juice right there. This 4G Blacklist is some powerful stuff, blocking and filtering a wide range of potential attacks and eliminating tons of malicious nonsense. Much care has been taken to beta test this firewall on multiple configurations running various types of software; however, due to my limited financial resources, it was impossible to test the 4G as comprehensively as I would have preferred. Even so, for the average site running typical software, everything should continue to work perfectly. With that in mind, please read through the remainder of the article before implementing the 4G Blacklist.
Installation and Usage
Before implementing the 4G Blacklist, ensure that you are equipped with the following system requirements:
- Linux server running Apache
- Enabled Apache module: mod_alias
- Enabled Apache module: mod_rewrite
- Ability to edit your site’s root HTAccess file (or)
- Ability to modify Apache’s server configuration file
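If either module is not already enabled on your server, it can usually be switched on via LoadModule lines in the Apache configuration — a minimal sketch, noting that the module file paths vary by distribution (shared-hosting users will need to ask their host):
LoadModule alias_module modules/mod_alias.so
LoadModule rewrite_module modules/mod_rewrite.so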
With these requirements met, copy and paste the entire 4G Blacklist into either the root HTAccess file or the server configuration file (httpd.conf). After uploading, visit your site and check that as many different types of pages as possible load properly. For example, if you are running a blogging platform (such as WordPress), test different page views (single, archive, category, home, etc.), log into and surf the admin pages (plugins, themes, options, posts, etc.), and also check peripheral elements such as individual images, available downloads, and alternate protocols (FTP, HTTPS, etc.).
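If you opt for the server configuration file rather than HTAccess, one approach is to scope the blacklist to your site by wrapping it in a Directory container — a hedged sketch, assuming a hypothetical document root:
<Directory "/var/www/example-site">
# paste the entire 4G Blacklist here
</Directory>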
While the 4G Blacklist is designed to target only the bad guys, the regular expressions used in the list may interfere with legitimate URL or file access. If the directives in the blacklist are blocking a specific URL, the browsing device will display a 403 Forbidden error; similarly, if the blacklist happens to block a file or resource required for some script to function properly, the script (JavaScript, PHP, etc.) may simply stop working. If you experience either of these scenarios after installing the blacklist, don’t panic! Simply check the blocked URL or file, locate the matching blacklist string, and disable the directive by placing a pound sign (#) at the beginning of the associated line. Once the correct line is commented out, the blocked URL should load normally. Also, if you do happen to experience any conflicts involving the 4G Blacklist, please leave a comment or contact me directly.
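For example, if legitimate URLs containing the “.inc” string were being blocked, the associated directive would be disabled like so:
# RedirectMatch 403 \.inc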
Set for Stun
As my readers know, I am serious about site security. Nothing gets my juices flowing like the thought of chopping up mindless cracker whores into small, square chunks and feeding their still-twitching flesh to a pack of starving mongrels. That’s good times, but unfortunately there are probably laws against stuff like that. So in the meantime, we take steps to secure our sites using the most effective tools at our disposal. There is no one single magic bullet that will keep the unscrupulous bastards from exploiting and damaging your site, but there are many cumulative steps that may be taken to form a solid security strategy. Within this cumulative context, the 4G Blacklist recognizes and immunizes against a broad array of common attack elements, thereby maximizing resources while providing solid defense against malicious attacks.
Many Thanks
A huge “Thank You” to the dedicated people who helped beta test the 4G Blacklist. Your excellent feedback played an instrumental role in the development of this version. Thank you!
Further Reading
For more insight into the mysterious realms of blacklisting, the creation of the Perishable Press Blacklist, and DIY site security in general, check out some of my other articles:
- Eight Ways to Blacklist with Apache’s mod_rewrite
- Blacklist Candidate Series Summary
- How to Block Proxy Servers via htaccess
- 2G Blacklist: Closing the Door on Malicious Attacks
- Series Summary: Building the 3G Blacklist
- Building the Perishable Press 4G Blacklist
Next Up
Next up in the March 2009 Blacklist Series: The Ultimate User-Agent Blacklist. Don’t miss it!
Updates
Since the release of the 4G Blacklist, several users have discovered issues with the following 4G directives.
Joomla
In the query-string section, Joomla users should delete the following patterns (the edited rules are sketched after the list):
request
config
[
]
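With those patterns removed, the affected query-string conditions would read as follows (a sketch of the edited lines):
RewriteCond %{QUERY_STRING} ^.*(\(|\)|<|>|'|"|;|\?|\*).* [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(globals|encode|localhost|loopback).* [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(select|insert|union|declare|drop).* [NC]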
In the character-string section, Joomla users should comment out or delete the following lines (shown commented out below):
RedirectMatch 403 \,
RedirectMatch 403 \;
RedirectMatch 403 config\.
RedirectMatch 403 register\.
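Commented out, those lines would look like this:
# RedirectMatch 403 \,
# RedirectMatch 403 \;
# RedirectMatch 403 config\.
# RedirectMatch 403 register\.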
WordPress
In the query-string section of the 4G Blacklist, the following change has been made:
"%3D" character-string has been changed to "%5C"
Likewise, in the character-string section, the following change has been made:
"wp\_" character-string has been removed
And in the request-method filtering section, the following change has been made:
"HEAD" method has been removed
Also, the following changes may be necessary depending on which plugins you have installed (an example follows the list):
Ozh' Admin Drop Down Menu - remove "drop" from the query-string rules
WordPress' Akismet - remove "config" from the query-string rules
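For example, with both plugins installed, the two affected query-string conditions would become (a sketch):
RewriteCond %{QUERY_STRING} ^.*(globals|encode|localhost|loopback).* [NC,OR]
RewriteCond %{QUERY_STRING} ^.*(request|select|insert|union|declare).* [NC]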
OpenID
OpenID users should read the information in this comment.
SMF
SMF users should read the information in this comment.
vBulletin
vBulletin users should read the information in these comments.
233 responses to “The Perishable Press 4G Blacklist”
Igor,
Okay, I understand now. Do appreciate it.
This is what I get.
Access Forbidden
Access denied. Please click on the back button to return to the former page.
Igor: so .htaccess files are only obeyed at the level closest to the resource being accessed, and not all the way up the chain?
My understanding was that .htaccess parsing/obedience started at the lowest directory level (i.e. closest to the filesystem root) containing such a file, and if that didn’t block or redirect the traffic, parsing continued at the next-higher .htaccess file in the hierarchy.
So I am incorrect?
Garrett, you are correct! I had erred on the side of caution and copied some of the root .htaccess file’s security elements to another .htaccess lurking within a subdirectory.
But I just did a little experiment and discovered that this was unnecessary! The root’s .htaccess parameters are indeed inherited when not specifically overridden by the local .htaccess.
Now one of my .htaccess files is much shorter than before as a result of this new knowledge.
Thank you, Garrett.
Boys! Stop! 4G is a block of server (Apache) instructions to ADD to .htaccess, which can include other instructions such as redirects. 4G should just be in the .htaccess file at the root, where it will protect the whole website. Any .htaccess in a lower-level directory only affects that directory and any directories below it (it does not cancel the main .htaccess – it adds to it for that directory only). Whenever any address on your website is requested (either from outside or via links on your pages), the instructions in .htaccess files are read by Apache before doing anything, so this introduces a delay, and it is not necessary (or wise) to have multiple 4Gs.
Hey guys, how do I ban an unknown IP address that shows up in the logs as “.” (just a dot)? My host does not always provide an IP address. I have searched on Google for an answer to this literally for hours, and nobody seems to know anything. The only advice I found was to contact my hosting company. I’d rather just ban unknown IP addresses in .htaccess and be done with it. Has anyone else encountered this issue?
Igor, if you see a hostname in your logs, you can go here and get the IP:
http://www.hcidata.info/host2ip.cgi
Or if it’s a User Agent you want to ban, use that if it’s shown, e.g.:
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC]
RewriteRule ^.* - [F,L]
(in this example, any UA with ‘nutch’ anywhere in it will be banned, because there is no ‘^’ before ‘nutch’). Don’t forget the ‘NC’.
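For contrast, an anchored pattern would ban only user agents that begin with the string — a sketch along the same lines:
RewriteCond %{HTTP_USER_AGENT} ^nutch [NC]
RewriteRule ^.* - [F,L]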
Thanks Dave. I do indeed have many user agent bans. Jeff suggested plenty of good ones, and I used his list as the beginning of my own. Here’s what I was referring to in my comment, taken from my server log:
. - - [30/Apr/2010:08:28:36 -0600] "GET /messageboard/register.shtml HTTP/1.0" 200 6321 "http://www.google.com/" "Mozilla/4.7 (compatible; OffByOne; Windows 2000) Webster Pro V3.4"
As you can see, no IP address is specified. Where you see a “.” at the beginning of the line, there should be either an IP address or else a hostname. Here it is again:
. - - [30/Apr/2010:23:14:33 -0600] "GET /messageboard/login.shtml HTTP/1.0" 200 39966 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; Alexa Toolbar; (R1 1.5))"
Another case is shown above. This one has a different User-Agent; thus it would not appear possible to ban by User-Agent alone. My question is, how is it possible that the IP address does not appear in the log (only a “.”)? Based upon the activity, it appears to me that this is a spambot attempting to post spam. I have googled this issue before, but have not found a method of banning bots that do not reveal their origin IP address.
Igor, it’s not for me to rule, but I suggest that your requests for help concerning the banning of specific IPs and UAs should be directed to something like http://www.webmasterworld.com/. Jeff’s 4G Blacklist does include a few of these, but only to supplement the main function of the blacklist, which is to catch morons in a more general and distinctive way. However, here are a few more comments from me, for what they’re worth, that may help you:
I don’t know why an IP (or Hostname, which is just as good) is not shown sometimes…
I have cPanel and I look at ‘Latest Visitors’ each day (can’t be arsed to plough through server logs) to check for suspicious behaviour. Each entry shows me:
- ‘Host’, which may be either an IP or a Hostname, a mixture of the two, or nothing.
- The file requested, with its response code (200, 301, 403, 404, 410, etc.).
- ‘Referrer’, which will be blank if it’s from a UA, or the file on your website if it’s an internal link.
- ‘Agent’, which can be either User Agent details or a surfer’s browser details.
Then it’s all detective work and common sense, deciding if any should be banned:
If an IP or Hostname is shown, check it out at http://www.hcidata.info/host2ip.cgi and do a reverse lookup there (that should give you a clue). And if an IP is shown, put it in your address bar to see if it’s a website (another clue). If neither is shown, move on…
If it’s a file that should normally only be requested by a link on your own website, any referrer should be a file on your website (duh!); legitimate direct requests (‘Referrer’ blank) can be made by legitimate Search Engines (UAs) that you have not banned (Google, Slurp, MSN, etc.). Of course, an exception is where someone has used a ‘Favourites’ link in their browser.
Agent? This should tell you if it’s Google, Slurp, MSN, etc., or an unknown UA; otherwise it will be a string representing someone’s browser. If a URL is included, click on it to see who it is. If you can make out the UA name, check it out at http://browsers.garykeith.com/tools/agent-checker.asp. This seems to be a good checker, based on a lot of research – it tells you if it’s banned or if it’s just a browser.
There may be other clues from behaviour on your website, such as whether someone is fishing using rubbish URLs or just grabbing images, CSS files, and JS files. It’s a big subject, which Jeff has studied to make 4G, and there will always be clever twerps one step ahead of us (hiding their IP/Hostname!).
Further to my last post, I’ve looked at Webster Pro – this UA needs to be banned as a spam harvester (I suggest ‘Webster’ is enough – they use both Webster.Pro and Webster Pro). Alexa Toolbar; (R1 1.5) is from an IE6 browser and apparently not necessary to ban, but interestingly it’s the one looking at ‘login.shtml’! Maybe they used Webster to harvest an email address, then tested it via a browser. I use captchas in contact forms, email validators in those and in registration forms, and I obfuscate email links, to minimise the risk of being spambotted. Hope that helps.
@Igor: Have you tried blocking the “dot” (“.”) IP address? Something like this may work:
RewriteCond %{REMOTE_ADDR} ^\.$ [NC]
RewriteRule .* - [F,L]
If that doesn’t work, you may try just blocking any IP address with any single character:
RewriteCond %{REMOTE_ADDR} ^.$ [NC]
RewriteRule .* - [F,L]
I would also consider adding a line for empty IP address requests:
RewriteCond %{REMOTE_ADDR} ^.$ [NC,OR]
RewriteCond %{REMOTE_ADDR} ^$ [NC]
RewriteRule .* - [F,L]
Note that I haven’t tested these – but they seem logically sound.
Let me know if they work/don’t work, or if I am missing something obvious.
Edited to replace the [OR] flags with [NC] flags.
Thanks Jeff! I tried your third variation:
# FILTER REMOTES WITH NO IP ADDRESS
RewriteCond %{REMOTE_ADDR} ^.$ [NC,OR]
RewriteCond %{REMOTE_ADDR} ^$ [NC]
RewriteRule .* - [F,L]
It did not cause any unexpected surprises. ;-) I’ll dig into my logs again next week and see whether Mr. Doesn’t-wanna-reveal-his-IP-address gets the 403 he deserves.
Dave: thanks for the info. I did check webmasterworld, and in fact they discussed this very issue a couple of times but didn’t offer any .htaccess solution. Thank goodness for Perishable Press! As for User-Agents, often they are forged, so a bot using a certain user-agent does not mean the user-agent is bad. I had googled Webster before and found a mixed verdict (once I got past all the references to the *Dictionary*), but will heed your advice.
I watch AWSTATS but like digging through the logs to get the real story. You never know what you may find. Much of the traffic AWSTATS claims is human is really bot traffic. I was amazed by how many bots are hitting us all the time disguised as humans. Most of them go straight for the forums, trying to register and login with one second elapsing between attempts, so they are easy to spot. I have had some success dosing the dumber spambots with spam poison.