
HTAccess Spring Cleaning 2009

Just like last year, this Spring I have been taking some time to do some general maintenance here at Perishable Press. This includes everything from fixing broken links and resolving errors to optimizing scripts and eliminating unnecessary plugins. I’ll admit, this type of work is often quite dull, however I always enjoy the process of cleaning up my HTAccess files. In this post, I share some of the changes made to my HTAccess files and explain the reasoning behind each modification. Some of the changes may surprise you! ;)

Optimizing a few rewrite rules

Here are some meditations for optimizing useful rewrite rules.

Improving robots.txt redirection

These changes were made in the HTAccess file for my WordPress subdirectory “/press/”. First, I removed the following robots.txt rewrite rules:

<IfModule mod_rewrite.c>
	RewriteCond %{REQUEST_URI} !^/robots\.txt [NC]
	RewriteCond %{REQUEST_URI} robots\.txt [NC]
	RewriteRule (.*) https://perishablepress.com/robots.txt [R=301,L]
</IfModule>

This code is now replaced with the following, more elegant directive:

RedirectMatch 301 robots\.txt https://perishablepress.com/robots.txt


Thanks to a modification by Webrocker, this directive now works when placed in the HTAccess file of the root directory:

RedirectMatch 301 ^/(.*)/robots\.txt https://perishablepress.com/robots.txt

This rule is perfect for redirecting the hundreds of daily requests for misplaced robots.txt files such as the following:


…ad nauseam. This sort of redundant scanning for nonexistent files consumes valuable resources and wastes bandwidth. Nice to know that a single line of HTAccess eliminates the confusion once and for all.
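To see why the root-directory rule catches misplaced requests without looping, it helps to look at the pattern as a plain regular expression. Here is a quick sketch in Python (the request paths are hypothetical examples):

```python
import re

# The pattern from the root-directory rule, treated as a plain regex.
pattern = re.compile(r"^/(.*)/robots\.txt")

# Misplaced requests for robots.txt in subdirectories are caught:
assert pattern.search("/press/robots.txt")
assert pattern.search("/wp/blog/robots.txt")

# The canonical /robots.txt itself is NOT matched, because the pattern
# requires a directory segment between the domain and the file name --
# which is why the redirect cannot loop:
assert pattern.search("/robots.txt") is None
```

The non-matching canonical path is the whole trick: the redirect target falls outside the pattern, so Apache never re-redirects its own destination.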

Improving favicon.ico redirection

Similar to the previous robots.txt directives, this chunk of code was also removed from my /press/ subdirectory:

<IfModule mod_rewrite.c>
	RewriteCond %{REQUEST_URI} !^/favicon\.ico [NC]
	RewriteCond %{REQUEST_URI} favicon\.ico [NC]
	RewriteRule (.*) https://perishablepress.com/favicon.ico [R=301,L]
</IfModule>

While that method is certainly effective at redirecting those ridiculous favicon requests, I have since developed a far more efficient technique:

RedirectMatch 301 favicon\.ico https://perishablepress.com/favicon.ico
RedirectMatch 301 favicon\.gif https://perishablepress.com/favicon.ico


Thanks to a modification by Webrocker and a bit of consolidation by Louis, these directives may be merged into a single rule that works even when placed in the HTAccess file of the root directory:

RedirectMatch 301 ^/(.*)/favicon\.(ico|gif) https://perishablepress.com/favicon.ico

Here, a single directive handles the similarly annoying requests for misplaced favicon.ico and favicon.gif files. It’s just more pathetic exploit scanning by clueless script kiddies, but this method works perfectly for stopping the desperation.
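The alternation in that combined rule is what lets one line replace the earlier pair of directives. A minimal Python sketch of the pattern’s behavior (the example paths are hypothetical):

```python
import re

# The combined rule's pattern, treated as a plain regex.
# The alternation (ico|gif) lets one rule cover both file extensions.
pattern = re.compile(r"^/(.*)/favicon\.(ico|gif)")

# Hypothetical misplaced requests, both caught by the single rule:
assert pattern.search("/press/favicon.ico")
assert pattern.search("/wp-content/favicon.gif")

# The canonical /favicon.ico is not matched, so the redirect cannot loop:
assert pattern.search("/favicon.ico") is None
```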

Dropping the hotlink protection

This one may surprise the die-hard anti-hotlinkers out there, but I think it’s for the best. For years, I had been using the following technique for hotlink protection (in both /press/ subdirectory and server root directory):

<IfModule mod_rewrite.c>
	RewriteCond %{HTTP_REFERER}     !^$
	RewriteCond %{REQUEST_FILENAME} -f
	RewriteCond %{REQUEST_FILENAME} \.(gif|jpg|jpeg|png|bmp|tiff?|js|css|zip|mp3|wmv|mpe?g|swf)$ [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?perishablepress\.                       [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?moseslakeforum\.                        [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?deadletterart\.                         [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?augustklotz\.                           [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?perishable\.                            [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?monzilla\.                              [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?mindfeed\.                              [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?feedburner\.                            [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?planetozh\.                             [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?netvibes\.                              [NC]
	RewriteCond %{HTTP_REFERER}     !^https?://([^.]+\.)?google\.                                [NC]
	RewriteRule .*\.(gif|jpg|jpeg|png|bmp|tiff?|js|css|zip|mp3|wmv|mpe?g|swf)$ https://perishablepress.com/hotlink.jpe [R,NC,L]
	#RewriteRule .*\.(gif|jpg|jpeg|png|bmp|tiff?|js|css|zip|mp3|wmv|mpe?g|swf)$ - [F,NC,L]
</IfModule>

And it worked great — never had a problem with anyone hotlinking my images. The funny thing is that, given the types of peripheral imagery and informational diagrams that I use here at Perishable Press, I probably wouldn’t have had any hotlinking problems in the first place. Sure, if I were posting killer pix of hot babes and fast cars, then the anti-hotlinking rules would be mandatory. But I’m not, and the esoteric little deco graphics and design diagrams just aren’t worth the extra processing required by the aforementioned set of anti-hotlinking directives. Besides, I keep a close eye on my access and error logs, so if someone does start hotlinking heavily, I am well-equipped to deal with it. ;)

Centralizing the 4G Blacklist

Also removed from my /press/ subdirectory is the 4G Blacklist. Over the course of the blacklist development process, my various domains had accumulated a disparate collection of blacklist directives. So, during this round of HTAccess spring cleaning, I removed the differently versioned blacklists from many different domains and subdirectories and consolidated everything into a single, omnipotent blacklist in the root directory of my server. Now, the directives are applied across all of my sites from a single, easy-to-update location.

One part of the removed blacklist that wasn’t released with the latest version of the 4G Blacklist is the extended collection of blocked IP addresses:

Important! Newer versions of Apache no longer support “same-line” comments, as used in the following code snippet. For example, appending a comment such as "# 57 spam attempts" to a directive will result in a 500 server error on newer versions of Apache. Instead, place each comment on its own line, beginning with a pound sign (#) and without the wrapping quotes.
	Order Allow,Deny
	Allow from all
	Deny from   "# blacklist candidate 2008-01-02"
	Deny from  "# blacklist candidate 2008-02-10"
	Deny from   "# blacklist candidate 2008-03-09"
	Deny from   "# blacklist candidate 2008-04-27"
	Deny from "# blacklist candidate 2008-05-31"
	Deny from   "# blacklist candidate 2008-10-19"
	Deny from   "# 163 hits in 44 minutes"
	Deny from  "# 101 hits in 120 minutes"
	Deny from    "# 93 hits in 15 minutes"
	Deny from    "# quintessential images"
	Deny from    "# 1629 attacks in 90 minutes"
	Deny from  "# 1048 hits in 60 minutes"
	Deny from   "# 166 hits in 45 minutes"
	Deny from   "# 75 hits in 30 minutes"
	Deny from    "# relentless spammer"
	Deny from     "# 31 charcode hits"
	Deny from  "# 124 bg image hits"
	Deny from    "# 57 spam attempts"
	Deny from   "# relentless spammer"
	Deny from    "# relentless spammer"
	Deny from  "# relentless spammer"
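On newer versions of Apache, the same list would therefore be written with each comment on its own line, something like the following sketch (the IP addresses here are placeholders from the 192.0.2.0/24 documentation range, not the actual blacklisted addresses):

```apacheconf
Order Allow,Deny
Allow from all
# blacklist candidate 2008-01-02
Deny from 192.0.2.1
# 163 hits in 44 minutes
Deny from 192.0.2.2
# relentless spammer
Deny from 192.0.2.3
```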

I didn’t re-include these directives in the centralized root blacklist because every year or so I like to reboot my banned IP list and start fresh. Here is a similar IP-list dump from 2007.

Another part of the 4G Blacklist that was removed permanently was the “slimmed-down” version of the Ultimate User-Agent Blacklist:

SetEnvIfNoCase User-Agent ^$                keep_out
SetEnvIfNoCase User-Agent "Y\!OASIS\/TEST"  keep_out
SetEnvIfNoCase User-Agent "libwww\-perl"    keep_out
SetEnvIfNoCase User-Agent "Jakarta.Commons" keep_out
SetEnvIfNoCase User-Agent "MJ12bot"         keep_out
SetEnvIfNoCase User-Agent "Nutch"           keep_out
SetEnvIfNoCase User-Agent "cr4nk"           keep_out
SetEnvIfNoCase User-Agent "MOT\-MPx220"     keep_out
SetEnvIfNoCase User-Agent "SiteCrawler"     keep_out
SetEnvIfNoCase User-Agent "SiteSucker"      keep_out
SetEnvIfNoCase User-Agent "Doubanbot"       keep_out
SetEnvIfNoCase User-Agent "Sogou"           keep_out
<Limit GET POST>
	Order Allow,Deny
	Allow from all
	Deny from env=keep_out
</Limit>

This was my 2008-2009 personal user-agent blacklist that included only the worst of the worst offenders as manifested in my error and access logs. The list is highly effective, but has been refined even further to include only the most heinous agents:

SetEnvIfNoCase User-Agent "libwww" keep_out
SetEnvIfNoCase User-Agent "DotBot" keep_out
SetEnvIfNoCase User-Agent "Nutch"  keep_out
SetEnvIfNoCase User-Agent "cr4nk"  keep_out
<Limit GET POST>
	Order Allow,Deny
	Allow from all
	Deny from env=keep_out
</Limit>

If you only block four user agents this year, libwww, DotBot, Nutch, and cr4nk will certainly maximize your return on investment.
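SetEnvIfNoCase performs a case-insensitive regular-expression match against the request header, so any User-Agent string merely containing one of these tokens gets flagged. A rough Python model of that matching logic (the sample User-Agent strings are illustrative, not from my logs):

```python
import re

# Patterns from the slimmed-down blacklist above.
BLOCKED = [r"libwww", r"DotBot", r"Nutch", r"cr4nk"]

def is_blocked(user_agent: str) -> bool:
    # SetEnvIfNoCase is an unanchored, case-insensitive regex match,
    # so a substring hit anywhere in the header is enough.
    return any(re.search(p, user_agent, re.IGNORECASE) for p in BLOCKED)

assert is_blocked("libwww-perl/5.805")
assert is_blocked("Mozilla/5.0 (compatible; dotbot/1.1)")
assert not is_blocked("Mozilla/5.0 (Windows; U; en-US) Firefox/3.0.10")
```

Note that case-insensitivity matters in practice: agents often vary their capitalization, so “dotbot” and “DotBot” are both caught by the same pattern.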

Ready for Summer!

Over the course of the previous year, I have had the privilege of learning a great deal about Apache’s amazingly useful HTAccess directives. The most important thing I have realized when it comes to optimizing an HTAccess strategy is that the old saying, “less is more,” is absolutely true. So many HTAccess files are completely overloaded with extraneous rules and pointless directives. Hopefully, articles such as this one will help you make wise decisions concerning your own HTAccess strategy.

About the Author
Jeff Starr = Web Developer. Security Specialist. WordPress Buff.
BBQ Pro: The fastest firewall to protect your WordPress.

34 responses to “HTAccess Spring Cleaning 2009”

  1. Jeff, thanks for the spring cleaning tip, I always enjoy visiting your site, learning new things, helps me to improve my own site.

  2. Thanks for some great tips Jeff. One thing, I’ve just tried installing this on my site:

    RedirectMatch 301 robots.txt http://sltaylor.co.uk/robots.txt

    If I try to access http://sltaylor.co.uk/about/robots.txt, it gives a Redirect Loop error (finishing with http://sltaylor.co.uk/robots.txt in the address bar). I’m no .htaccess expert, but does the new code need something to match on URLs with something between the domain and “robots.txt”?

  3. Eeek! That’s meant to be a version of your new robots.txt redirect for my own site… Not sure what happened there.

  4. @Ken: My pleasure — thanks for the feedback =)

    @Steve: I added the “301” in your first comment to resolve the character-display issue. That may also be the reason for the faulty redirect, possibly related to different versions of Apache and their differing requirements for RedirectMatch. I am also going to update the article with this information.

  5. Austin Dworaczyk Wiltshire 2009/05/11 11:42 pm

    Good thing I checked these comments!

    First things first. Thanks a lot! Seriously, I love this blog, and just added a couple of your tricks to my own .htaccess.

    However, I am also receiving the same problem as Steve Taylor. I’ll check back to see if it’s resolved in a few days.

  6. Perfect, thanks Jeff. FYI I’m on Apache/2.0.54.

  7. Austin Dworaczyk Wiltshire 2009/05/12 8:16 am

    Ah, yes forgot to mention. I’m on Apache 2.2.11.

  8. Jeff Starr 2009/05/12 9:07 am

    I need a server running a newer version of Apache. I have been using 1.3.41 for eons and do not have access to anything more recent. If anyone has any ideas that 1) won’t break the bank, and 2) won’t consume my limited free time, please let me know. I would love to test these directives and see what’s up, but until I can get ahold of something more recent, it will have to wait.

    In the meantime, I assure anyone running older versions of Apache that the directives work perfectly.

  9. Austin Dworaczyk Wiltshire 2009/05/12 10:44 am

    The thing is, according to the docs, it should be working fine.


  10. Austin Dworaczyk Wiltshire 2009/05/12 11:21 am

    Oh, and is it normal for me to receive A LOT of 301 status codes in my logs after instating the favicon fix that you posted?

    Or am I just under attack?

  11. Yes, it should work fine, but apparently something is amiss. Not sure about the 301 codes. Shoot me an email with a good sampling of the log entries and I will take a look (jeff at this domain).

  12. Hi there Jeff,

    I’m no regex guru, but couldn’t the two following lines be combined in one?

    RedirectMatch 301 favicon.ico https://perishablepress.com/favicon.ico
    RedirectMatch 301 favicon.gif https://perishablepress.com/favicon.ico

    Some like RedirectMatch 301 favicon.(ico|gif) https://perishablepress.com/favicon.ico, I don’t know.

    Also, I hope you have a great summer. It’s starting to be really hot here in the South of France! Cheers :)

Comments are closed for this post. Something to add? Let me know.