HTAccess Spring Cleaning 2009
Just like last year, this Spring I have been taking some time to do some general maintenance here at Perishable Press. This includes everything from fixing broken links and resolving errors to optimizing scripts and eliminating unnecessary plugins. I’ll admit, this type of work is often quite dull, but I always enjoy the process of cleaning up my HTAccess files. In this post, I share some of the changes made to my HTAccess files and explain the reasoning behind each modification. Some of the changes may surprise you! ;)
Optimizing a few rewrite rules
Here are some meditations on optimizing a few useful rewrite rules.
Improving robots.txt redirection
These changes were made in the HTAccess file for my WordPress subdirectory “/press/”. First, I removed the following robots.txt rewrite rules:
# REDIRECT ROBOTS.TXT
<IfModule mod_rewrite.c>
RewriteCond %{REQUEST_URI} !^/robots\.txt [NC]
RewriteCond %{REQUEST_URI} robots\.txt [NC]
RewriteRule (.*) https://perishablepress.com/robots.txt [R=301,L]
</IfModule>
This code is now replaced with the following, more elegant directive:
RedirectMatch 301 robots\.txt https://perishablepress.com/robots.txt
Update
Thanks to a modification by Webrocker, this directive now works when placed in the HTAccess file of the root directory:
RedirectMatch 301 ^/(.*)/robots\.txt https://perishablepress.com/robots.txt
This rule is perfect for redirecting the hundreds of daily requests for misplaced robots.txt files, such as the following:
https://perishablepress.com/press/robots.txt
https://perishablepress.com/press/about/robots.txt
https://perishablepress.com/press/2009/05/09/robots.txt
https://perishablepress.com/press/tag/blacklist/robots.txt
…ad nauseam. This sort of redundant scanning for nonexistent files consumes valuable resources and wastes bandwidth. It is nice to know that a single line of HTAccess eliminates the confusion once and for all.
Improving favicon.ico redirection
Similar to the previous robots.txt directives, this chunk of code was also removed from my /press/ subdirectory:
# REDIRECT FAVICON.ICO
<IfModule mod_rewrite.c>
RewriteCond %{REQUEST_URI} !^/favicon\.ico [NC]
RewriteCond %{REQUEST_URI} favicon\.ico [NC]
RewriteRule (.*) https://perishablepress.com/favicon.ico [R=301,L]
</IfModule>
While that method is certainly effective at redirecting those ridiculous favicon requests, I have since developed a far more efficient technique:
# REDIRECT FAVICON.ICO & FAVICON.GIF
RedirectMatch 301 favicon\.ico https://perishablepress.com/favicon.ico
RedirectMatch 301 favicon\.gif https://perishablepress.com/favicon.ico
Update
Thanks to a modification by Webrocker and a bit of consolidation by Louis, these directives may be merged into a single rule that works even when placed in the HTAccess file of the root directory:
# REDIRECT FAVICON.ICO & FAVICON.GIF
RedirectMatch 301 ^/(.*)/favicon\.(ico|gif) https://perishablepress.com/favicon.ico
Here, I am using these directives to handle similarly annoying requests for misplaced favicon.ico and favicon.gif files. It’s just more pathetic exploit scanning by clueless script kiddies, but this method works perfectly for stopping the desperation.
Dropping the hotlink protection
This one may surprise the die-hard anti-hotlinkers out there, but I think it’s for the best. For years, I had been using the following technique for hotlink protection (in both the /press/ subdirectory and the server root directory):
# HOTLINK PROTECTION
<IfModule mod_rewrite.c>
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{REQUEST_FILENAME} -f
RewriteCond %{REQUEST_FILENAME} \.(gif|jpg|jpeg|png|bmp|tiff?|js|css|zip|mp3|wmv|mpe?g|swf)$ [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?perishablepress\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?moseslakeforum\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?deadletterart\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?augustklotz\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?perishable\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?monzilla\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?mindfeed\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?feedburner\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?planetozh\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?netvibes\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?google\. [NC]
RewriteRule .*\.(gif|jpg|jpeg|png|bmp|tiff?|js|css|zip|mp3|wmv|mpe?g|swf)$ https://perishablepress.com/hotlink.jpe [R,NC,L]
#RewriteRule .*\.(gif|jpg|jpeg|png|bmp|tiff?|js|css|zip|mp3|wmv|mpe?g|swf)$ - [F,NC,L]
</IfModule>
And it worked great — never had a problem with anyone hotlinking my images. The funny thing is that, given the types of peripheral imagery and informational diagrams that I use here at Perishable Press, I probably wouldn’t have had any hotlinking problems in the first place. Sure, if I were posting killer pix of hot babes and fast cars, then the anti-hotlinking rules would be mandatory. But I’m not, and the esoteric little deco graphics and design diagrams just aren’t worth the extra processing requirements of the aforementioned set of anti-hotlinking directives. Besides, I keep a close eye on my access and error logs, so if someone wants to wipe strong, I am well-equipped to get tough on messes. ;)
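For anyone who wants to keep hotlink protection but trim its overhead, the eleven per-domain referrer conditions above could be collapsed into a single alternation. Here is a minimal sketch of that idea, using the same domain list but returning 403 Forbidden instead of redirecting, and dropping the `-f` file check and the redundant extension precondition (the RewriteRule pattern already filters by extension). This is untested, so verify it against your own setup before relying on it:

```apache
# HOTLINK PROTECTION (leaner sketch: one referrer alternation instead of eleven conditions)
<IfModule mod_rewrite.c>
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?(perishablepress|moseslakeforum|deadletterart|augustklotz|perishable|monzilla|mindfeed|feedburner|planetozh|netvibes|google)\. [NC]
RewriteRule \.(gif|jpe?g|png|bmp|tiff?|js|css|zip|mp3|wmv|mpe?g|swf)$ - [F,NC,L]
</IfModule>
```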
Centralizing the 4G Blacklist
Also removed from my /press/
subdirectory is the 4G Blacklist. Over the course of the blacklist development process, my various domains had accumulated a disparate collection of blacklist directives. So, during this round of HTAccess spring cleaning, I removed the differently versioned blacklists from many different domains and subdirectories and consolidated everything into a single, omnipotent blacklist in the root directory of my server. Now, the directives are applied across all of my sites from a single, easy-to-update location.
One part of the removed blacklist that wasn’t released with the latest version of the 4G Blacklist is the extended collection of blocked IP addresses:
Update
Inline comments wrapped in quotes, such as "# 57 spam attempts", will result in a 500 server error on newer versions of Apache. Instead, put each comment on its own line, beginning with a pound sign #, and without the wrapping quotes.
# BLACKLIST CANDIDATES
<Limit GET POST PUT>
Order Allow,Deny
Allow from all
Deny from 75.126.85.215 "# blacklist candidate 2008-01-02"
Deny from 128.111.48.138 "# blacklist candidate 2008-02-10"
Deny from 87.248.163.54 "# blacklist candidate 2008-03-09"
Deny from 84.122.143.99 "# blacklist candidate 2008-04-27"
Deny from 210.210.119.145 "# blacklist candidate 2008-05-31"
Deny from 66.74.199.125 "# blacklist candidate 2008-10-19"
Deny from 68.226.72.159 "# 163 hits in 44 minutes"
Deny from 86.121.210.195 "# 101 hits in 120 minutes"
Deny from 80.57.69.139 "# 93 hits in 15 minutes"
Deny from 217.6.22.218 "# quintessential images"
Deny from 24.19.202.10 "# 1629 attacks in 90 minutes"
Deny from 203.55.231.100 "# 1048 hits in 60 minutes"
Deny from 77.229.156.72 "# 166 hits in 45 minutes"
Deny from 89.122.29.127 "# 75 hits in 30 minutes"
Deny from 80.206.129.3 "# relentless spammer"
Deny from 64.15.69.17 "# 31 charcode hits"
Deny from 77.103.132.126 "# 124 bg image hits"
Deny from 80.13.62.213 "# 57 spam attempts"
Deny from 91.148.84.119 "# relentless spammer"
Deny from 88.170.42.61 "# relentless spammer"
Deny from 220.181.61.231 "# relentless spammer"
</Limit>
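As an aside, the Order/Allow/Deny syntax above belongs to Apache 2.2 and earlier. On newer versions of Apache (2.4+), access control moved to mod_authz_core, and the equivalent would look something like the following sketch (shown with just two of the addresses from the list above; comments go on their own lines, as noted):

```apache
# BLACKLIST CANDIDATES (Apache 2.4+ syntax, mod_authz_core)
<RequireAll>
Require all granted
# blacklist candidate 2008-01-02
Require not ip 75.126.85.215
# 163 hits in 44 minutes
Require not ip 68.226.72.159
</RequireAll>
```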
I didn’t re-include these directives in the centralized root blacklist because every year or so I like to reboot my banned IP list and start fresh. Here is a similar IP-list dump from 2007.
Another part of the 4G Blacklist that was removed permanently was the “slimmed-down” version of the Ultimate User-Agent Blacklist:
# BLACKLISTED USER AGENTS
SetEnvIfNoCase User-Agent ^$ keep_out
SetEnvIfNoCase User-Agent "Y\!OASIS\/TEST" keep_out
SetEnvIfNoCase User-Agent "libwww\-perl" keep_out
SetEnvIfNoCase User-Agent "Jakarta.Commons" keep_out
SetEnvIfNoCase User-Agent "MJ12bot" keep_out
SetEnvIfNoCase User-Agent "Nutch" keep_out
SetEnvIfNoCase User-Agent "cr4nk" keep_out
SetEnvIfNoCase User-Agent "MOT\-MPx220" keep_out
SetEnvIfNoCase User-Agent "SiteCrawler" keep_out
SetEnvIfNoCase User-Agent "SiteSucker" keep_out
SetEnvIfNoCase User-Agent "Doubanbot" keep_out
SetEnvIfNoCase User-Agent "Sogou" keep_out
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=keep_out
</Limit>
This was my 2008-2009 personal user-agent blacklist that included only the worst of the worst offenders as manifested in my error and access logs. The list is highly effective, but has been refined even further to include only the most heinous agents:
# USER AGENTS
SetEnvIfNoCase User-Agent "libwww" keep_out
SetEnvIfNoCase User-Agent "DotBot" keep_out
SetEnvIfNoCase User-Agent "Nutch" keep_out
SetEnvIfNoCase User-Agent "cr4nk" keep_out
<Limit GET POST PUT>
Order Allow,Deny
Allow from all
Deny from env=keep_out
</Limit>
If you only block four user agents this year, libwww, DotBot, Nutch, and cr4nk will certainly maximize your return on investment.
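One caveat worth noting about the block above: `<Limit GET POST PUT>` applies the denial only to those methods (plus HEAD, which Apache treats as GET), so requests made with other methods can slip past it. A sketch that covers every request method simply drops the `<Limit>` wrapper:

```apache
# USER AGENTS (no <Limit> wrapper, so the denial applies to every request method)
SetEnvIfNoCase User-Agent "libwww" keep_out
SetEnvIfNoCase User-Agent "DotBot" keep_out
SetEnvIfNoCase User-Agent "Nutch" keep_out
SetEnvIfNoCase User-Agent "cr4nk" keep_out
Order Allow,Deny
Allow from all
Deny from env=keep_out
```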
Ready for Summer!
Over the course of the previous year, I have had the privilege of learning a great deal about Apache’s amazingly useful HTAccess directives. The most important thing I have realized about optimizing your HTAccess strategy is that the old saying “less is more” is absolutely true. So many HTAccess files are completely overloaded with extraneous rules and pointless directives. Hopefully articles such as this will help you make wise decisions concerning your own HTAccess strategy.
34 responses to “HTAccess Spring Cleaning 2009”
Hey Louis, that’s a great point — don’t know why I didn’t think of it myself (I must be getting old!).
I wish it would warm up here in Washington state — we are still stuck in the 10- and 15-degree range, with lots of cold wind and rain. Yuck.
Any case, stay cool over there and enjoy your Summer. Don’t be a stranger! :)
I think Dreamhost uses Apache 2. $5.95 a month, I think — just don’t use auto-pay, just manually pay.
Thanks for a great list. Have implemented the changes you suggested.
Great work and thanks for sharing
//Artur
Regarding the infinite loop error: make sure you place the directive in the htaccess file in the directory _below_ the root, i.e.
server.domain/directory/.htaccess
not
server.domain/.htaccess
I made this mistake initially, and noticed upon reading the article a second time that Jeff wrote:
now it works like a charm :-)
As to the “server.domain/diretory/.htaccess” option, I can’t really implement it.
My wordpress installation *is* in the root folder. Not some subdirectory labeled “blog” or “press”.
And if you just need a newer apache server to test on, why not just install something like XAMPP or WAMP? That way you can run all of your tests from home, and not need to purchase a new server.
@Nicholas Storman: Thanks for the tip — I may end up doing something like that for testing purposes.
@Artur: My pleasure — glad to be of service! :)
@Webrocker: Good catch! That would certainly explain the issue. Btw, which version of Apache are you using?
@18 Jeff Starr: As far as I can tell from the phpinfo() it’s Apache/1.3
@19 Austin Dworaczyk Wiltshire: I’m no mod_rewrite expert, but I think you need to modify the RedirectMatch directive to make it work from the root-level – something like:
RedirectMatch 301 ^/(.*)/robots.txt http://domain/robots.txt
i.e. you want to “catch” all erroneous requests to “robots.txt” inside any directory below the rootlevel (where no robots.txt exists) and redirect that to the “robots.txt” at the rootlevel.
Damn. I tried WebRocker’s solution and unfortunately I received a 500 Internal server error when I tried to implement it.
I tried it exactly like this, and I receive that dreaded error:
RedirectMatch 301 ^/(.*)/robots.txt http://www.adubvideo.net/robots.txt
I don’t know what the darn issue is. Could it be that I am on some weird setup with my shared hosting?
Ahaaha!! It actually works!
RedirectMatch 301 ^/(.*)/robots.txt http://www.adubvideo.net/robots.txt
Works fine actually!
Note to self: nano does some weird shit with new lines. Once I opened it in a proper editor and got rid of the stray “.” in the robots.txt section, everything works dandy!
Yup, I did. But that was only one of the problems.
@Austin: I have XAMPP setup and use it for as much as possible, but many different WordPress, PHP, and even HTAccess scripts and directives fail to behave as they otherwise would when implemented in a non-localhost environment. I wish this weren’t the case, or wish I knew of a workaround for every localhost-related issue I have encountered, but until I do, I will need to pursue other options for testing on newer versions of Apache.
@Webrocker: You nailed it. I tested your modified RedirectMatch directive on Apache 1.3 and it works like a charm. I am guessing it will also work on newer versions of Apache, but again, this hasn’t been tested (if anyone could verify that this works on Apache 2+, that would be awesome).
The weird part about all of this is that when I tried both of the originally given RedirectMatch directives in the root directory of this site, one of them worked and the other did not. The favicon.ico redirected without a hitch, but the robots.txt redirected in an infinite loop. Truly bizarre, especially considering they are fundamentally the same type of redirect. After adding the ^/(.*)/ prefix to the robots\.txt expression, everything worked perfectly. Again, bizarre.
Another interesting aspect of the favicon\.gif directive is that the root case, “http://domain.tld/favicon.gif”, does not redirect to the actual icon. All other cases (non-root favicon.gif redirects) seem to work just fine.