Ultimate .htaccess Blacklist 2: Compressed Version
In our original htaccess blacklist article, we provide an extensive list of bad user agents. This so-called “Ultimate htaccess Blacklist” works great at blocking many different online villains: spammers, scammers, scrapers, rippers, leechers, you name it. Yet, despite its usefulness, there is always room for improvement.
For example, as reader Greg suggests, a compressed version of the blacklist would be very useful. In this post, we present a compressed version of our Ultimate htaccess Blacklist that features around 50 new agents. Whereas the original blacklist is approximately 8.6KB in size, the compressed version is only 3.4KB, even with the additional agents. Overall, the compressed version requires fewer system resources to block a greater number of bad agents.
# Ultimate htaccess Blacklist 2 from Perishable Press
# Deny domain access to spammers and other scumbags
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ADSARobot|ah-ha|almaden|aktuelles|Anarchie|amzn_assoc|ASPSeek|ASSORT|ATHENS|Atomz|attach|attache|autoemailspider|BackWeb|Bandit|BatchFTP|bdfetch|big.brother|BlackWidow|bmclient|Boston\ Project|BravoBrian\ SpiderEngine\ MarcoPolo|Bot\ mailto:craftbot@yahoo.com|Buddy|Bullseye|bumblebee|capture|CherryPicker|ChinaClaw|CICC|clipping|Collector|Copier|Crescent|Crescent\ Internet\ ToolPak|Custo|cyberalert|DA$|Deweb|diagem|Digger|Digimarc|DIIbot|DISCo|DISCo\ Pump|DISCoFinder|Download\ Demon|Download\ Wonder|Downloader|Drip|DSurf15a|DTS.Agent|EasyDL|eCatch|ecollector|efp@gmx\.net|Email\ Extractor|EirGrabber|email|EmailCollector|EmailSiphon|EmailWolf|Express\ WebPictures|ExtractorPro|EyeNetIE|FavOrg|fastlwspider|Favorites\ Sweeper|Fetch|FEZhead|FileHound|FlashGet\ WebWasher|FlickBot|fluffy|FrontPage|GalaxyBot|Generic|Getleft|GetRight|GetSmart|GetWeb!|GetWebPage|gigabaz|Girafabot|Go\!Zilla|Go!Zilla|Go-Ahead-Got-It|GornKer|gotit|Grabber|GrabNet|Grafula|Green\ Research|grub-client|Harvest|hhjhj@yahoo|hloader|HMView|HomePageSearch|http\ generic|HTTrack|httpdown|httrack|ia_archiver|IBM_Planetwide|Image\ Stripper|Image\ Sucker|imagefetch|IncyWincy|Indy*Library|Indy\ Library|informant|Ingelin|InterGET|Internet\ Ninja|InternetLinkagent|Internet\ Ninja|InternetSeer\.com|Iria|Irvine|JBH*agent|JetCar|JOC|JOC\ Web\ Spider|JustView|KWebGet|Lachesis|larbin|LeechFTP|LexiBot|lftp|libwww|likse|Link|Link*Sleuth|LINKS\ ARoMATIZED|LinkWalker|LWP|lwp-trivial|Mag-Net|Magnet|Mac\ Finder|Mag-Net|Mass\ Downloader|MCspider|Memo|Microsoft.URL|MIDown\ tool|Mirror|Missigua\ Locator|Mister\ PiX|MMMtoCrawl\/UrlDispatcherLLL|^Mozilla$|Mozilla.*Indy|Mozilla.*NEWT|Mozilla*MSIECrawler|MS\ FrontPage*|MSFrontPage|MSIECrawler|MSProxy|multithreaddb|nationaldirectory|Navroad|NearSite|NetAnts|NetCarta|NetMechanic|netprospector|NetResearchServer|NetSpider|Net\ Vampire|NetZIP|NetZip\ Downloader|NetZippy|NEWT|NICErsPRO|Ninja|NPBot|Octopus|Offline\ Explorer|Offline\ Navigator|OpaL|Openfind|OpenTextSiteCrawler|OrangeBot|PageGrabber|Papa\ Foto|PackRat|pavuk|pcBrowser|PersonaPilot|Ping|PingALink|Pockey|Proxy|psbot|PSurf|puf|Pump|PushSite|QRVA|RealDownload|Reaper|Recorder|ReGet|replacer|RepoMonkey|Robozilla|Rover|RPT-HTTPClient|Rsync|Scooter|SearchExpress|searchhippo|searchterms\.it|Second\ Street\ Research|Seeker|Shai|Siphon|sitecheck|sitecheck.internetseer.com|SiteSnagger|SlySearch|SmartDownload|snagger|Snake|SpaceBison|Spegla|SpiderBot|sproose|SqWorm|Stripper|Sucker|SuperBot|SuperHTTP|Surfbot|SurfWalker|Szukacz|tAkeOut|tarspider|Teleport\ Pro|Templeton|TrueRobot|TV33_Mercator|UIowaCrawler|UtilMind|URLSpiderPro|URL_Spider_Pro|Vacuum|vagabondo|vayala|visibilitygap|VoidEYE|vspider|Web\ Downloader|w3mir|Web\ Data\ Extractor|Web\ Image\ Collector|Web\ Sucker|Wweb|WebAuto|WebBandit|web\.by\.mail|Webclipping|webcollage|webcollector|WebCopier|webcraft@bea|webdevil|webdownloader|Webdup|WebEMailExtrac|WebFetch|WebGo\ IS|WebHook|Webinator|WebLeacher|WEBMASTERS|WebMiner|WebMirror|webmole|WebReaper|WebSauger|Website|Website\ eXtractor|Website\ Quester|WebSnake|Webster|WebStripper|websucker|webvac|webwalk|webweasel|WebWhacker|WebZIP|Wget|Whacker|whizbang|WhosTalking|Widow|WISEbot|WWWOFFLE|x-Tractor|^Xaldon\ WebSpider|WUMPUS|Xenu|XGET|Zeus.*Webster|Zeus [NC]
RewriteRule ^.* - [F,L]
For more information, please see our original htaccess blacklist article, the Ultimate htaccess Blacklist. You may also be interested in checking out the new and improved 6G Firewall.
Update: 2008/04/30
The blacklist has been edited to remove the DA character string. This is to prevent blocking of certain validation services such as those provided via the W3C. Thanks to John S. Britsios for identifying and sharing this information. :)
Update: 2008/05/04
The blacklist has been edited to (re)include the DA$ character string. Previously, the DA string matched various validation services because of the “da” string found in the terms “validator”, “validation”, etc. As reader Max explains, we can avoid this problem by appending a $ onto DA. Thus the blacklist has been edited to include the DA$ character string, which protects against the DA bot while allowing us to use various validation services. Thanks Max! ;)
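To make the difference concrete, here is a minimal sketch that isolates the one pattern in question. It is a trimmed-down illustration rather than a replacement for the full blacklist, and the W3C user-agent string mentioned in the comments is an assumed example of what a validation service might send.

```apache
# Sketch only: a one-pattern rule set to contrast DA with DA$.
RewriteEngine on

# Unanchored form (left commented out): with [NC], "DA" matches any
# user agent that contains "da" anywhere, so an agent such as
# "W3C_Validator/1.654" (assumed example) would be blocked.
# RewriteCond %{HTTP_USER_AGENT} DA [NC]

# Anchored form: "DA$" matches only when the user-agent string
# ends in "da", so "W3C_Validator/1.654" passes through.
RewriteCond %{HTTP_USER_AGENT} DA$ [NC]
RewriteRule ^.* - [F,L]
```

Inside the long alternation of the blacklist, the $ applies only to the DA branch, because the pipe separates each alternative from its neighbors.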
61 responses to “Ultimate .htaccess Blacklist 2: Compressed Version”
Let’s try to keep it on-topic, Pablo. Thanks! :)
sorry jeff, i thought this was on topic.
my apologies.
Alright, I suppose if you look at previous comments, your question does seem relevant. Sorry for jumping at you, just trying to keep the noise level down on these ongoing comment threads. Hopefully no hard feelings..
Looking at your question, it seems as if your site may have been Google-jacked or else compromised with some sort of redirection/proxy script. I wish I could say for sure, but there may be many things that you will need to check before identifying the culprit.
The first thing I would do is verify that all of the files on my server are intact and operating as intended. During that check, I would also keep an eye open for any files that may have been added, especially in higher-level directories. Also, a sitewide search for the term “tshake.com” and other related terms/URLs may prove very enlightening.
After verifying site integrity, I would then search for information on “Google-jacking” and follow any leads that are generated there. I would also check my site on other search engines; if the redirect is only happening for Google, it is likely due to hijacking.
I hope that provides some help, Pablo. My apologies for my previous response.
Regards,
Jeff
This doesn’t seem to help me, I may even have the wrong idea… but could I use version 1 and version 2 at the same time?
I seem to have a very intelligent sort of Spam system attacking my site, almost the human type.
Cheers
@Json: Careful with that HTAccess, Eugene! If you are actually contemplating using both versions of this blacklist, it may be best if you do a little more research on the subject to understand what you are doing. So to answer your question: “no”, by all means use the compressed version of the blacklist and call it good. Both versions would be around 90% redundant and thus wasteful of resources.
@Jeff Starr: Thank you.
Cheers
I found a testing system, YIPPEEE!
http://www.botsvsbrowsers.com/SimulateUserAgent.asp
This is neat! Enter any agent that exists in the “Ultimate htaccess Blacklist” 1 or 2 and you get your “Test Page” instead. I kept searching, believing there was one.
Cheers
Json, you are the man! Thanks for finding and sharing that excellent resource with us. I didn’t realize that something like that was available on the Web. Many thanks!
Cheers,
Jeff
@Jeff Starr: Anytime…
I am still a bit lost as to the “/”, “\”, “./” or “.\” (without quotes)
Some of the pages I read have it with “/” or “./” others have it with “\” or “.\”
Are you referring to the code presented in this article, the no-referrers article, or the eight ways to blacklist article? You have commented on all three, and I am unable to locate any such code in the article above this particular comment thread. Please try to be specific — there are hundreds of posts and thousands of comments here at Perishable Press! ;)
Sorry, it’s this one;
Block Spam by Denying Access to No-Referrer Requests
I can’t figure out if the folder is needed and which slash to use?
RewriteCond %{REQUEST_URI} .folder/wp-comments-post\.php*
OR
RewriteCond %{REQUEST_URI} .folder\wp-comments-post\.php*
Cheers
I thought I answered this question already on the no-referrer comment thread.. But anyway, in a nutshell, the folder isn’t required at all unless you need to specify that particular version of the file. If you decide you need the folder, use a forward slash after its name. The backslash (\) is an escape character in regular expressions: placed before a special character such as the dot, it tells Apache to take that character literally. Thus, in your second example, you would be escaping the letter “w” rather than matching a slash, so the pattern would not match the path you are after. Let me know if this (and the other response on the no-referrer article) is not clear.
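As a rough sketch of the two workable forms, the conditions might look like the following. Here “folder” is the placeholder directory name from the question, and these are conditions only: they still need to sit alongside the other directives from the no-referrer article to do anything.

```apache
# Sketch only: "folder" is a placeholder directory name.

# With the folder: a forward slash separates the directory from the
# file name, and the backslash escapes the dot so that it is matched
# as a literal period.
RewriteCond %{REQUEST_URI} folder/wp-comments-post\.php

# Without the folder: matches the file wherever it lives on the site.
RewriteCond %{REQUEST_URI} wp-comments-post\.php
```

Use one condition or the other, not both; stacked conditions are combined with an implicit AND, so the second would add nothing to the first.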