Ultimate .htaccess Blacklist 2: Compressed Version
In our original htaccess blacklist article, we provide an extensive list of bad user agents. This so-called “Ultimate htaccess Blacklist” works great at blocking many different online villains: spammers, scammers, scrapers, rippers, leechers — you name it. Yet, despite its usefulness, there is always room for improvement.
For example, as reader Greg suggests, a compressed version of the blacklist would be very useful. In this post, we present a compressed version of our Ultimate htaccess Blacklist that features around 50 new agents. Whereas the original blacklist is approximately 8.6KB in size, the compressed version is only 3.4KB, even with the additional agents. Overall, the compressed version requires fewer system resources to block a greater number of bad agents.
# Ultimate htaccess Blacklist 2 from Perishable Press
# Deny domain access to spammers and other scumbags
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ADSARobot|ah-ha|almaden|aktuelles|Anarchie|amzn_assoc|ASPSeek|ASSORT|ATHENS|Atomz|attach|attache|autoemailspider|BackWeb|Bandit|BatchFTP|bdfetch|big.brother|BlackWidow|bmclient|Boston\ Project|BravoBrian\ SpiderEngine\ MarcoPolo|Bot\ mailto:craftbot@yahoo.com|Buddy|Bullseye|bumblebee|capture|CherryPicker|ChinaClaw|CICC|clipping|Collector|Copier|Crescent|Crescent\ Internet\ ToolPak|Custo|cyberalert|DA$|Deweb|diagem|Digger|Digimarc|DIIbot|DISCo|DISCo\ Pump|DISCoFinder|Download\ Demon|Download\ Wonder|Downloader|Drip|DSurf15a|DTS.Agent|EasyDL|eCatch|ecollector|efp@gmx\.net|Email\ Extractor|EirGrabber|email|EmailCollector|EmailSiphon|EmailWolf|Express\ WebPictures|ExtractorPro|EyeNetIE|FavOrg|fastlwspider|Favorites\ Sweeper|Fetch|FEZhead|FileHound|FlashGet\ WebWasher|FlickBot|fluffy|FrontPage|GalaxyBot|Generic|Getleft|GetRight|GetSmart|GetWeb!|GetWebPage|gigabaz|Girafabot|Go\!Zilla|Go!Zilla|Go-Ahead-Got-It|GornKer|gotit|Grabber|GrabNet|Grafula|Green\ Research|grub-client|Harvest|hhjhj@yahoo|hloader|HMView|HomePageSearch|http\ generic|HTTrack|httpdown|httrack|ia_archiver|IBM_Planetwide|Image\ Stripper|Image\ Sucker|imagefetch|IncyWincy|Indy*Library|Indy\ Library|informant|Ingelin|InterGET|Internet\ Ninja|InternetLinkagent|InternetSeer\.com|Iria|Irvine|JBH*agent|JetCar|JOC|JOC\ Web\ Spider|JustView|KWebGet|Lachesis|larbin|LeechFTP|LexiBot|lftp|libwww|likse|Link|Link*Sleuth|LINKS\ ARoMATIZED|LinkWalker|LWP|lwp-trivial|Mag-Net|Magnet|Mac\ Finder|Mass\ Downloader|MCspider|Memo|Microsoft.URL|MIDown\ tool|Mirror|Missigua\ Locator|Mister\ PiX|MMMtoCrawl\/UrlDispatcherLLL|^Mozilla$|Mozilla.*Indy|Mozilla.*NEWT|Mozilla*MSIECrawler|MS\ FrontPage*|MSFrontPage|MSIECrawler|MSProxy|multithreaddb|nationaldirectory|Navroad|NearSite|NetAnts|NetCarta|NetMechanic|netprospector|NetResearchServer|NetSpider|Net\ Vampire|NetZIP|NetZip\ Downloader|NetZippy|NEWT|NICErsPRO|Ninja|NPBot|Octopus|Offline\ Explorer|Offline\ Navigator|OpaL|Openfind|OpenTextSiteCrawler|OrangeBot|PageGrabber|Papa\ Foto|PackRat|pavuk|pcBrowser|PersonaPilot|Ping|PingALink|Pockey|Proxy|psbot|PSurf|puf|Pump|PushSite|QRVA|RealDownload|Reaper|Recorder|ReGet|replacer|RepoMonkey|Robozilla|Rover|RPT-HTTPClient|Rsync|Scooter|SearchExpress|searchhippo|searchterms\.it|Second\ Street\ Research|Seeker|Shai|Siphon|sitecheck|sitecheck.internetseer.com|SiteSnagger|SlySearch|SmartDownload|snagger|Snake|SpaceBison|Spegla|SpiderBot|sproose|SqWorm|Stripper|Sucker|SuperBot|SuperHTTP|Surfbot|SurfWalker|Szukacz|tAkeOut|tarspider|Teleport\ Pro|Templeton|TrueRobot|TV33_Mercator|UIowaCrawler|UtilMind|URLSpiderPro|URL_Spider_Pro|Vacuum|vagabondo|vayala|visibilitygap|VoidEYE|vspider|Web\ Downloader|w3mir|Web\ Data\ Extractor|Web\ Image\ Collector|Web\ Sucker|Wweb|WebAuto|WebBandit|web\.by\.mail|Webclipping|webcollage|webcollector|WebCopier|webcraft@bea|webdevil|webdownloader|Webdup|WebEMailExtrac|WebFetch|WebGo\ IS|WebHook|Webinator|WebLeacher|WEBMASTERS|WebMiner|WebMirror|webmole|WebReaper|WebSauger|Website|Website\ eXtractor|Website\ Quester|WebSnake|Webster|WebStripper|websucker|webvac|webwalk|webweasel|WebWhacker|WebZIP|Wget|Whacker|whizbang|WhosTalking|Widow|WISEbot|WWWOFFLE|x-Tractor|^Xaldon\ WebSpider|WUMPUS|Xenu|XGET|Zeus.*Webster|Zeus [NC]
RewriteRule ^.* - [F,L]
For more information, please see our original htaccess blacklist article, the Ultimate htaccess Blacklist. You may also be interested in checking out the new and improved 6G Firewall.
Update: 2008/04/30
The blacklist has been edited to remove the DA character string. This is to prevent blocking of certain validation services, such as those provided via the W3C. Thanks to John S. Britsios for identifying and sharing this information. :)
Update: 2008/05/04
The blacklist has been edited to (re)include the DA$ character string. Previously, the DA string matched various validation services because of the “da” string found in the terms “validator”, “validation”, etc. As reader Max explains, we can avoid this problem by appending a $ onto DA. Thus the blacklist has been edited to include the DA$ character string, which protects against the DA bot while allowing us to use various validation services. Thanks Max! ;)
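To illustrate the difference, here is a minimal standalone sketch of the two patterns (in the actual blacklist, DA$ is just one alternative among many in the long RewriteCond):

# Unanchored: "DA" with [NC] matches ANY agent containing "da",
# including validators such as "W3C_Validator" (the "da" in "Validator").
# RewriteCond %{HTTP_USER_AGENT} DA [NC]

# Anchored: "DA$" matches only user agents that END with "DA",
# so validation services pass through while the DA bot is still blocked.
RewriteCond %{HTTP_USER_AGENT} DA$ [NC]
RewriteRule ^.* - [F,L]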
61 responses to “Ultimate .htaccess Blacklist 2: Compressed Version”
Thanks for the great information. Much to learn about this .htaccess stuff!
This is off topic but I want to know how to block leeching programs that download whole sites or all images according to sizes. They kill tons of bandwidth. How can I block them?
thanks
Yes, there are many, many such programs. The blacklist attempts to block a decent number of them, but there are still many more to identify and add to the list. If you happen to notice a few particular, repeat mass-downloaders in your access logs, I recommend targeting them directly and/or adding their associated user-agent(s) to the blacklist. I hope this helps!
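If you spot a repeat offender in your logs, a small targeted rule is all it takes. Here is a minimal sketch; “SiteRipper” is a hypothetical agent name, so replace it with the exact string from your own access logs:

# Block one specific bad user agent identified in the access logs
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} SiteRipper [NC]
RewriteRule ^.* - [F,L]

Multiple offenders can be chained into a single condition with the pipe (|) alternation character, just as in the big blacklist above.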
Hey thanks for the great reminder. I of all people should be reviewing my logs regularly just to make sure there aren’t any such hackers trying to get into the sensitive “admin” areas of my e-commerce site! Just imagine the danger that lurks around when we are not being watchful of this kind of stuff!
Absolutely, Dave — apathetic and/or ignorant webmasters are exactly what spammers, crackers and other scumbags are targeting. The bad news is that such vulnerable sites become an even larger threat once compromised. What’s that old saying? “An ounce of prevention is worth a pound of spam..” ;)
Great list. I’ve added it to one of my sites. I’m also thinking of adding the following, but would like your opinion on whether or not it’s worthwhile, as documented here by Andrew:
http://www.andrewjmorris.com/site-hijacking-part-2.htm#comments
@flash arcade games: that is the first time I have heard of that particular type of site “hijacking.” The PHP code itself is fairly simple, but implemented sitewide on a busy site, it could reduce performance: every request for every page on your site would be running the script. Thus, unless you have a small site and/or light traffic, I would only use that technique if I discovered that my site was being ripped. In any case, it is very interesting — thanks for pointing it out!
Hi, I’m trying to block all web proxies from seeing my web page, but it seems impossible. It does not work for me.
I updated my .htaccess file with the code in this article, and I can still access my home page perfectly using hidemyass.com.
Any idea?
I need to block this because I recently banned abusive users from my home page, and they are using web proxies to spam my forum and register a lot of troll accounts.
@ronaquin496: Unfortunately, there is no way to block all web proxies. If a page exists publicly on the Web, it is accessible by anyone (depending on skill level, etc.) who desires access. Of course, the blacklist presented above won’t block hidemyass.com because it’s not on the list.
@Perishable: Web proxies are very hard, almost impossible, to block. The only thing I could come up with is checking the referrer (if it is not blocked), then making a script to visit it later and see whether it looks like a proxy script, and if so, block that referrer. If a proxy blocks the referrer from passing (as most do), then it’s tough luck. Actually, I just remembered you can probably grab the URL location via JavaScript and compare it to your domains; that might actually be the best solution for web proxies.
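A referrer-based block along these lines could look like the following sketch. Note that the proxy domains shown are placeholders, not a real blocklist; an actual list would have to be built from your own access logs, and as noted above it fails whenever the proxy strips the referrer:

# Deny requests referred by known web-proxy domains (placeholder names)
RewriteEngine On
RewriteCond %{HTTP_REFERER} proxy-example\.com [NC,OR]
RewriteCond %{HTTP_REFERER} another-proxy\.net [NC]
RewriteRule ^.* - [F,L]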
Thanks for updating the list to include validators. I was wondering why w3c was having issues.
When I google my site, it gets redirected to a proxy site, tshake.com.
Google has indexed this page for the past 2 weeks.
Any tips on how to fix this?