
Ultimate htaccess Blacklist 2 (Compressed Version)

[ Keywords: htaccess, rewrite, blacklist, block, deny, spam, spammers, scrapers, rippers ]

[ Image: Lunar Eclipse ]

In our original htaccess blacklist article, we provide an extensive list of bad user agents. This so-called “Ultimate htaccess Blacklist” works great at blocking many different online villains: spammers, scammers, scrapers, scrappers, rippers, leechers — you name it. Yet, despite its usefulness, there is always room for improvement. For example, as reader Greg suggests, a compressed version of the blacklist would be very useful. In this post, we present a compressed version of our Ultimate htaccess Blacklist that features around 50 new agents. Whereas the original blacklist is approximately 8.6KB in size, the compressed version is only 3.4KB, even with the additional agents. Overall, the compressed version requires fewer system resources to block a greater number of bad agents.

# Ultimate htaccess Blacklist 2 from Perishable Press
# Deny domain access to spammers and other scumbags
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ADSARobot|ah-ha|almaden|aktuelles|Anarchie|amzn_assoc|ASPSeek|ASSORT|ATHENS|Atomz|attach|attache|autoemailspider|BackWeb|Bandit|BatchFTP|bdfetch|big.brother|BlackWidow|bmclient|Boston\ Project|BravoBrian\ SpiderEngine\ MarcoPolo|Bot\ mailto:craftbot@yahoo.com|Buddy|Bullseye|bumblebee|capture|CherryPicker|ChinaClaw|CICC|clipping|Collector|Copier|Crescent|Crescent\ Internet\ ToolPak|Custo|cyberalert|DA$|Deweb|diagem|Digger|Digimarc|DIIbot|DISCo|DISCo\ Pump|DISCoFinder|Download\ Demon|Download\ Wonder|Downloader|Drip|DSurf15a|DTS.Agent|EasyDL|eCatch|ecollector|efp@gmx\.net|Email\ Extractor|EirGrabber|email|EmailCollector|EmailSiphon|EmailWolf|Express\ WebPictures|ExtractorPro|EyeNetIE|FavOrg|fastlwspider|Favorites\ Sweeper|Fetch|FEZhead|FileHound|FlashGet\ WebWasher|FlickBot|fluffy|FrontPage|GalaxyBot|Generic|Getleft|GetRight|GetSmart|GetWeb!|GetWebPage|gigabaz|Girafabot|Go\!Zilla|Go!Zilla|Go-Ahead-Got-It|GornKer|gotit|Grabber|GrabNet|Grafula|Green\ Research|grub-client|Harvest|hhjhj@yahoo|hloader|HMView|HomePageSearch|http\ generic|HTTrack|httpdown|httrack|ia_archiver|IBM_Planetwide|Image\ Stripper|Image\ Sucker|imagefetch|IncyWincy|Indy*Library|Indy\ Library|informant|Ingelin|InterGET|Internet\ Ninja|InternetLinkagent|Internet\ Ninja|InternetSeer\.com|Iria|Irvine|JBH*agent|JetCar|JOC|JOC\ Web\ Spider|JustView|KWebGet|Lachesis|larbin|LeechFTP|LexiBot|lftp|libwww|likse|Link|Link*Sleuth|LINKS\ ARoMATIZED|LinkWalker|LWP|lwp-trivial|Mag-Net|Magnet|Mac\ Finder|Mag-Net|Mass\ Downloader|MCspider|Memo|Microsoft.URL|MIDown\ tool|Mirror|Missigua\ Locator|Mister\ PiX|MMMtoCrawl\/UrlDispatcherLLL|^Mozilla$|Mozilla.*Indy|Mozilla.*NEWT|Mozilla*MSIECrawler|MS\ FrontPage*|MSFrontPage|MSIECrawler|MSProxy|multithreaddb|nationaldirectory|Navroad|NearSite|NetAnts|NetCarta|NetMechanic|netprospector|NetResearchServer|NetSpider|Net\ Vampire|NetZIP|NetZip\ Downloader|NetZippy|NEWT|NICErsPRO|Ninja|NPBot|Octopus|Offline\ Explorer|Offline\ Navigator|OpaL|Openfind|OpenTextSiteCrawler|OrangeBot|PageGrabber|Papa\ Foto|PackRat|pavuk|pcBrowser|PersonaPilot|Ping|PingALink|Pockey|Proxy|psbot|PSurf|puf|Pump|PushSite|QRVA|RealDownload|Reaper|Recorder|ReGet|replacer|RepoMonkey|Robozilla|Rover|RPT-HTTPClient|Rsync|Scooter|SearchExpress|searchhippo|searchterms\.it|Second\ Street\ Research|Seeker|Shai|Siphon|sitecheck|sitecheck.internetseer.com|SiteSnagger|SlySearch|SmartDownload|snagger|Snake|SpaceBison|Spegla|SpiderBot|sproose|SqWorm|Stripper|Sucker|SuperBot|SuperHTTP|Surfbot|SurfWalker|Szukacz|tAkeOut|tarspider|Teleport\ Pro|Templeton|TrueRobot|TV33_Mercator|UIowaCrawler|UtilMind|URLSpiderPro|URL_Spider_Pro|Vacuum|vagabondo|vayala|visibilitygap|VoidEYE|vspider|Web\ Downloader|w3mir|Web\ Data\ Extractor|Web\ Image\ Collector|Web\ Sucker|Wweb|WebAuto|WebBandit|web\.by\.mail|Webclipping|webcollage|webcollector|WebCopier|webcraft@bea|webdevil|webdownloader|Webdup|WebEMailExtrac|WebFetch|WebGo\ IS|WebHook|Webinator|WebLeacher|WEBMASTERS|WebMiner|WebMirror|webmole|WebReaper|WebSauger|Website|Website\ eXtractor|Website\ Quester|WebSnake|Webster|WebStripper|websucker|webvac|webwalk|webweasel|WebWhacker|WebZIP|Wget|Whacker|whizbang|WhosTalking|Widow|WISEbot|WWWOFFLE|x-Tractor|^Xaldon\ WebSpider|WUMPUS|Xenu|XGET|Zeus.*Webster|Zeus [NC]
RewriteRule ^.* - [F,L]
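
To make the structure of the rule easier to follow, here is a shortened sketch built from a handful of agents in the list above. It is only an illustration of how the pieces fit together, not a substitute for the full blacklist, and the trimmed RewriteRule pattern is a simplification rather than the published rule.

```apache
# Shortened illustration of the blacklist structure (not the full list)
RewriteEngine On
# One condition, many alternatives separated by "|".
# [NC] makes the whole match case-insensitive.
#   ^Mozilla$      matches only a user agent that is exactly "Mozilla"
#   DA$            matches only when the user agent ends in "DA"
#   Teleport\ Pro  the space is escaped so the pattern stays a single argument
RewriteCond %{HTTP_USER_AGENT} ^Mozilla$|DA$|HTTrack|Teleport\ Pro|WebZIP|Wget [NC]
# "-" leaves the URL unchanged; [F] answers with 403 Forbidden.
RewriteRule ^ - [F]
```

Any request whose User-Agent header matches one of the alternatives receives a 403 Forbidden response; everything else passes through untouched.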

For more information, please see our original htaccess blacklist article, the Ultimate htaccess Blacklist.

Update: (April 30th, 2008) the blacklist has been edited to remove the DA character string. This is to prevent blocking of certain validation services such as those provided via the W3C. Thanks to John S. Britsios for identifying and sharing this information. :)

Update: (May 4th, 2008) the blacklist has been edited to (re)include the DA$ character string. Previously, the DA string matched various validation services because of the “da” string found in the terms “validator”, “validation”, etc. As reader Max explains, we can avoid this problem by appending a $ onto DA. Thus the blacklist has been edited to include the DA$ character string, which protects against the DA bot while allowing us to use various validation services. Thanks Max! ;)
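
One side effect worth keeping in mind: because the whole condition carries the [NC] flag, DA$ also matches any user agent that merely ends in a lowercase “da”. If that ever becomes a problem, a possible refinement is to move that one pattern into its own case-sensitive condition. This is not part of the published list, and it assumes that the DA agent identifies itself in uppercase. The short second condition below is a stand-in for the rest of the blacklist.

```apache
RewriteEngine On
# Case-sensitive on purpose: no [NC] here, so only an uppercase "DA"
# at the very end of the user agent matches (assumption: the bot uses uppercase).
RewriteCond %{HTTP_USER_AGENT} DA$ [OR]
# Stand-in for the remaining blacklist entries, still case-insensitive.
RewriteCond %{HTTP_USER_AGENT} HTTrack|WebZIP|Wget [NC]
RewriteRule ^ - [F]
```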

About this article

This is article #420, posted by Perishable on Monday, October 15, 2007 @ 11:14am. Categorized as Function, and tagged with apache, blacklist, htaccess, list, mod_rewrite, security, spam, tricks. Updated on May 04, 2008. Visited 14188 times. 38 Responses »




38 Responses

1 • October 16, 2007 at 9:26 am — Simeon_seo says:

Why do you block “Yandex”? It is the spider of the most popular Russian search engine.

2 • October 16, 2007 at 9:45 am — Perishable says:

Good question.. adding it to the list was suggested by a fellow forum member awhile ago. I sniffed around Google a bit but found no evidence of maleficence, so I have removed it from the list. Thanks for pointing it out to us ;)

3 • November 9, 2007 at 11:17 am — John says:

How does this blacklist affect Google crawling your site? Are there no side effects?

4 • November 10, 2007 at 6:53 pm — Perishable says:

John, rest assured that there are absolutely no googlebots mentioned in the blacklist. We are targeting the bad guys here, not legitimate agents such as Google, Yahoo!, and MSN. No worries! ;)

5 • December 11, 2007 at 7:00 pm — Lisa says:

Excellent! This is much better than the first list you made. I didn’t know that the user-agent names could all be on one line.

This may be off-topic, but I have a question. Can I use the same technique (one line) to block IP addresses using the “deny from ip” code?

6 • December 11, 2007 at 10:22 pm — Perishable says:

Lisa you are wearing me out! :)

Anyway, to answer your question, yes, you can list multiple IP addresses in one line when using a “deny from ip” directive. Here is the syntax that you would use:

deny from 123.123.123 456.456.456 789.789.789

..etc. Notice that we are using spaces to separate the individual IPs — no commas allowed ;)
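
For context, a “deny from” line normally sits inside a small access-control block. The following is a minimal sketch in the Apache 2.2 style used throughout these comments. The addresses are placeholders taken from the ranges reserved for documentation, so swap in the ones you actually want to block.

```apache
# Apache 2.2-style access control
# (on Apache 2.4 these directives are provided by mod_access_compat)
Order Allow,Deny
Allow from all
# Several addresses on one line, separated by spaces (placeholder addresses)
Deny from 192.0.2.10 198.51.100.25 203.0.113.7
```

With Order Allow,Deny, everyone is allowed in first and the Deny line then removes the listed addresses.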

7 • January 21, 2008 at 11:09 pm — bjarbj78 says:

Hei.

Thanks for this list.
I have my own .htaccess file that is included with my PHP-Nuke site, and I don’t want to mess it up, so I’m not sure: what is
RewriteBase / ?

And where should I put the RewriteEngine On? At the top of my .htaccess, or below some RewriteCond and RewriteRule?

It is very important that I don’t mess up my .htaccess, hehe

Thanks for the very good homepage you’ve got.

Best Regards

bjarbj78

8 • January 22, 2008 at 8:57 am — Perishable says:

Hei bjarbj78,

RewriteBase specifies the URL prefix that mod_rewrite uses for per-directory (htaccess) rules whenever a rule substitutes a relative path. It is typically needed when the URL of a directory does not line up with its location on the filesystem, so that the rewrite rules act upon the correct path. Don’t worry, if it sounds confusing, it is. I still find myself referring to the source when trying to explain it to people :)
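
A small example may help. Assume, hypothetically, an htaccess file that lives in a /blog/ subdirectory and needs to redirect one page to another in the same directory. The directory and page names here are made up for illustration.

```apache
# .htaccess inside the directory served at /blog/ (hypothetical layout)
RewriteEngine On
# Relative substitutions below are resolved against this URL prefix.
RewriteBase /blog/
# "old-page" and "new-page" are made-up names;
# the redirect target becomes /blog/new-page.
RewriteRule ^old-page$ new-page [R=301,L]
```

The blacklist rule itself only uses the “-” substitution, which rewrites nothing, so as far as I can tell RewriteBase has very little to do there.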

Regards,
Jeff

9 • January 23, 2008 at 9:57 am — bjarbj78 says:

Thank you very much Jeff :)
Do you know how to block proxy servers? I made a big list of about 9139 proxy domains, but it didn’t work when I used:

deny from proxydomain.com proxydomain2.com and so on.

Regards
bjarbj78

10 • January 23, 2008 at 10:38 pm — Perishable says:

The question is, even if you could use htaccess to block over 9,000 domains, would you really want to? If you consider the potential performance hit and excessive load on server resources associated with the perpetual processing of such a monstrous list, it may inspire you to seek a healthier, perhaps more effective alternative..

For example, here is the code that I use for stopping 99% of the proxies that attempt to access one of my sites:

# prevent proxy access
RewriteEngine on
RewriteCond %{HTTP:VIA} !^$ [OR]
RewriteCond %{HTTP:FORWARDED} !^$ [OR]
RewriteCond %{HTTP:USERAGENT_VIA} !^$ [OR]
RewriteCond %{HTTP:X_FORWARDED_FOR} !^$ [OR]
RewriteCond %{HTTP:PROXY_CONNECTION} !^$ [OR]
RewriteCond %{HTTP:XPROXY_CONNECTION} !^$ [OR]
RewriteCond %{HTTP:HTTP_PC_REMOTE_ADDR} !^$ [OR]
RewriteCond %{HTTP:HTTP_CLIENT_IP} !^$
RewriteRule .* - [F]

Place in root htaccess and take ’er for a spin via any proxy service(s) of your choice.. - it may not be perfect, but it’s lightweight, concise, and very effective ;)
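
One caveat on the header names: as far as I understand mod_rewrite, %{HTTP:name} looks up the request header with exactly that name, and proxies usually send hyphenated names such as X-Forwarded-For rather than X_FORWARDED_FOR. A variant written under that assumption might look like the sketch below. The selection of headers is illustrative, not exhaustive, and has not been tested against any particular proxy service.

```apache
# Sketch: same idea as above, using hyphenated request-header names
RewriteEngine On
RewriteCond %{HTTP:Via} !^$ [OR]
RewriteCond %{HTTP:Forwarded} !^$ [OR]
RewriteCond %{HTTP:X-Forwarded-For} !^$ [OR]
RewriteCond %{HTTP:Proxy-Connection} !^$ [OR]
RewriteCond %{HTTP:Client-IP} !^$
# Any non-empty proxy-related header results in 403 Forbidden.
RewriteRule ^ - [F]
```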

11 • January 24, 2008 at 6:14 am — bjarbj78 says:

Hi Jeff.

Thank you very much. It works :)
lol, and I spent almost 2 to 3 hours completing my list, hehe

Regards

Bjørn

12 • January 24, 2008 at 6:32 am — bjarbj78 says:

Hi Jeff.

I found these lines too:

rewritecond %{HTTP:Forwarded-For} !^$ [OR]
rewritecond %{HTTP:X-Forwarded} !^$ [OR]

Can I use those lines too?

13 • January 24, 2008 at 1:10 pm — Perishable says:

Hi Bjørn,

Glad it worked for you! I have found it to be much more efficient than trying to block the endless swarm of proxy servers individually — things change waay too quickly for that. As for the additional rewrite conditions you mention, I have not used them.. perhaps you could test their effectiveness and share the results with us? ;)

Regards,
Jeff

14 • January 25, 2008 at 11:10 am — Lisa says:

Hi again Jeff :)

I have a question, can I use a wildcard to block an IP.. eg.:
deny from 74.54.143.*
or
deny from 74.54.*.*

The above IP ranges have been scraping some of my blog content and I’m getting tired of blocking them one by one.

I’ve Googled for an answer but can’t seem to find any hint on wildcards.

Hope you can help me ;)

15 • January 27, 2008 at 9:24 am — Perishable says:

Hi Lisa :)

Using htaccess, wildcards are not necessary when specifying specific ranges of IP addresses. Simply truncate the address to the target range, for example:

deny from 74.54.

..would block every IP beginning with the 74.54. prefix:

74.54.1
   .
   .
   .
74.54.255
74.54.255.1
   .
   .
   .
74.54.255.255

..etc. Of course, you could also use wildcards if so desired:

deny from 74.54.*

..which would block the exact same range as before. It’s totally your choice! :)
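
Another option is CIDR notation, which is one of the address formats listed in the Apache documentation for this directive. It states the range explicitly, so there is no guessing about dots or wildcards. A sketch for the two ranges mentioned in the question:

```apache
# Blocks 74.54.143.0 through 74.54.143.255
Deny from 74.54.143.0/24
# Blocks 74.54.0.0 through 74.54.255.255
Deny from 74.54.0.0/16
```

The second line already covers the first, so in practice only one of the two would be needed.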

16 • January 27, 2008 at 9:32 am — Perishable says:

Ah, one more thing that I should point out.. Be careful when using the “dot” (.) in your blocked IP ranges. If we omit the dot from the example in the previous comment, we would block a different set of IP addresses:

74.540
   .
   .
   .
74.549
74.540.1
   .
   .
   .
74.549.255

..etc. Inclusion or exclusion of the dot can make all of the difference!

17 • January 27, 2008 at 9:38 am — Lisa says:

Thanks Jeff :) You’re a life saver!!

18 • January 30, 2008 at 4:33 am — Pádraig Brady says:

It would be cool if you could provide
just the last 2 lines in a downloadable file,
that you would update and others could download periodically to
construct their .htaccess files from

cheers,
Pádraig.

19 • January 30, 2008 at 4:39 am — Pádraig Brady says:

Hey you have httrack and ia_archiver in there?

20 • January 30, 2008 at 4:04 pm — Perishable says:

Pádraig Brady,

Yes, that is an excellent idea. I am currently in the process of building the next version of the Ultimate htaccess Blacklist (v.3), and will definitely post an article once it is finished. As you can see, this post features the second version of the Blacklist, and I continue to post many useful tips and tricks concerning all things htaccess, PHP, XHTML, CSS, and all of the other web-related acronyms ;) Thus, my advice to people who want to stay current with the Ultimate htaccess Blacklist as well as other security and anti-spam information is to subscribe to my feed — I guarantee you won’t regret it!

Regards,
Jeff_

21 • March 5, 2008 at 5:50 am — Peter says:

I’m testing this list and I’m having a problem with RewriteBase /

I’d prefer to use this list in my httpd.conf, but RewriteBase / is throwing an error: “only valid on a per dir config files”. Is it required for this list to work?

22 • March 5, 2008 at 10:12 am — Perishable says:

Hi Peter, you are correct — you must remove the RewriteBase / directive when using the blacklist in your httpd.conf file. It is meant for per-directory htaccess files and is not needed when working with httpd.conf.
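
For anyone else going the httpd.conf route, a minimal sketch of the same rule inside a virtual host might look like the following. The server name and document root are placeholders, and the short condition stands in for the full blacklist line.

```apache
<VirtualHost *:80>
    # Placeholder host details
    ServerName www.example.com
    DocumentRoot /var/www/example

    RewriteEngine On
    # No RewriteBase here: it belongs to per-directory context only.
    # Stand-in for the full blacklist condition.
    RewriteCond %{HTTP_USER_AGENT} HTTrack|WebZIP|Wget [NC]
    RewriteRule ^ - [F]
</VirtualHost>
```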

23 • March 5, 2008 at 12:00 pm — Peter says:

:) COOL!!

Thanks

24 • March 15, 2008 at 12:34 pm — Max says:

After including this list in my file, I noticed that the W3 HTML and CSS validator would no longer work on my site, presumably because agent “Jigsaw/2.2.5 W3C_CSS_Validator_JFouffa/2.0″ etc has “DA” in the title. This can be resolved by changing agent “DA” to “DA$”.

25 • March 15, 2008 at 5:54 pm — Perishable says:

Thanks for the tip, Max. I will definitely include this modification in the next iteration of the blacklist. Cheers! ;)

26 • April 21, 2008 at 2:04 am — Fury says:

Hi!
Add “anonymouse” to the http_user_agent list

27 • April 21, 2008 at 7:51 am — Perishable says:

Sure, I am more than willing to add it to the next incarnation, provided there is some sort of explanation as to why “anonymouse” (or any other user agent for that matter) deserves to be blacklisted. Drop a link and we’ll go from there.. ;)

28 • April 23, 2008 at 7:58 am — Peter says:

anonymouse is a proxy site loaded with scumbags. I had added it to my list awhile ago along with SurveyBot|Nikto|MEGAUPLOAD|anonymouse|Java/1.0|CMS\ Spider

29 • April 23, 2008 at 8:33 am — Perishable says:

Yes, I have also read elsewhere that blocking anonymouse is a wise move. I will add the directive to the next incarnation of the blacklist. Thanks for helping to improve the list! :)
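
In the meantime, one way to bolt extra agents onto the existing rule without touching the long line might look like this. The first condition uses the agents Peter lists (with the dot in Java/1.0 escaped, which is my own tweak), and the short second condition stands in for the full blacklist.

```apache
RewriteEngine On
# Additional agents, joined to the main list with [OR]
RewriteCond %{HTTP_USER_AGENT} SurveyBot|Nikto|MEGAUPLOAD|anonymouse|Java/1\.0|CMS\ Spider [NC,OR]
# Stand-in for the full blacklist condition
RewriteCond %{HTTP_USER_AGENT} HTTrack|WebZIP|Wget [NC]
RewriteRule ^ - [F]
```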

30 • April 29, 2008 at 12:58 pm — John S. Britsios says:

I appreciate your great work very much, and I read your tutorial with great interest. I have experienced only one problem that I could not figure out how to solve.

When I use your blocking list, I cannot use the W3C HTML and CSS validator for validating my site.

Also I cannot use the http://www.htmlhelp.com/cgi-bin/validate.cgi validator either.

Can you tell which user agent is in the list that blocks the above validators?

Thank you very much in advance.

31 • April 29, 2008 at 2:14 pm — Perishable says:

Hi John, I’m glad you enjoy the article. As for the blacklist, are you sure that it is the cause of the problem? It may very well be, however, many people use the list without any similar issues. Nonetheless, if you have determined that the source of the conflict involves the blacklist rules, a bit of further testing should help you target the problematic user agents. First, remove the first half of the user agents and test if things are working. If so, replace half of the removed agents and test again. Likewise, if the issue did not resolve after removing the first half, repeat the process with the second half. By repeating this process a few times (it shouldn’t take too long), you should be able to identify the conflicting agents. I hope that makes sense. I would do this exercise myself, but I am literally swamped with a million other tasks and chores that I have to get done. In any case, I hope this information is useful, as I definitely want to help you implement successfully the ultimate htaccess blacklist. Finally, if you do find problematic user agent(s), please drop a comment and share with the community here at Perishable Press.
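
One way to make that halving exercise less fiddly is to split the long condition into two conditions joined with [OR], then comment out one half at a time. A sketch, with only a few agents shown in each half for brevity:

```apache
RewriteEngine On
# Half A: disabled for this test run. Remove the leading "#" to re-enable it.
# RewriteCond %{HTTP_USER_AGENT} ADSARobot|ah-ha|almaden [NC,OR]
# Half B: still active. It is the last condition, so it carries no [OR] flag.
RewriteCond %{HTTP_USER_AGENT} Wget|Xenu|XGET [NC]
RewriteRule ^ - [F]
```

If the validator works with Half A disabled, the culprit is somewhere in Half A, so split that half again and repeat. When disabling Half B instead, drop the OR flag from Half A so that the last active condition never ends in [OR].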

Cheers,
Jeff

32 • April 30, 2008 at 11:57 am — John S. Britsios says:

Thanks Jeff for the reply.
I found out which was blocking W3C Validator.

It was: DA

I just thought of sharing.

Thanks again for the great work.

Cheers,

John

33 • April 30, 2008 at 1:32 pm — Perishable says:

Hi John, thank you for taking the time to share this information. I have removed the DA character string from the list and updated the article so that others may benefit from your work (and generosity). Thank you for helping to improve the quality of the blacklist! :)

Kind regards,
Jeff

34 • May 1, 2008 at 2:51 am — Max says:

John; you’re right - DA does block WWW validators - see my post #24 above!

35 • May 1, 2008 at 5:28 am — Perishable says:

Doh! I completely forgot about Max’s comment concerning the DA string! Apparently age is directly proportional to forgetfulness (I should have scrolled thru the comments before responding) — would’ve saved everyone some time. Max’s solution also enables us to continue blocking the DA agent while allowing access to the various validation services. Once I return to the computer (this weekend), I will update the article and the blacklist with this information. Huge apologies to Max — thanks for the gentle reminder ;)

36 • May 4, 2008 at 8:30 am — Perishable says:

The blacklist and article have been updated to (re)include the DA$ character string. Many thanks to Max for pointing this out. Cheers!

37 • May 12, 2008 at 11:26 am — TBC says:

Thanks for the great information. Much to learn about this .htaccess stuff!

