4G Series: The Ultimate User-Agent Blacklist, Featuring Over 1200 Bad Bots

by Jeff Starr on Sunday, March 29, 2009 59 Responses

[ Image: Inverted Eclipse ] As discussed in my recent article, Eight Ways to Blacklist with Apache’s mod_rewrite, one method of stopping spammers, scrapers, email harvesters, and malicious bots is to blacklist their associated user agents. Apache enables us to target bad user agents by testing the user-agent string against a predefined blacklist of unwanted visitors. Any bot identifying itself as one of the blacklisted agents is immediately and quietly denied access. While this certainly isn’t the most effective method of securing your site against malicious behavior, it may certainly provide another layer of protection.

Even so, there are several things to consider before choosing to implement an extensive user-agent blacklist on your site. First and most importantly is the transient nature of the user agent itself. On most systems, the user-agent variable is easy to change, making it possible for bot owners to use any user-agent name they wish. Once a bad bot makes the rounds, becomes known, and is blacklisted, the bot owner need only modify or change its declared user agent and they’re back in business. User-agent names are constantly invented, spoofed, or otherwise altered in order to operate beneath — or above — the virtual radar. Thus, a user-agent blacklist is a high-maintenance affair, requiring continuous cultivation in order to maintain relevancy and effectiveness.

Performance is another important issue to consider. While a well-maintained user-agent blacklist may average a reasonable number of user agents, blacklists that are simply appended with new names will eventually grow painfully large and ultimately decrease server performance. Then you’re left with a never-ending blacklist of retired user agents that fails to protect your site while slowing things down to a virtual crawl (no pun intended). And despite your best intentions, we both know that taking time for periodic “blacklist maintenance” is a luxury that simply doesn’t exist, at least for most of us.

As if those reasons weren’t enough to persuade you against using an ultimate user-agent blacklist, here is another: the 4G Blacklist. Put simply, the 4G Blacklist is a more effective way to protect your site against a wide variety of spam, exploits, and malicious attacks. Unlike huge lists of banned user agents, the 4G Blacklist requires zero maintenance, consumes fewer resources, and may retain its effectiveness indefinitely. But alas, for those of you who are still determined to get your hands on the latest “ultimate” user-agent blacklist, here you go..

The Ultimate User-Agent Blacklist

As you may recall, the original Ultimate HTAccess Blacklist was released here at Perishable Press a couple of years ago. Then, several months later, I added more bad user agents, compressed the list into single-line format, and released the Ultimate HTAccess Blacklist 2. This list contained over 300 bad bots and was generally well-received by the community, protecting many sites against a plethora of site rippers, grabbers, spammers, harvesters, bad bots, and other online scum. When used as a solid foundation on which to build and cultivate your own user-agent blacklist, the Ultimate HTAccess Blacklist can do wonders to improve overall performance, decrease site maintenance, and reduce server expense.

Now, in this new and improved version of the Ultimate User-Agent Blacklist, I have integrated my recent collection 1 of actively malicious bad bots to more than quadruple the number of blocked user agents. This new list features a whopping 1211 blacklisted user agents, including three of my own creation 2 to be used exclusively for my diabolical and obsessive monitoring purposes (insert maniacal laughter here). Also, as with the second version of the user-agent blacklist, this new version is written in compressed, single-line format to facilitate usability and performance.

So, without further ado, here is the third incarnation of the Ultimate User-Agent Blacklist. Simply copy and paste the following code into the root HTAccess file of your site to enjoy a serious reduction in wasted bandwidth, stolen resources, and comment spam. Remember to backup your stuff before you meddle with things, and always test, test, test whenever implementing HTAccess directives.

# PERISHABLE PRESS ULTIMATE USER-AGENT BLACKLIST

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^$|\<|\>|\'|\%|\_iRc|\_Works|\@\$x|\<\?|\$x0e|\+select\+|\+union\+|1\,\1\,1\,|2icommerce|3GSE|4all|59\.64\.153\.|88\.0\.106\.|98|85\.17\.|A\_Browser|ABAC|Abont|abot|Accept|Access|Accoo|AceFTP|Acme|ActiveTouristBot|Address|Adopt|adress|adressendeutschland|ADSARobot|agent|ah\-ha|Ahead|AESOP\_com\_SpiderMan|aipbot|Alarm|Albert|Alek|Alexibot|Alligator|AllSubmitter|alma|almaden|ALot|Alpha|aktuelles|Akregat|Amfi|amzn\_assoc|Anal|Anarchie|andit|Anon|AnotherBot|Ansearch|AnswerBus|antivirx|Apexoo|appie|Aqua_Products|Arachmo|archive|arian|ASPSe|ASSORT|aster|Atari|ATHENS|AtHome|Atlocal|Atomic_Email_Hunter|Atomz|Atrop|^attach|attrib|autoemailspider|autohttp|axod|batch|b2w|Back|BackDoorBot|BackStreet|BackWeb|Badass|Baid|Bali|Bandit|Baidu|Barry|BasicHTTP|BatchFTP|bdfetch|beat|Become|Beij|BenchMark|berts|bew|big.brother|Bigfoot|Bilgi|Bison|Bitacle|Biz360|Black|Black.Hole|BlackWidow|bladder.fusion|Blaiz|Blog.Checker|Blogl|BlogPeople|Blogshares.Spiders|Bloodhound|Blow|bmclient|Board|BOI|boitho|Bond|Bookmark.search.tool|boris|Bost|Boston.Project|BotRightHere|Bot.mailto:craftbot@yahoo.com|BotALot|botpaidtoclick|botw|brandwatch|BravoBrian|Brok|Bropwers|Broth|browseabit|BrowseX|Browsezilla|Bruin|bsalsa|Buddy|Build|Built|Bulls|bumblebee|Bunny|Busca|Busi|Buy|bwh3|c\-spider|CafeK|Cafi|camel|Cand|captu|Catch|cd34|Ceg|CFNetwork|cgichk|Cha0s|Chang|chaos|Char|char\(32\,35\)|charlotte|CheeseBot|Chek|CherryPicker|chill|ChinaClaw|CICC|Cisco|Cita|Clam|Claw|Click.Bot|clipping|clshttp|Clush|COAST|ColdFusion|Coll|Comb|commentreader|Compan|contact|Control|contype|Conc|Conv|Copernic|Copi|Copy|Coral|Corn|core-project|cosmos|costa|cr4nk|crank|craft|Crap|Crawler0|Crazy|Cres|cs\-CZ|cuill|Curl|Custo|Cute|CSHttp|Cyber|cyberalert|^DA$|daoBot|DARK|Data|Daten|Daum|dcbot|dcs|Deep|DepS|Detect|Deweb|Diam|Digger|Digimarc|digout4uagent|DIIbot|Dillo|Ding|DISC|discobot|Disp|Ditto|DLC|DnloadMage|DotBot|Doubanbot|Download|Download.Demon|Download.Devil|Download.Wonder|Downloader|drag|DreamPassport|Drec|Drip|dsdl|dsok|DSurf|DTAAgent|DTS|Dual|dumb|DynaWeb|e\-collector|eag|earn|EARTHCOM|EasyDL|ebin|EBM-APPLE|EBrowse|eCatch|echo|ecollector|Edco|edgeio|efp\@gmx\.net|EirGrabber|email|Email.Extractor|EmailCollector|EmailSearch|EmailSiphon|EmailWolf|Emer|empas|Enfi|Enhan|Enterprise\_Search|envolk|erck|EroCr|ESurf|Eval|Evil|Evere|EWH|Exabot|Exact|EXPLOITER|Expre|Extra|ExtractorPro|EyeN|FairAd|Fake|FANG|FAST|fastlwspider|FavOrg|Favorites.Sweeper|Faxo|FDM\_1|FDSE|fetch|FEZhead|Filan|FileHound|find|Firebat|Firefox.2\.0|Firs|Flam|Flash|FlickBot|Flip|fluffy|flunky|focus|Foob|Fooky|Forex|Forum|ForV|Fost|Foto|Foun|Franklin.Locator|freefind|FreshDownload|FrontPage|FSurf|Fuck|Fuer|futile|Fyber|Gais|GalaxyBot|Galbot|Gamespy\_Arcade|GbPl|Gener|geni|Geona|Get|gigabaz|Gira|Ginxbot|gluc|glx.?v|gnome|Go.Zilla|Goldfire|Google.Wireless.Transcoder|Googlebot\-Image|Got\-It|GOFORIT|gonzo|GornKer|GoSearch|^gotit$|gozilla|grab|Grabber|GrabNet|Grub|Grup|Graf|Green.Research|grub|grub\-client|gsa\-cra|GSearch|GT\:\:WWW|GuideBot|guruji|gvfs|Gyps|hack|haha|hailo|Harv|Hatena|Hax|Head|Helm|herit|hgre|hhjhj\@yahoo|Hippo|hloader|HMView|holm|holy|HomePageSearch|HooWWWer|HouxouCrawler|HMSE|HPPrint|htdig|HTTPConnect|httpdown|http.generic|HTTPGet|httplib|HTTPRetriever|HTTrack|human|Huron|hverify|Hybrid|Hyper|ia\_archiver|iaskspi|IBM\_Planetwide|iCCra|ichiro|ID\-Search|IDA|IDBot|IEAuto|IEMPT|iexplore\.exe|iGetter|Ilse|Iltrov|Image|Image.Stripper|Image.Sucker|imagefetch|iimds\_monitor|Incutio|IncyWincy|Indexer|Industry.Program|Indy|InetURL|informant|InfoNav|InfoTekies|Ingelin|Innerpr|Inspect|InstallShield.DigitalWizard|Insuran\.|Intellig|Intelliseek|InterGET|Internet.Ninja|Internet.x|Internet\_Explorer|InternetLinkagent|InternetSeer.com|Intraf|IP2|Ipsel|Iria|IRLbot|Iron33|Irvine|ISC\_Sys|iSilo|ISRCCrawler|ISSpi|IUPUI.Research.Bot|Jady|Jaka|Jam|^Java|java\/|Java\(tm\)|JBH.agent|Jenny|JetB|JetC|jeteye|jiro|JoBo|JOC|jupit|Just|Jyx|Kapere|kash|Kazo|KBee|Kenjin|Kernel|Keywo|KFSW|KKma|Know|kosmix|KRAE|KRetrieve|Krug|ksibot|ksoap|Kum|KWebGet|Lachesis|lanshan|Lapo|larbin|leacher|leech|LeechFTP|LeechGet|leipzig\.de|Lets|Lexi|lftp|Libby|libcrawl|libcurl|libfetch|libghttp|libWeb|libwhisker|libwww|libwww\-FM|libwww\-perl|LightningDownload|likse|Linc|Link|Link.Sleuth|LinkextractorPro|Linkie|LINKS.ARoMATIZED|LinkScan|linktiger|LinkWalker|Lint|List|lmcrawler|LMQ|LNSpiderguy|loader|LocalcomBot|Locu|London|lone|looksmart|loop|Lork|LTH\_|lwp\-request|LWP|lwp-request|lwp-trivial|Mac.Finder|Macintosh\;.I\;.PPC|Mac\_F|magi|Mag\-Net|Magnet|Magp|Mail.Sweeper|main|majest|Mam|Mana|MarcoPolo|mark.blonin|MarkWatch|MaSagool|Mass|Mass.Downloader|Mata|mavi|McBot|Mecha|MCspider|mediapartners|^Memo|MEGAUPLOAD|MetaProducts.Download.Express|Metaspin|Mete|Microsoft.Data.Access|Microsoft.URL|Microsoft\_Internet\_Explorer|MIDo|MIIx|miner|Mira|MIRE|Mirror|Miss|Missauga|Missigua.Locator|Missouri.College.Browse|Mist|Mizz|MJ12|mkdb|mlbot|MLM|MMMoCrawl|MnoG|moge|Moje|Monster|Monza.Browser|Mooz|Moreoverbot|MOT\-MPx220|mothra\/netscan|mouse|MovableType|Mozdex|Mozi\!|^Mozilla$|Mozilla\/1\.22|Mozilla\/22|^Mozilla\/3\.0.\(compatible|Mozilla\/3\.Mozilla\/2\.01|Mozilla\/4\.0\(compatible|Mozilla\/4\.08|Mozilla\/4\.61.\(Macintosh|Mozilla\/5\.0|Mozilla\/7\.0|Mozilla\/8|Mozilla\/9|Mozilla\:|Mozilla\/Firefox|^Mozilla.*Indy|^Mozilla.*NEWT|^Mozilla*MSIECrawler|Mp3Bot|MPF|MRA|MS.FrontPage|MS.?Search|MSFrontPage|MSIE\_6\.0|MSIE6|MSIECrawler|msnbot\-media|msnbot\-Products|MSNPTC|MSProxy|MSRBOT|multithreaddb|musc|MVAC|MWM|My\_age|MyApp|MyDog|MyEng|MyFamilyBot|MyGetRight|MyIE2|mysearch|myurl|NAG|NAMEPROTECT|NASA.Search|nationaldirectory|Naver|Navr|Near|NetAnts|netattache|Netcach|NetCarta|Netcraft|NetCrawl|NetMech|netprospector|NetResearchServer|NetSp|Net.Vampire|netX|NetZ|Neut|newLISP|NewsGatorInbox|NEWT|NEWT.ActiveX|Next|^NG|NICE|nikto|Nimb|Ninja|Ninte|NIPGCrawler|Noga|nogo|Noko|Nomad|Norb|noxtrumbot|NPbot|NuSe|Nutch|Nutex|NWSp|Obje|Ocel|Octo|ODI3|oegp|Offline|Offline.Explorer|Offline.Navigator|OK.Mozilla|omg|Omni|Onfo|onyx|OpaL|OpenBot|Openf|OpenTextSiteCrawler|OpenU|Orac|OrangeBot|Orbit|Oreg|osis|Outf|Owl|P3P|PackRat|PageGrabber|PagmIEDownload|pansci|Papa|Pars|Patw|pavu|Pb2Pb|pcBrow|PEAR|PEER|PECL|pepe|Perl|PerMan|PersonaPilot|Persuader|petit|PHP|PHP.vers|PHPot|Phras|PicaLo|Piff|Pige|pigs|^Ping|Pingd|PingALink|Pipe|Plag|Plant|playstarmusic|Pluck|Pockey|POE\-Com|Poirot|Pomp|Port.Huron|Post|powerset|Preload|press|Privoxy|Probe|Program.Shareware|Progressive.Download|ProPowerBot|prospector|Provider.Protocol.Discover|ProWebWalker|Prowl|Proxy|Prozilla|psbot|PSurf|psycheclone|^puf$|Pulse|Pump|PushSite|PussyCat|PuxaRapido|PycURL|Pyth|PyQ|QuepasaCreep|Query|Quest|QRVA|Qweer|radian|Radiation|Rambler|RAMP|RealDownload|Reap|Recorder|RedCarpet|RedKernel|ReGet|relevantnoise|replacer|Repo|requ|Rese|Retrieve|Rip|Rix|RMA|Roboz|Rogue|Rover|RPT\-HTTP|Rsync|RTG30|.ru\)|ruby|Rufus|Salt|Sample|SAPO|Sauger|savvy|SBIder|SBP|SCAgent|scan|SCEJ\_|Sched|Schizo|Schlong|Schmo|Scout|Scooter|Scorp|ScoutOut|SCrawl|screen|script|SearchExpress|searchhippo|Searchme|searchpreview|searchterms|Second.Street.Research|Security.Kol|Seekbot|Seeker|Sega|Sensis|Sept|Serious|Sezn|Shai|Share|Sharp|Shaz|shell|shelo|Sherl|Shim|Shiretoko|ShopWiki|SickleBot|Simple|Siph|sitecheck|SiteCrawler|SiteSnagger|Site.Sniper|SiteSucker|sitevigil|SiteX|Sleip|Slide|Slurpy.Verifier|Sly|Smag|SmartDownload|Smurf|sna\-|snag|Snake|Snapbot|Snip|Snoop|So\-net|SocSci|sogou|Sohu|solr|sootle|Soso|SpaceBison|Spad|Span|spanner|Speed|Spegla|Sphere|Sphider|spider|SpiderBot|SpiderEngine|SpiderView|Spin|sproose|Spurl|Spyder|Squi|SQ.Webscanner|sqwid|Sqworm|SSM\_Ag|Stack|Stamina|stamp|Stanford|Statbot|State|Steel|Strateg|Stress|Strip|studybot|Style|subot|Suck|Sume|sun4m|Sunrise|SuperBot|SuperBro|Supervi|Surf4Me|SuperHTTP|Surfbot|SurfWalker|Susi|suza|suzu|Sweep|sygol|syncrisis|Systems|Szukacz|Tagger|Tagyu|tAke|Talkro|TALWinHttpClient|tamu|Tandem|Tarantula|tarspider|tBot|TCF|Tcs\/1|TeamSoft|Tecomi|Teleport|Telesoft|Templeton|Tencent|Terrawiz|Test|TexNut|trivial|Turnitin|The.Intraformant|TheNomad|Thomas|TightTwatBot|Timely|Titan|TMCrawler|TMhtload|toCrawl|Todobr|Tongco|topic|Torrent|Track|translate|Traveler|TREEVIEW|True|Tunnel|turing|Turnitin|TutorGig|TV33\_Mercator|Twat|Tweak|Twice|Twisted.PageGetter|Tygo|ubee|UCmore|UdmSearch|UIowaCrawler|Ultraseek|UMBC|unf|UniversalFeedParser|unknown|UPG1|UtilMind|URLBase|URL.Control|URL\_Spider\_Pro|urldispatcher|URLGetFile|urllib|URLSpiderPro|URLy|User\-Agent|UserAgent|USyd|Vacuum|vagabo|Valet|Valid|Vamp|vayala|VB\_|VCI|VERI\~LI|verif|versus|via|Viewer|virtual|visibilitygap|Visual|vobsub|Void|VoilaBot|voyager|vspider|VSyn|w\:PACBHO60|w0000t|W3C|w3m|w3search|walhello|Walker|Wand|WAOL|WAPT|Watch|Wavefire|wbdbot|Weather|web.by.mail|Web.Data.Extractor|Web.Downloader|Web.Ima|Web.Mole|Web.Sucker|Web2Mal|Web2WAP|WebaltBot|WebAuto|WebBandit|Webbot|WebCapture|WebCat|webcraft\@bea|Webclip|webcollage|WebCollector|WebCopier|WebCopy|WebCor|webcrawl|WebDat|WebDav|webdevil|webdownloader|Webdup|WebEMail|WebEMailExtrac|WebEnhancer|WebFetch|WebGo|WebHook|Webinator|WebInd|webitpr|WebFilter|WebFountain|WebLea|Webmaster|WebmasterWorldForumBot|WebMin|WebMirror|webmole|webpic|WebPin|WebPix|WebReaper|WebRipper|WebRobot|WebSauger|WebSite|Website.eXtractor|Website.Quester|WebSnake|webspider|Webster|WebStripper|websucker|WebTre|WebVac|webwalk|WebWasher|WebWeasel|WebWhacker|WebZIP|Wells|WEP\_S|WEP.Search.00|WeRelateBot|wget|Whack|Whacker|whiz|WhosTalking|Widow|Win67|window.location|Windows.95\;|Windows.95\)|Windows.98\;|Windows.98\)|Winodws|Wildsoft.Surfer|WinHT|winhttp|WinHttpRequest|WinHTTrack|Winnie.Poh|wire|WISEbot|wisenutbot|wish|Wizz|WordP|Works|world|WUMPUS|Wweb|WWWC|WWWOFFLE|WWW\-Collector|WWW.Mechanize|www.ranks.nl|wwwster|^x$|X12R1|x\-Tractor|Xaldon|Xenu|XGET|xirq|Y\!OASIS|Y\!Tunnel|yacy|YaDirectBot|Yahoo\-MMAudVid|YahooSeeker|YahooYSMcm|Yamm|Yand|yang|Yeti|Yoono|yori|Yotta|YTunnel|Zade|zagre|ZBot|Zeal|ZeBot|zerx|Zeus|ZIPCode|Zixy|zmao|Zyborg [NC]
RewriteRule ^(.*)$ - [F,L]
</IfModule>

And, for those of you who enjoy looking at long lists of bad robots, here is the same blacklist of 1211 banned user agents in uncompressed format:

[ Uncompressed view of the Ultimate User-Agent Blacklist (click image for full-size view) ]
Click image for full-size uncompressed view of 1211 blocked user-agents

I love this game :)

Notes

  • 1 Special thanks to “Mr. M” for graciously sharing his extensive user-agent list and granting permission to integrate them into this version of the blacklist. Thanks M! :)
  • 2 Free iPod Nano plus honorable mention in my next article for the first person to identify my three “imaginary” (i.e., fake) user agents. Good luck! ;)

About the author

[ Jeff Starr ]

Jeff Starr is a web developer, graphic designer and content producer with over 10 years of experience and a passion for quality and detail. Jeff is co-author of the book Digging into WordPress and strives to help people be the best they can be on the Web. + Follow Jeff on Twitter and subscribe to Perishable Press for awesome web-design content delivered fresh.


59 Responses

Add a comment

[ Gravatar Icon ]

Lisa#1

Wow! This is one *huge* list. You could’ve charged people just for viewing this post and I’m sure most of us wouldn’t mind forking out some money just to take a peek at this ;)

I’ve been getting some CPU Load spikes (no thanks to irresponsible bots & spammers *!%$) on my server for the past 2 weeks…I hope this blacklist can help bring down the load on the server.

Btw, what do you mean by ‘three “imaginary” user agents’? Googlebot, MSNbot, Yahoo! Slurp?

Anyways, thanks for this amazing post ! :D

[ Gravatar Icon ]

Donace | The Nexus#2

That list is to huge! lol, I tried to find the fake ones but then I looked at the list!

[ Gravatar Icon ]

Lisa#3

Lol, now I get it. There’s 3 *fake* user-agents in the list. Is it…”dumb”, “fuck” & “human”?

[ Gravatar Icon ]

Jeff Starr#4

@Lisa: I hope you don’t mean that I could have charged people to view the blacklist in like a “freakshow” kind of way. Like, “step right up and take a peek at the world’s most hideously long HTAccess Blacklist!” Weird carnival music playing in dark tents and that sort of thing..

And yes, I meant “fake” more than “imaginary”, but sometimes my mind just poops out after too much writing. I have updated the footnote with this term, btw. And, no, the three fake user agents are not as you suggest ;)

@Donace: I know! There are almost 4 times as many bad bots in this list than in the previous “Ultimate” blacklist. Plus, if you consider the fact that we are dealing with regular-expressions and pattern matching for each of the terms, the number of blocked bots is probably somewhere in the hundreds of thousands or even millions.

[ Gravatar Icon ]

Jonathan Ellse#5

@ Lisa
I get it, but can’t see how you can tell.

@ Jeff Brilliant list, thanks.

[ Gravatar Icon ]

Andrew#6

hm…

“addressendeutschland”, “Alek”, and “black.hole”?

That’s a huge list, though :D

[ Gravatar Icon ]

Jeff Starr#7

@Andrew: Nope, those strings address names of “real” user agents, believe it or not.. :)

[ Gravatar Icon ]

B. Moore#8

That list is insane!!!

But the threats we deal with are just as insane, eye for an eye right!

Thank you for taking the time to compile & keep your list current.

mucho gracias!

[ Gravatar Icon ]

Donace | The Nexus#9

I think looking at all of this it harks back to Louis earlier idea of a whitelist;

i.e. if url is not X then redirect to a 404; lets bring back ye olde simple pages and just integrate .js for ‘fancy’ features.

[ Gravatar Icon ]

eezz#10

Hi, I think this is fantastic. I have one issue, my server gives a 500 error and the log shows :

RewriteCond: cannot compile regular expression '^$|\\|\\'|\\%|\\_iRc|\\_Works|\\@\\$x|\\&lt;\\?|\\$x0e|\\+select\\+|\\+union\\+|1\\,\\1\\,1\\,|2icommerce|3GSE|4all|59\\.64\\.153\\.|88\\.0\\.106\\.| … etc

Let me know if you need any more info…

[ Gravatar Icon ]

eezz#11

Well, I got it working, needed to fix a few things and break it over two lines, here are the results, let me know if this will still work as I dont really know htaccess code that well:

[ Editor’s note: Click here to see this revised version of the user-agent blacklist ]

[ Gravatar Icon ]

eezz#12

Hi,

I found the issue was 1\,\1\,1\,...

Needed to be 1\,1\,1\, for it to work.

Plus some bots are blocked:

  • Googlebot-Image/1.0
  • Mediapartners-Google
  • Feedfetcher-Google
  • Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

I think these are alright, so I removed:

aster
webmaster
mediapartners
Googlebot\-Image
image
fetch

[ Gravatar Icon ]

eezz#13

Hi again,

Something else I needed to add…

Firefox.2\.0 blocks valid Firefox UA eg:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.20) Gecko/20081217 Firefox/2.0.0.20

Needed to either change to Firefox\.2\.0 or delete it.

Plus I need to separate the list into 2 lines other wise I get:
Invalid command 't|UserAgent|USyd|Vacuum|vagabo| … etc.

The number of characters on the line that allows it to work without error is around 8140.

[ Gravatar Icon ]

Jeff Starr#14

@Donace: That would be nice indeed, but we both know there’s no turning back now! ;P

@eezz: Thanks for the help fine-tuning the blacklist. I am going to crop out your revised rules and link to them from your comment in an external file.

[ Gravatar Icon ]

Deb Phillips#15

Re: eezz’s Comment #12: Would removing “fetch” (Feed-Fetcher.Google) prevent a subscriber from accessing my blog using Google Reader (or any other feed reader)? I assume it’s included in the list because it facilitates “stealing” website content?

[ Gravatar Icon ]

Deb Phillips#16

I’m a novice at a lot of this coding, so I’m going to ask something that is probably obvious to most of you, but….

In order to use the revised code linked to in Comment #11, do I simply PASTE IT IN PLACE OF the code that’s between RewriteEngine on and RewriteRule ^(.*)$ - [F,L] in the ORIGINAL blacklist?

Thanks for letting me ask a very elementary question!

[ Gravatar Icon ]

Jeff Starr#17

@Deb Phillips: Re: Comment #15, “fetch” unintentionally blocks Google Feedfetcher. As eezz points out, if you remove “fetch” Feedfetcher should again have access to your site. But then, so will any of the bad bots that were otherwise blocked by the rule. There is no reason that I know of to block Feedfetcher — it’s generally considered to one of the “good guys”.

Re: Comment #16, Yes, exactly. eezz’s revised code would go in the following location:

<IfModule mod_rewrite.c>
RewriteEngine on

# eezz's code goes here

RewriteRule ^(.*)$ - [F,L]
</IfModule>

You could even remove the empty line between eezz’s two directives if desired. Let me know if I may be of any further assistance with this :)

[ Gravatar Icon ]

Deb Phillips#18

Jeff, thanks for your answers to my Comments #15 and #16.

Follow-up question:
Is there a way to leave “fetch” in the blacklist but somehow allow access by Google Feedfetcher? (I have a feeling the answer may be “no,” but I wanted to ask.)

If not, then if I want to allow Google Feedfetcher access, I just have to accept that other bots using that rule will have access, right?

Thanks again for your help.

[ Gravatar Icon ]

Publix#19

Hello Jeff,

I have a question. Is the “IfModule mod_rewrite.c” and “IfModule” necessary?

[ Gravatar Icon ]

Publix#20

Hi again,

I’ve recently implemented this blacklist and I noticed a bot with the user agent “Custom Spider Bot 2.0″ has been blocked. The bot is from an advertising company so it’s not possible for me to request them to change their user-agent.

Is there any way to make an exception for the particular bot. If no, should I remove “|spider|” from the list?

Another question is if I add “|cool|” to the list, does this means that all bots that has the “cool” phrase as their user-agent such as “This is a cool bot” or “Cool bot was here” will get blocked?

Btw, I’m really noob when it comes to htaccess…

Sincerely,
Publix

[ Gravatar Icon ]

Jeff Starr#21

@Deb: Correct, by removing the “fetch” string, you are allowing anything that was previously blocked because of that string. There is a trade-off between allowing one or two important bots versus blocking a zillion bad ones. I would remove “fetch” if I were using the list myself. Feedfetcher is too important.

@Publix: The module-check containers are not required if you know for certain that your server will always have the required modules. Best practice is to leave them in the code, but feel free to remove them if you are sure about your configuration.

As for the pattern-matching questions, it is as you say. Any instance of the “whatever” character string that is found within the user-agent name will cause that particular user agent to be blocked. Also, this list specifies [NC], which means that the pattern matching is not case-sensitive.

[ Gravatar Icon ]

Ukr#22

Thank you for all of your great materials. Does your ultimate user agent blacklist include the following:

Toata dragostea mea pentru diavola

This idiot has been hitting all of my sites. Demonic jerk.

…And by the way, I have no love for the devil, by quoting this user agent’s name.

[ Gravatar Icon ]

Jeff Starr#23

@Ukr: Not currently, but it would be easy enough to block. I would add “diavola” to the proper alphabetical location in the list. And don’t forget to follow the pattern so that there is a vertical bar ( | ) before and after the term. Then, any user agent containing that character string whatsoever will be blocked. I hope that helps!

[ Gravatar Icon ]

john#24

I have added the list from http://perishablepress.com/press/wp-content/online/code/user-agent-blacklist_revised.txt to my htaccess. I noticed that the htaccess rule made a bit slow down my site. I would like to know how can I add the list by using these command:

SetEnvIf User-Agent "the list of bad bots here" bad-bot

order allow,deny
allow from all
deny from env=bad-bot

Thank you very much.

[ Gravatar Icon ]

Jeff Starr#25

Hi john,

Here is the official Apache documentation for the setenvif module:

http://httpd.apache.org/docs/2.2/mod/mod_setenvif.html

That is a great place to start. I wish I had more time to research this myself, so I hope that the information contained in the Apache docs will get you going in the right direction. Also, you may be able to learn something about the process by searching for “SetEnvIf User-Agent” on Google.

[ Gravatar Icon ]

Ravi#26

Hello,
Thanks for the list of user agents.
I find user agents like this:
mozilla 4.0
mozilla 5.0
etc..

However, when I check my logs, I find google and many other good bots are with mozialla 5.0

So, will this blog google and other good visitors?

Please refine the list.
Thank you.

[ Gravatar Icon ]

Rama#27

Hi,
I am not good at htaccess codes.
Will such a big htaccess file slow down the loading of my site?
web site loading speed is the greatest concern for me.

Anyhow, thanks for providing such a great code.
Thank you.

[ Gravatar Icon ]

Jeff Starr#28

@Ravi: I could have sworn Google was using their GoogleBot crawlers.. do you have any examples of log entries showing Google using “mozialla 5.0″?

@Rama: I have not tested this list specifically, but have seen much bigger lists in play that don’t seem to have much of an impact on performance. But then again, I’m not going to sit here and tell you that it doesn’t have an effect - the server has to process all of those matches for every valid page request, so probably not advisable for high-volume traffic sites and/or slow servers. One thing you can do to improve performance in general is to add the following line to your root htaccess file:

AllowOverride None

For more info on this method, see my article Stupid htaccess Tricks.

[ Gravatar Icon ]

PhilB WordPress#29

Googlebot is indeed using Mozilla (”mozialla” is either a typo or that goth waitress at the Hyde Park Cafe in Austin).

Googlebot-Mozilla-2.1
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This is the one I catch following js links.

[ Gravatar Icon ]

Jeff Starr#30

@PhilB WordPress: Interesting, this is the first time I’ve seen it. Has anyone performed a forward-reverse IP lookup to verify identity? Could change the game (at least for me) just a little ;)

[ Gravatar Icon ]

PhilB WordPress#31

Here is the one IP that hit several sites I work with looking for various txt files:

221.131.61.08

[ Gravatar Icon ]

Jeff Starr#32

I’m not sure that IP is from Google.. I checked via http://www.lookupserver.com/ and it returned no result — something that generally doesn’t happen with legitimate names/IPs. Is that the only IP associated with the Googlebot-Mozilla-2.1 agent?

[ Gravatar Icon ]

PhilB WordPress#33

Bad copy on my part - but its actually been documented for at least three years. I thought it was common knowledge.

See:

http://ekstreme.com/thingsofsorts/seosem/googlebot-requested-a-css-file

and

http://www.google.com/support/forum/p/Webmasters/thread?tid=79b23caa49b801a3&hl=en

for just two of hundreds of examples. Its been listed on user-agents.org for awhile as well.

[ Gravatar Icon ]

Jeff Starr#34

“common knowledge” - yeah, right.

In any case, it’s certainly interesting, and I can’t help but wonder if this particular Googlebot isn’t exclusively concerned with locating code snippets for the Google Code search engine (as one of the ekstreme comments also suggest). If so, then blocking it may serve a purpose, especially for anyone who would like to keep their hidden content away from Google. If not, then I suppose it is just as easy to remove the Mozilla patterns from the blacklist. Either way, it is very interesting to hear about this. Thank you for bringing it to my attention.

[ Gravatar Icon ]

Randomnesss#35

You SERIOUSLY tell us to ban curl/libcurl/PyCurl? Get a brain lol

[ Gravatar Icon ]

Jeff Starr#36

Actually, no. I merely provide information. As an informed, intelligent human being, you get to decide for yourself whether or not to use the info or make any changes according to your needs.

[ Gravatar Icon ]

RS2IP#37

Hi,

First of all I would like to thank you for the biggest user-agent blacklist :)

Only one downside though, when using this, I can’t verify my site with Google Webmaster Central…it gives a 403 Forbidden error..

[ Gravatar Icon ]

Jeff Starr#38

Hi RS2IP, there are some rules in the list that may be blocking that particular user agent. You could resolve this by determining the user agent that is being blocked (Google Webmaster Central) and then searching the list for anything might be blocking it. Conversely, you could use the halving method to diagnose and remove the offending rule(s). Totally doable! :)

[ Gravatar Icon ]

BK#39

I copy and paste the referrers list, save my .htaccess file but as soon as I do I get a 500 error.

First with referrers, here is the error log

/bla/bla/public_html/bla1/.htaccess: RewriteCond: bad flag delimiters, referer: http://www.bla1.bla.com/

When I removed all the lines such as the below line it worked. Anything with https seems to not work, do I need to preceed that with
RewriteCond %{HTTP_REFERER}

https?:\/\/([^\/]*\.)?skin-trt\.boom\.ru

SECOND PROBLEM, when I copy and paste the 4G Series: The Ultimate User-Agent Blacklist, Featuring Over 1200 Bad Bots

I get the below errors

/bla/bla/public_html/bla1/.htaccess: RewriteCond: cannot compile regular expression '^$|\\|\\'|\\%|\\_iRc|\\_Works|\\@\\$x|\\&lt;\\?|\\$x0e|\\+select\\+|\\+union\\+|1\\,\\1\\,1', referer: http://www.bla1.bla.com/contact-us/contacts/webmaster

and

/bla/bla/public_html/bla1/.htaccess: RewriteCond: cannot compile regular expression '^$|\\|\\'|\\%|\\_iRc|\\_Works|\\@\\$x|\\&lt;\\?|\\$x0e|\\+select\\+|\\+union\\+|1\\,\\1\\,1\\,|2icommerce| and this continues until the end of the list.

What did I do wrong?

When I use the 2G compressed list it works with no problems.
I have not tried the 3G black list.
Lastly, does the 4G blacklist do everything that 2 & 3G do?

[ Gravatar Icon ]

RapidShare to IP Address#40

It seems that Google also uses the following user-agent “Google-Site-Verification/1.0”. So the solution is to remove “verif” from the blacklist.

Another issue I’m encountering is that users using Windows Vista with Firefox 3.5.5 or Internet Explorer 8 are getting 403 Forbidden errors…

P.S. I just realised that I accidentally posted the above message at the wrong post. Sorry about that..

[ Gravatar Icon ]

Jeff Starr#41

@BK: not sure about the first issue, but the second problem I think has been addressed in a previous comment. The problem is the immense length of the expression, and the solution is to break it into two different lines, which has been graciously done for us by eezz. You can grab the dual-line version here:

http://perishablepress.com/press/wp-content/online/code/user-agent-blacklist_revised.txt

And, the 4G does most of what the previous versions do, plus a whole lot more. Much of the rules that were changed were done so to optimize performance, precision, and effectiveness of the blacklist. You can read more about the theory behind the 4G Blacklist for more information.

[ Gravatar Icon ]

Jeff Starr#42

@RapidShare to IP Address: Thanks for the heads up on the verif fix for Google.

I don’t know what the issue could be with Vista Fx or IE8, but identifying the offending term should be relatively easy using the halving method of identifying problematic code.

[ Gravatar Icon ]

zae#43

Hi,

Thank for providing this great list.

I create a log file and found that “baidu.com” caught by term “spider”:

---
HTTP_USER_AGENT: Baiduspider+(+http://www.baidu.com/search/spider.htm)
---

FYI, baidu.com is legitimate search engine.

@zae

[ Gravatar Icon ]

zae#44

This user agent also has been blocked by keyword “PHP”.

HTTP_USER_AGENT == Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 FirePHP/0.3

I think, all visitor with FirePHP add-on installed on they mozilla browser will be blocked by keyword “PHP

Thanks,
zae

[ Gravatar Icon ]

zae#45

YahooMobile-CrawlerEngine has been blocked by keyword “Link”. Here the user-agent:

HTTP_USER_AGENT == "Nokia6682/2.0 (3.01.1) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 configuration/CLDC-1.1 UP.Link/6.3.0.0.0 (compatible;YahooSeeker/M1A1-R2D2; http://help.yahoo.com/help/us/ysearch/crawling/crawling-01.html)"

For information, see http://help.yahoo.com/help/us/ysearch/crawling/crawling-01.html

Thanks,
zae

[ Gravatar Icon ]

zae#46

Other keyword block YahooMobile are “seeker” and “YahooSeeker”.

If we want to block YahooMobile, just keep that keyword. Otherwise, we should delete keyword: Link, seeker and YahooSeeker.

Thanks,
zae

[ Gravatar Icon ]

zae#47

Another report…

- Keyword “98”: block visitor using Windows 98

- Keyword “MEGAUPLOAD”: block visitor who use Firefox browser with “megaupload” add-on installed.

- Kewword “wire”: block some valid browser, including Safari on Mac.

- Keyword “agent”: block some of visitor with Firefox on Linux, ie this user-agent: “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.3; ips-agent) Gecko/20090824 Fedora/1.0.7-1.1.fc4 Firefox/3.5.3”. Note: that come from my friend computer, using Linux-Fedora.

I’ll add another report later.

Thanks,
zae

[ Gravatar Icon ]

david#48

Thanks for the user agent list. I’m a novice at htaccess codes.

I’m using eezz’s revised code, but it block opera mini user agent “Opera/9.50 (J2ME/MIDP; Opera Mini/4.0.9800/209; U; en)” and “BlackBerry9700/5.0.0.321 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/308”.

How to unblock this user agent?

[ Gravatar Icon ]

multiverse#49

Hello Jeff,

I have deployed the revised version of this list, which has had a very positive effect on my bandwidth and site performance. Kudos on an excellent article, and thanks for making it available for us all. I have received a number of reports where people aren’t being allowed onto the site. I was successful at finding one of the problems and fixing it (AtHome is a Norwegian ISP), not a problem so much as an unwanted filter. Alas the two lingering issues I have after a great deal of analysis (links to test files, log diving and etc.) I am unable to solve. I want to show you one of the problems hoping that you can give some advice on how to solve that type of issue as it relates to your filters.

Access Log Hit
http://pastebin.com/m7891b746

This hit results in a redirection to another website, which is what I want it to do. The trouble is that I cannot find in your rewrite conditions what is being caught. Can you give me some clue?

Many thanks,
multiverse

[ Gravatar Icon ]

Jeff Starr#50

@multiverse: the easiest way to isolate problem code is to remove half of it, check for proper functionality, and then remove another half, and so on until you discover the culprit. More info on this technique here.

[ Gravatar Icon ]

multiverse#51

@Jeff Thanks for taking the time to reply. That’s obviously an excellent strategy, and I’ll begin to use to troubleshoot this issue. Again, many thanks.

[ Gravatar Icon ]

Shane#52

Thanks Jeff for the excellent list.
My Joomla site is pumping 30gig of bandwidth/mth, averaging 12k unique visits. Wonder if that’s normal?

Tried your first list, but got 500 error. Used Eeze’s list, made the other alterations suggested further down the list, and it works like a dream.

I’ll watch bandwidth and see what happens.

[ Gravatar Icon ]

Derek#53

hola.

Like many said, you saved me hours and many concerns about security.

Put up a donate button.

Thanx a million

[ Gravatar Icon ]

multiverse#54

@Jeff @everyone I wanted to follow up on my problem and what I’ve learned about it. After posting my question here, most complaints were from mobile users. The iPhone’s Safari works just fine on my 3GS. However our Android users and Windows CE users were the ones reporting continued issues. I decided that users who were being filtered out were in such a small population of users that it was more important to have the site security and that they would have to do without until we move to the next generation of filtering.

That’s a terrible strategy, because we expect mobile device users to increase going forward. With that in mind, one of my co-admins suggested we deploy the Tapatalk plugin, which has enabled our mobile users to have access. That too is a bad strategy, which we hope to take care of when we move to the next generation of filtering, though it works as short term kludge.

[ Gravatar Icon ]

Jeff Starr#55

@multiverse: Thanks for the information - much appreciated.

Trackbacks / Pingbacks
  1. Highlight post author in comments in Wordpress | Your Site is Valid Blog
  2. 晓闻心雨 » 十招阻止WordPress中的垃圾评论
  3. Spami durmanın 10 yolu | nettuts
  4. Le top 10 des astuces anti-spam pour Wordpress » Inside da web
Share your thoughts..

Read Comment Policy

Comment Rules: No spam. No profanity. Use your real name. You may use simple HTML tags for style. Wrap all code in <code> tags. Learn more.



Attention: Do NOT follow this link!