4G Series: The Ultimate User-Agent Blacklist, Featuring Over 1200 Bad Bots

Posted on March 29, 2009 in Websites by

As discussed in my recent article, Eight Ways to Blacklist with Apache’s mod_rewrite, one method of stopping spammers, scrapers, email harvesters, and malicious bots is to blacklist their associated user agents. Apache enables us to target bad user agents by testing the user-agent string against a predefined blacklist of unwanted visitors. Any bot identifying itself as one of the blacklisted agents is immediately and quietly denied access. While this certainly isn’t the most effective method of securing your site against malicious behavior, it does provide another layer of protection.
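
As a minimal sketch of the idea, assuming mod_rewrite is available, the following rule set denies any request whose user agent contains one of three made-up names (EvilScraper, BadHarvester, and SpamBot are placeholders, not real bots):

```apache
<IfModule mod_rewrite.c>
RewriteEngine on
# Test the User-Agent header against a short, hypothetical blacklist
RewriteCond %{HTTP_USER_AGENT} (EvilScraper|BadHarvester|SpamBot) [NC]
# Any match is refused with 403 Forbidden
RewriteRule ^(.*)$ - [F,L]
</IfModule>
```

The [NC] flag makes the match case-insensitive, and the [F] flag returns the 403 Forbidden response.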

Even so, there are several things to consider before choosing to implement an extensive user-agent blacklist on your site. First and most important is the transient nature of the user agent itself. On most systems, the user-agent variable is easy to change, making it possible for bot owners to use any user-agent name they wish. Once a bad bot makes the rounds, becomes known, and is blacklisted, the bot owner need only change its declared user agent and they’re back in business. User-agent names are constantly invented, spoofed, or otherwise altered in order to operate beneath (or above) the virtual radar. Thus, a user-agent blacklist is a high-maintenance affair, requiring continuous cultivation in order to maintain relevancy and effectiveness.

Performance is another important issue to consider. While a well-maintained user-agent blacklist may average a reasonable number of user agents, blacklists that are simply appended with new names will eventually grow painfully large and ultimately decrease server performance. Then you’re left with a never-ending blacklist of retired user agents that fails to protect your site while slowing things down to a virtual crawl (no pun intended). And despite your best intentions, we both know that taking time for periodic “blacklist maintenance” is a luxury that simply doesn’t exist, at least for most of us.

As if those reasons weren’t enough to persuade you against using an ultimate user-agent blacklist, here is another: the 4G Blacklist. Put simply, the 4G Blacklist is a more effective way to protect your site against a wide variety of spam, exploits, and malicious attacks. Unlike huge lists of banned user agents, the 4G Blacklist requires zero maintenance, consumes fewer resources, and may retain its effectiveness indefinitely. But alas, for those of you who are still determined to get your hands on the latest “ultimate” user-agent blacklist, here you go..

The Ultimate User-Agent Blacklist

As you may recall, the original Ultimate HTAccess Blacklist was released here at Perishable Press a couple of years ago. Then, several months later, I added more bad user agents, compressed the list into single-line format, and released the Ultimate HTAccess Blacklist 2. This list contained over 300 bad bots and was generally well-received by the community, protecting many sites against a plethora of site rippers, grabbers, spammers, harvesters, bad bots, and other online scum. When used as a solid foundation on which to build and cultivate your own user-agent blacklist, the Ultimate HTAccess Blacklist can do wonders to improve overall performance, decrease site maintenance, and reduce server expense.

Now, in this new and improved version of the Ultimate User-Agent Blacklist, I have integrated my recent collection [1] of actively malicious bad bots to more than quadruple the number of blocked user agents. This new list features a whopping 1211 blacklisted user agents, including three of my own creation [2] to be used exclusively for my diabolical and obsessive monitoring purposes (insert maniacal laughter here). Also, as with the second version of the user-agent blacklist, this new version is written in compressed, single-line format to facilitate usability and performance.

So, without further ado, here is the third incarnation of the Ultimate User-Agent Blacklist. Simply copy and paste the following code into the root HTAccess file of your site to enjoy a serious reduction in wasted bandwidth, stolen resources, and comment spam. Remember to back up your stuff before you meddle with things, and always test, test, test whenever implementing HTAccess directives.

# PERISHABLE PRESS ULTIMATE USER-AGENT BLACKLIST

<IfModule mod_rewrite.c>
RewriteEngine on
# Originally published as a single line. It is split here into several
# conditions joined with [OR], because Apache rejects configuration lines
# longer than roughly 8 KB (see the comments below). The stray backreference
# in the "1,1,1," term, which kept the expression from compiling, is removed.
RewriteCond %{HTTP_USER_AGENT} ^$|\<|\>|\'|\%|\_iRc|\_Works|\@\$x|\<\?|\$x0e|\+select\+|\+union\+|1\,1\,1\,|2icommerce|3GSE|4all|59\.64\.153\.|88\.0\.106\.|98|85\.17\.|A\_Browser|ABAC|Abont|abot|Accept|Access|Accoo|AceFTP|Acme|ActiveTouristBot|Address|Adopt|adress|adressendeutschland|ADSARobot|agent|ah\-ha|Ahead|AESOP\_com\_SpiderMan|aipbot|Alarm|Albert|Alek|Alexibot|Alligator|AllSubmitter|alma|almaden|ALot|Alpha|aktuelles|Akregat|Amfi|amzn\_assoc|Anal|Anarchie|andit|Anon|AnotherBot|Ansearch|AnswerBus|antivirx|Apexoo|appie|Aqua_Products|Arachmo|archive|arian|ASPSe|ASSORT|aster|Atari|ATHENS|AtHome|Atlocal|Atomic_Email_Hunter|Atomz|Atrop|^attach|attrib|autoemailspider|autohttp|axod|batch|b2w|Back|BackDoorBot|BackStreet|BackWeb|Badass|Baid|Bali|Bandit|Baidu|Barry|BasicHTTP|BatchFTP|bdfetch|beat|Become|Beij|BenchMark|berts|bew|big.brother|Bigfoot|Bilgi|Bison|Bitacle|Biz360|Black|Black.Hole|BlackWidow|bladder.fusion|Blaiz|Blog.Checker|Blogl|BlogPeople|Blogshares.Spiders|Bloodhound|Blow|bmclient|Board|BOI|boitho|Bond|Bookmark.search.tool|boris|Bost|Boston.Project|BotRightHere|Bot.mailto:craftbot@yahoo.com|BotALot|botpaidtoclick|botw|brandwatch|BravoBrian|Brok|Bropwers|Broth|browseabit|BrowseX|Browsezilla|Bruin|bsalsa|Buddy|Build|Built|Bulls|bumblebee|Bunny|Busca|Busi|Buy|bwh3|c\-spider|CafeK|Cafi|camel|Cand|captu|Catch|cd34|Ceg|CFNetwork|cgichk|Cha0s|Chang|chaos|Char|char\(32\,35\)|charlotte|CheeseBot|Chek|CherryPicker|chill|ChinaClaw|CICC|Cisco|Cita|Clam|Claw|Click.Bot|clipping|clshttp|Clush|COAST|ColdFusion|Coll|Comb|commentreader|Compan|contact|Control|contype|Conc|Conv|Copernic|Copi|Copy|Coral|Corn|core-project|cosmos|costa|cr4nk|crank|craft|Crap|Crawler0|Crazy|Cres|cs\-CZ|cuill|Curl|Custo|Cute|CSHttp|Cyber|cyberalert|^DA$|daoBot|DARK|Data|Daten|Daum|dcbot|dcs|Deep|DepS|Detect|Deweb|Diam|Digger|Digimarc|digout4uagent|DIIbot|Dillo|Ding|DISC|discobot|Disp|Ditto|DLC|DnloadMage|DotBot|Doubanbot|Download|Download.Demon|Download.Devil|Download.Wonder|Downloader|drag [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DreamPassport|Drec|Drip|dsdl|dsok|DSurf|DTAAgent|DTS|Dual|dumb|DynaWeb|e\-collector|eag|earn|EARTHCOM|EasyDL|ebin|EBM-APPLE|EBrowse|eCatch|echo|ecollector|Edco|edgeio|efp\@gmx\.net|EirGrabber|email|Email.Extractor|EmailCollector|EmailSearch|EmailSiphon|EmailWolf|Emer|empas|Enfi|Enhan|Enterprise\_Search|envolk|erck|EroCr|ESurf|Eval|Evil|Evere|EWH|Exabot|Exact|EXPLOITER|Expre|Extra|ExtractorPro|EyeN|FairAd|Fake|FANG|FAST|fastlwspider|FavOrg|Favorites.Sweeper|Faxo|FDM\_1|FDSE|fetch|FEZhead|Filan|FileHound|find|Firebat|Firefox.2\.0|Firs|Flam|Flash|FlickBot|Flip|fluffy|flunky|focus|Foob|Fooky|Forex|Forum|ForV|Fost|Foto|Foun|Franklin.Locator|freefind|FreshDownload|FrontPage|FSurf|Fuck|Fuer|futile|Fyber|Gais|GalaxyBot|Galbot|Gamespy\_Arcade|GbPl|Gener|geni|Geona|Get|gigabaz|Gira|Ginxbot|gluc|glx.?v|gnome|Go.Zilla|Goldfire|Google.Wireless.Transcoder|Googlebot\-Image|Got\-It|GOFORIT|gonzo|GornKer|GoSearch|^gotit$|gozilla|grab|Grabber|GrabNet|Grub|Grup|Graf|Green.Research|grub|grub\-client|gsa\-cra|GSearch|GT\:\:WWW|GuideBot|guruji|gvfs|Gyps|hack|haha|hailo|Harv|Hatena|Hax|Head|Helm|herit|hgre|hhjhj\@yahoo|Hippo|hloader|HMView|holm|holy|HomePageSearch|HooWWWer|HouxouCrawler|HMSE|HPPrint|htdig|HTTPConnect|httpdown|http.generic|HTTPGet|httplib|HTTPRetriever|HTTrack|human|Huron|hverify|Hybrid|Hyper|ia\_archiver|iaskspi|IBM\_Planetwide|iCCra|ichiro|ID\-Search|IDA|IDBot|IEAuto|IEMPT|iexplore\.exe|iGetter|Ilse|Iltrov|Image|Image.Stripper|Image.Sucker|imagefetch|iimds\_monitor|Incutio|IncyWincy|Indexer|Industry.Program|Indy|InetURL|informant|InfoNav|InfoTekies|Ingelin|Innerpr|Inspect|InstallShield.DigitalWizard|Insuran\.|Intellig|Intelliseek|InterGET|Internet.Ninja|Internet.x|Internet\_Explorer|InternetLinkagent|InternetSeer.com|Intraf|IP2|Ipsel|Iria|IRLbot|Iron33|Irvine|ISC\_Sys|iSilo|ISRCCrawler|ISSpi|IUPUI.Research.Bot|Jady|Jaka|Jam|^Java|java\/|Java\(tm\)|JBH.agent|Jenny|JetB|JetC|jeteye|jiro|JoBo|JOC|jupit|Just|Jyx|Kapere|kash|Kazo|KBee|Kenjin|Kernel|Keywo|KFSW|KKma|Know|kosmix|KRAE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} KRetrieve|Krug|ksibot|ksoap|Kum|KWebGet|Lachesis|lanshan|Lapo|larbin|leacher|leech|LeechFTP|LeechGet|leipzig\.de|Lets|Lexi|lftp|Libby|libcrawl|libcurl|libfetch|libghttp|libWeb|libwhisker|libwww|libwww\-FM|libwww\-perl|LightningDownload|likse|Linc|Link|Link.Sleuth|LinkextractorPro|Linkie|LINKS.ARoMATIZED|LinkScan|linktiger|LinkWalker|Lint|List|lmcrawler|LMQ|LNSpiderguy|loader|LocalcomBot|Locu|London|lone|looksmart|loop|Lork|LTH\_|lwp\-request|LWP|lwp-request|lwp-trivial|Mac.Finder|Macintosh\;.I\;.PPC|Mac\_F|magi|Mag\-Net|Magnet|Magp|Mail.Sweeper|main|majest|Mam|Mana|MarcoPolo|mark.blonin|MarkWatch|MaSagool|Mass|Mass.Downloader|Mata|mavi|McBot|Mecha|MCspider|mediapartners|^Memo|MEGAUPLOAD|MetaProducts.Download.Express|Metaspin|Mete|Microsoft.Data.Access|Microsoft.URL|Microsoft\_Internet\_Explorer|MIDo|MIIx|miner|Mira|MIRE|Mirror|Miss|Missauga|Missigua.Locator|Missouri.College.Browse|Mist|Mizz|MJ12|mkdb|mlbot|MLM|MMMoCrawl|MnoG|moge|Moje|Monster|Monza.Browser|Mooz|Moreoverbot|MOT\-MPx220|mothra\/netscan|mouse|MovableType|Mozdex|Mozi\!|^Mozilla$|Mozilla\/1\.22|Mozilla\/22|^Mozilla\/3\.0.\(compatible|Mozilla\/3\.Mozilla\/2\.01|Mozilla\/4\.0\(compatible|Mozilla\/4\.08|Mozilla\/4\.61.\(Macintosh|Mozilla\/5\.0|Mozilla\/7\.0|Mozilla\/8|Mozilla\/9|Mozilla\:|Mozilla\/Firefox|^Mozilla.*Indy|^Mozilla.*NEWT|^Mozilla*MSIECrawler|Mp3Bot|MPF|MRA|MS.FrontPage|MS.?Search|MSFrontPage|MSIE\_6\.0|MSIE6|MSIECrawler|msnbot\-media|msnbot\-Products|MSNPTC|MSProxy|MSRBOT|multithreaddb|musc|MVAC|MWM|My\_age|MyApp|MyDog|MyEng|MyFamilyBot|MyGetRight|MyIE2|mysearch|myurl|NAG|NAMEPROTECT|NASA.Search|nationaldirectory|Naver|Navr|Near|NetAnts|netattache|Netcach|NetCarta|Netcraft|NetCrawl|NetMech|netprospector|NetResearchServer|NetSp|Net.Vampire|netX|NetZ|Neut|newLISP|NewsGatorInbox|NEWT|NEWT.ActiveX|Next|^NG|NICE|nikto|Nimb|Ninja|Ninte|NIPGCrawler|Noga|nogo|Noko|Nomad|Norb|noxtrumbot|NPbot|NuSe|Nutch|Nutex|NWSp|Obje|Ocel|Octo|ODI3|oegp|Offline|Offline.Explorer|Offline.Navigator|OK.Mozilla|omg|Omni [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Onfo|onyx|OpaL|OpenBot|Openf|OpenTextSiteCrawler|OpenU|Orac|OrangeBot|Orbit|Oreg|osis|Outf|Owl|P3P|PackRat|PageGrabber|PagmIEDownload|pansci|Papa|Pars|Patw|pavu|Pb2Pb|pcBrow|PEAR|PEER|PECL|pepe|Perl|PerMan|PersonaPilot|Persuader|petit|PHP|PHP.vers|PHPot|Phras|PicaLo|Piff|Pige|pigs|^Ping|Pingd|PingALink|Pipe|Plag|Plant|playstarmusic|Pluck|Pockey|POE\-Com|Poirot|Pomp|Port.Huron|Post|powerset|Preload|press|Privoxy|Probe|Program.Shareware|Progressive.Download|ProPowerBot|prospector|Provider.Protocol.Discover|ProWebWalker|Prowl|Proxy|Prozilla|psbot|PSurf|psycheclone|^puf$|Pulse|Pump|PushSite|PussyCat|PuxaRapido|PycURL|Pyth|PyQ|QuepasaCreep|Query|Quest|QRVA|Qweer|radian|Radiation|Rambler|RAMP|RealDownload|Reap|Recorder|RedCarpet|RedKernel|ReGet|relevantnoise|replacer|Repo|requ|Rese|Retrieve|Rip|Rix|RMA|Roboz|Rogue|Rover|RPT\-HTTP|Rsync|RTG30|.ru\)|ruby|Rufus|Salt|Sample|SAPO|Sauger|savvy|SBIder|SBP|SCAgent|scan|SCEJ\_|Sched|Schizo|Schlong|Schmo|Scout|Scooter|Scorp|ScoutOut|SCrawl|screen|script|SearchExpress|searchhippo|Searchme|searchpreview|searchterms|Second.Street.Research|Security.Kol|Seekbot|Seeker|Sega|Sensis|Sept|Serious|Sezn|Shai|Share|Sharp|Shaz|shell|shelo|Sherl|Shim|Shiretoko|ShopWiki|SickleBot|Simple|Siph|sitecheck|SiteCrawler|SiteSnagger|Site.Sniper|SiteSucker|sitevigil|SiteX|Sleip|Slide|Slurpy.Verifier|Sly|Smag|SmartDownload|Smurf|sna\-|snag|Snake|Snapbot|Snip|Snoop|So\-net|SocSci|sogou|Sohu|solr|sootle|Soso|SpaceBison|Spad|Span|spanner|Speed|Spegla|Sphere|Sphider|spider|SpiderBot|SpiderEngine|SpiderView|Spin|sproose|Spurl|Spyder|Squi|SQ.Webscanner|sqwid|Sqworm|SSM\_Ag|Stack|Stamina|stamp|Stanford|Statbot|State|Steel|Strateg|Stress|Strip|studybot|Style|subot|Suck|Sume|sun4m|Sunrise|SuperBot|SuperBro|Supervi|Surf4Me|SuperHTTP|Surfbot|SurfWalker|Susi|suza|suzu|Sweep|sygol|syncrisis|Systems|Szukacz|Tagger|Tagyu|tAke|Talkro|TALWinHttpClient|tamu|Tandem|Tarantula|tarspider|tBot|TCF|Tcs\/1|TeamSoft|Tecomi|Teleport|Telesoft|Templeton|Tencent|Terrawiz|Test|TexNut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} trivial|Turnitin|The.Intraformant|TheNomad|Thomas|TightTwatBot|Timely|Titan|TMCrawler|TMhtload|toCrawl|Todobr|Tongco|topic|Torrent|Track|translate|Traveler|TREEVIEW|True|Tunnel|turing|Turnitin|TutorGig|TV33\_Mercator|Twat|Tweak|Twice|Twisted.PageGetter|Tygo|ubee|UCmore|UdmSearch|UIowaCrawler|Ultraseek|UMBC|unf|UniversalFeedParser|unknown|UPG1|UtilMind|URLBase|URL.Control|URL\_Spider\_Pro|urldispatcher|URLGetFile|urllib|URLSpiderPro|URLy|User\-Agent|UserAgent|USyd|Vacuum|vagabo|Valet|Valid|Vamp|vayala|VB\_|VCI|VERI\~LI|verif|versus|via|Viewer|virtual|visibilitygap|Visual|vobsub|Void|VoilaBot|voyager|vspider|VSyn|w\:PACBHO60|w0000t|W3C|w3m|w3search|walhello|Walker|Wand|WAOL|WAPT|Watch|Wavefire|wbdbot|Weather|web.by.mail|Web.Data.Extractor|Web.Downloader|Web.Ima|Web.Mole|Web.Sucker|Web2Mal|Web2WAP|WebaltBot|WebAuto|WebBandit|Webbot|WebCapture|WebCat|webcraft\@bea|Webclip|webcollage|WebCollector|WebCopier|WebCopy|WebCor|webcrawl|WebDat|WebDav|webdevil|webdownloader|Webdup|WebEMail|WebEMailExtrac|WebEnhancer|WebFetch|WebGo|WebHook|Webinator|WebInd|webitpr|WebFilter|WebFountain|WebLea|Webmaster|WebmasterWorldForumBot|WebMin|WebMirror|webmole|webpic|WebPin|WebPix|WebReaper|WebRipper|WebRobot|WebSauger|WebSite|Website.eXtractor|Website.Quester|WebSnake|webspider|Webster|WebStripper|websucker|WebTre|WebVac|webwalk|WebWasher|WebWeasel|WebWhacker|WebZIP|Wells|WEP\_S|WEP.Search.00|WeRelateBot|wget|Whack|Whacker|whiz|WhosTalking|Widow|Win67|window.location|Windows.95\;|Windows.95\)|Windows.98\;|Windows.98\)|Winodws|Wildsoft.Surfer|WinHT|winhttp|WinHttpRequest|WinHTTrack|Winnie.Poh|wire|WISEbot|wisenutbot|wish|Wizz|WordP|Works|world|WUMPUS|Wweb|WWWC|WWWOFFLE|WWW\-Collector|WWW.Mechanize|www.ranks.nl|wwwster|^x$|X12R1|x\-Tractor|Xaldon|Xenu|XGET|xirq|Y\!OASIS|Y\!Tunnel|yacy|YaDirectBot|Yahoo\-MMAudVid|YahooSeeker|YahooYSMcm|Yamm|Yand|yang|Yeti|Yoono|yori|Yotta|YTunnel|Zade|zagre|ZBot|Zeal|ZeBot|zerx|Zeus|ZIPCode|Zixy|zmao|Zyborg [NC]
RewriteRule ^(.*)$ - [F,L]
</IfModule>

And, for those of you who enjoy looking at long lists of bad robots, here is the same blacklist of 1211 banned user agents in uncompressed format:

[ Image: uncompressed view of the 1211 blocked user agents ]

I love this game :)

Notes

  • [1] Special thanks to “Mr. M” for graciously sharing his extensive user-agent list and granting permission to integrate them into this version of the blacklist. Thanks M! :)
  • [2] Free iPod Nano plus honorable mention in my next article for the first person to identify my three “imaginary” (i.e., fake) user agents. Good luck! ;)

71 Responses

  1. Lisa says:

    Wow! This is one *huge* list. You could’ve charged people just for viewing this post and I’m sure most of us wouldn’t mind forking out some money just to take a peek at this ;)

    I’ve been getting some CPU Load spikes (no thanks to irresponsible bots & spammers *!%$) on my server for the past 2 weeks…I hope this blacklist can help bring down the load on the server.

    Btw, what do you mean by ‘three “imaginary” user agents’? Googlebot, MSNbot, Yahoo! Slurp?

    Anyways, thanks for this amazing post ! :D

  2. That list is too huge! lol, I tried to find the fake ones but then I looked at the list!

  3. Lisa says:

    Lol, now I get it. There’s 3 *fake* user-agents in the list. Is it…”dumb”, “fuck” & “human”?

  4. Jeff Starr says:

    @Lisa: I hope you don’t mean that I could have charged people to view the blacklist in like a “freakshow” kind of way. Like, “step right up and take a peek at the world’s most hideously long HTAccess Blacklist!” Weird carnival music playing in dark tents and that sort of thing..

    And yes, I meant “fake” more than “imaginary”, but sometimes my mind just poops out after too much writing. I have updated the footnote with this term, btw. And, no, the three fake user agents are not as you suggest ;)

    @Donace: I know! There are almost 4 times as many bad bots in this list as in the previous “Ultimate” blacklist. Plus, if you consider the fact that we are dealing with regular expressions and pattern matching for each of the terms, the number of blocked bots is probably somewhere in the hundreds of thousands or even millions.

  5. @ Lisa
    I get it, but can’t see how you can tell.

    @ Jeff Brilliant list, thanks.

  6. Andrew says:

    hm…

    “addressendeutschland”, “Alek”, and “black.hole”?

    That’s a huge list, though :D

  7. Jeff Starr says:

    @Andrew: Nope, those strings address names of “real” user agents, believe it or not.. :)

  8. B. Moore says:

    That list is insane!!!

    But the threats we deal with are just as insane, eye for an eye right!

    Thank you for taking the time to compile & keep your list current.

    mucho gracias!

  9. I think looking at all of this it harks back to Louis’ earlier idea of a whitelist;

    i.e. if url is not X then redirect to a 404; let’s bring back ye olde simple pages and just integrate .js for ‘fancy’ features.

  10. eezz says:

    Hi, I think this is fantastic. I have one issue, my server gives a 500 error and the log shows :

    RewriteCond: cannot compile regular expression '^$|\\<|\\>|\\'|\\%|\\_iRc|\\_Works|\\@\\$x|\\<\\?|\\$x0e|\\+select\\+|\\+union\\+|1\\,\\1\\,1\\,|2icommerce|3GSE|4all|59\\.64\\.153\\.|88\\.0\\.106\\.| … etc

    Let me know if you need any more info…

  11. eezz says:

    Well, I got it working, needed to fix a few things and break it over two lines, here are the results, let me know if this will still work as I don’t really know htaccess code that well:

    [ Editor’s note: Click here to see this revised version of the user-agent blacklist ]

  12. eezz says:

    Hi,

    I found the issue was 1\,\1\,1\,...

    Needed to be 1\,1\,1\, for it to work.

    Plus some bots are blocked:

    • Googlebot-Image/1.0
    • Mediapartners-Google
    • Feedfetcher-Google
    • Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

    I think these are alright, so I removed:

    aster
    webmaster
    mediapartners
    Googlebot\-Image
    image
    fetch

  13. eezz says:

    Hi again,

    Something else I needed to add…

    Firefox.2\.0 blocks valid Firefox UA eg:
    Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.20) Gecko/20081217 Firefox/2.0.0.20

    Needed to either change to Firefox\.2\.0 or delete it.

    Plus I needed to separate the list into 2 lines, otherwise I get:
    Invalid command 't|UserAgent|USyd|Vacuum|vagabo| … etc.

    The number of characters on the line that allows it to work without error is around 8140.
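
    Editor’s note: the two-line split described above can be sketched as follows. The agent names are placeholders rather than entries from the list; the point is the [OR] flag, which lets a second condition continue where the first leaves off, so that no single line approaches the length at which Apache cut the directive short.

    ```apache
    <IfModule mod_rewrite.c>
    RewriteEngine on
    # First part of the list; [OR] means "or the next condition"
    RewriteCond %{HTTP_USER_AGENT} (AlphaBot|BetaScraper|GammaHarvester) [NC,OR]
    # Last part of the list; the final condition carries no [OR]
    RewriteCond %{HTTP_USER_AGENT} (DeltaRipper|EpsilonSpider) [NC]
    RewriteRule ^(.*)$ - [F,L]
    </IfModule>
    ```

    Each pattern has to end on a complete term. A dangling vertical bar at the end of a line would add an empty alternative, which matches every user agent.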

  14. Jeff Starr says:

    @Donace: That would be nice indeed, but we both know there’s no turning back now! ;P

    @eezz: Thanks for the help fine-tuning the blacklist. I am going to crop out your revised rules and link to them from your comment in an external file.

  15. Deb Phillips says:

    Re: eezz’s Comment #12: Would removing “fetch” (Feed-Fetcher.Google) prevent a subscriber from accessing my blog using Google Reader (or any other feed reader)? I assume it’s included in the list because it facilitates “stealing” website content?

  16. Deb Phillips says:

    I’m a novice at a lot of this coding, so I’m going to ask something that is probably obvious to most of you, but….

    In order to use the revised code linked to in Comment #11, do I simply PASTE IT IN PLACE OF the code that’s between RewriteEngine on and RewriteRule ^(.*)$ - [F,L] in the ORIGINAL blacklist?

    Thanks for letting me ask a very elementary question!

  17. Jeff Starr says:

    @Deb Phillips: Re: Comment #15, “fetch” unintentionally blocks Google Feedfetcher. As eezz points out, if you remove “fetch” Feedfetcher should again have access to your site. But then, so will any of the bad bots that were otherwise blocked by the rule. There is no reason that I know of to block Feedfetcher; it’s generally considered to be one of the “good guys”.

    Re: Comment #16, Yes, exactly. eezz’s revised code would go in the following location:

    <IfModule mod_rewrite.c>
    RewriteEngine on

    # eezz's code goes here

    RewriteRule ^(.*)$ - [F,L]
    </IfModule>

    You could even remove the empty line between eezz’s two directives if desired. Let me know if I may be of any further assistance with this :)

  18. Deb Phillips says:

    Jeff, thanks for your answers to my Comments #15 and #16.

    Follow-up question:
    Is there a way to leave “fetch” in the blacklist but somehow allow access by Google Feedfetcher? (I have a feeling the answer may be “no,” but I wanted to ask.)

    If not, then if I want to allow Google Feedfetcher access, I just have to accept that other bots using that rule will have access, right?

    Thanks again for your help.

  19. Publix says:

    Hello Jeff,

    I have a question. Is the “IfModule mod_rewrite.c” and “IfModule” necessary?

  20. Publix says:

    Hi again,

    I’ve recently implemented this blacklist and I noticed a bot with the user agent “Custom Spider Bot 2.0” has been blocked. The bot is from an advertising company so it’s not possible for me to request them to change their user-agent.

    Is there any way to make an exception for the particular bot? If not, should I remove “|spider|” from the list?

    Another question is if I add “|cool|” to the list, does this mean that all bots that have the “cool” phrase in their user-agent such as “This is a cool bot” or “Cool bot was here” will get blocked?

    Btw, I’m really noob when it comes to htaccess…

    Sincerely,
    Publix

  21. Jeff Starr says:

    @Deb: Correct, by removing the “fetch” string, you are allowing anything that was previously blocked because of that string. There is a trade-off between allowing one or two important bots versus blocking a zillion bad ones. I would remove “fetch” if I were using the list myself. Feedfetcher is too important.

    @Publix: The module-check containers are not required if you know for certain that your server will always have the required modules. Best practice is to leave them in the code, but feel free to remove them if you are sure about your configuration.

    As for the pattern-matching questions, it is as you say. Any instance of the “whatever” character string that is found within the user-agent name will cause that particular user agent to be blocked. Also, this list specifies [NC], which means that the pattern matching is not case-sensitive.
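
    Editor’s note: one possible middle ground for the questions from Deb and Publix, offered as a sketch rather than something tested against the full list, is to exempt specific agents with a negated condition placed ahead of the blacklist. The short patterns below stand in for the real list, and the dots follow the list’s own convention of matching any single character, including a space.

    ```apache
    <IfModule mod_rewrite.c>
    RewriteEngine on
    # Exempt trusted agents: the ! negates the match, and with no [OR] flag
    # this condition must hold in addition to the blacklist match below
    RewriteCond %{HTTP_USER_AGENT} !(Feedfetcher-Google|Custom.Spider.Bot) [NC]
    # Stand-in for the blacklist conditions
    RewriteCond %{HTTP_USER_AGENT} (fetch|spider) [NC]
    RewriteRule ^(.*)$ - [F,L]
    </IfModule>
    ```

    Because user-agent strings are easy to spoof, any bot that claims an exempted name gets through as well.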

  22. Ukr says:

    Thank you for all of your great materials. Does your ultimate user agent blacklist include the following:

    Toata dragostea mea pentru diavola

    This idiot has been hitting all of my sites. Demonic jerk.

    …And by the way, I have no love for the devil, by quoting this user agent’s name.

  23. Jeff Starr says:

    @Ukr: Not currently, but it would be easy enough to block. I would add “diavola” to the proper alphabetical location in the list. And don’t forget to follow the pattern so that there is a vertical bar ( | ) before and after the term. Then, any user agent containing that character string whatsoever will be blocked. I hope that helps!
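
    Editor’s note: as a rough sketch, the new term could either be slotted into the list as described or given a small rule of its own. The standalone form below is an alternative that is not part of the original article; it is shown because it is easier to test in isolation.

    ```apache
    <IfModule mod_rewrite.c>
    RewriteEngine on
    # In-list form, per the advice above: ...|Deweb|Diam|diavola|Digger|...
    # Standalone form: case-insensitive match on the distinctive last word
    RewriteCond %{HTTP_USER_AGENT} diavola [NC]
    RewriteRule ^(.*)$ - [F,L]
    </IfModule>
    ```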

  24. john says:

    I have added the list from http://perishablepress.com/press/wp-content/online/code/user-agent-blacklist_revised.txt to my htaccess. I noticed that the htaccess rule slowed down my site a bit. I would like to know how I can add the list by using these commands:

    SetEnvIf User-Agent "the list of bad bots here" bad-bot

    order allow,deny
    allow from all
    deny from env=bad-bot

    Thank you very much.

  25. Jeff Starr says:

    Hi john,

    Here is the official Apache documentation for the setenvif module:

    http://httpd.apache.org/docs/2.2/mod/mod_setenvif.html

    That is a great place to start. I wish I had more time to research this myself, so I hope that the information contained in the Apache docs will get you going in the right direction. Also, you may be able to learn something about the process by searching for “SetEnvIf User-Agent” on Google.
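
    Editor’s note: a minimal sketch of the approach john describes, assuming mod_setenvif is loaded and using the Apache 2.2 access-control syntax from his snippet. The three agent names are placeholders; the real list would go inside the quotes.

    ```apache
    <IfModule mod_setenvif.c>
    # Case-insensitive match on the User-Agent header sets the bad-bot variable
    SetEnvIfNoCase User-Agent "(EvilScraper|BadHarvester|SpamBot)" bad-bot
    </IfModule>

    # Apache 2.2: allow everyone except requests flagged as bad-bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad-bot

    # Apache 2.4 equivalent of the three lines above:
    # <RequireAll>
    #     Require all granted
    #     Require not env bad-bot
    # </RequireAll>
    ```

    SetEnvIfNoCase plays the role of the [NC] flag in the mod_rewrite version. Whether this is actually faster than the rewrite rules is something to measure rather than assume.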

  26. Ravi says:

    Hello,
    Thanks for the list of user agents.
    I find user agents like this:
    mozilla 4.0
    mozilla 5.0
    etc..

    However, when I check my logs, I find google and many other good bots are with mozialla 5.0

    So, will this block Google and other good visitors?

    Please refine the list.
    Thank you.

  27. Rama says:

    Hi,
    I am not good at htaccess codes.
    Will such a big htaccess file slow down the loading of my site?
    web site loading speed is the greatest concern for me.

    Anyhow, thanks for providing such a great code.
    Thank you.

  28. Jeff Starr says:

    @Ravi: I could have sworn Google was using their GoogleBot crawlers.. do you have any examples of log entries showing Google using “mozialla 5.0”?

    @Rama: I have not tested this list specifically, but have seen much bigger lists in play that don’t seem to have much of an impact on performance. But then again, I’m not going to sit here and tell you that it doesn’t have an effect - the server has to process all of those matches for every valid page request, so probably not advisable for high-volume traffic sites and/or slow servers. One thing you can do to improve performance in general, if you have access to the main server configuration, is to move your rules there and switch off HTAccess lookups with the following directive. Note that it belongs inside a <Directory> block in the server configuration and is not valid within an HTAccess file itself:

    AllowOverride None

    For more info on this method, see my article Stupid htaccess Tricks.
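
    Editor’s note: a minimal sketch of where that directive would sit, assuming access to the main server configuration. The directory path and the two agent names are hypothetical. With overrides disabled, Apache no longer reads HTAccess files under that directory, so the blacklist itself moves into the same block.

    ```apache
    # Goes in httpd.conf or a virtual host file, not in an HTAccess file
    # (AllowOverride is only accepted in <Directory> context)
    <Directory "/var/www/example.com/public_html">
        # Stop Apache from looking for HTAccess files under this path
        AllowOverride None

        <IfModule mod_rewrite.c>
        RewriteEngine on
        # Placeholder pattern; the blacklist conditions would go here instead
        RewriteCond %{HTTP_USER_AGENT} (EvilScraper|BadHarvester) [NC]
        RewriteRule ^(.*)$ - [F,L]
        </IfModule>
    </Directory>
    ```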

  29. Googlebot is indeed using Mozilla (”mozialla” is either a typo or that goth waitress at the Hyde Park Cafe in Austin).

    Googlebot-Mozilla-2.1
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    This is the one I catch following js links.

  30. Jeff Starr says:

    @PhilB WordPress: Interesting, this is the first time I’ve seen it. Has anyone performed a forward-reverse IP lookup to verify identity? Could change the game (at least for me) just a little ;)

  31. Here is the one IP that hit several sites I work with looking for various txt files:

    221.131.61.08

  32. Jeff Starr says:

    I’m not sure that IP is from Google.. I checked via http://www.lookupserver.com/ and it returned no result — something that generally doesn’t happen with legitimate names/IPs. Is that the only IP associated with the Googlebot-Mozilla-2.1 agent?

  33. Bad copy on my part - but it’s actually been documented for at least three years. I thought it was common knowledge.

    See:

    http://ekstreme.com/thingsofsorts/seosem/googlebot-requested-a-css-file

    and

    http://www.google.com/support/forum/p/Webmasters/thread?tid=79b23caa49b801a3&hl=en

    for just two of hundreds of examples. It’s been listed on user-agents.org for a while as well.

  34. Jeff Starr says:

    “common knowledge” - yeah, right.

    In any case, it’s certainly interesting, and I can’t help but wonder if this particular Googlebot isn’t exclusively concerned with locating code snippets for the Google Code search engine (as one of the ekstreme comments also suggests). If so, then blocking it may serve a purpose, especially for anyone who would like to keep their hidden content away from Google. If not, then I suppose it is just as easy to remove the Mozilla patterns from the blacklist. Either way, it is very interesting to hear about this. Thank you for bringing it to my attention.

  35. Randomnesss says:

    You SERIOUSLY tell us to ban curl/libcurl/PyCurl? Get a brain lol

  36. Jeff Starr says:

    Actually, no. I merely provide information. As an informed, intelligent human being, you get to decide for yourself whether or not to use the info or make any changes according to your needs.

  37. RS2IP says:

    Hi,

    First of all I would like to thank you for the biggest user-agent blacklist :)

    Only one downside though, when using this, I can’t verify my site with Google Webmaster Central…it gives a 403 Forbidden error..

  38. Jeff Starr says:

    Hi RS2IP, there are some rules in the list that may be blocking that particular user agent. You could resolve this by determining the user agent that is being blocked (Google Webmaster Central) and then searching the list for anything that might be blocking it. Alternatively, you could use the halving method to diagnose and remove the offending rule(s). Totally doable! :)

  39. BK says:

    I copy and paste the referrers list, save my .htaccess file but as soon as I do I get a 500 error.

    First with referrers, here is the error log

    /bla/bla/public_html/bla1/.htaccess: RewriteCond: bad flag delimiters, referer: http://www.bla1.bla.com/

    When I removed all the lines such as the below line it worked. Anything with https seems to not work, do I need to precede that with
    RewriteCond %{HTTP_REFERER}

    https?:\/\/([^\/]*\.)?skin-trt\.boom\.ru

    SECOND PROBLEM, when I copy and paste the 4G Series: The Ultimate User-Agent Blacklist, Featuring Over 1200 Bad Bots

    I get the below errors

    /bla/bla/public_html/bla1/.htaccess: RewriteCond: cannot compile regular expression '^$|\\<|\\>|\\'|\\%|\\_iRc|\\_Works|\\@\\$x|\\<\\?|\\$x0e|\\+select\\+|\\+union\\+|1\\,\\1\\,1', referer: http://www.bla1.bla.com/contact-us/contacts/webmaster

    and

    /bla/bla/public_html/bla1/.htaccess: RewriteCond: cannot compile regular expression '^$|\\<|\\>|\\'|\\%|\\_iRc|\\_Works|\\@\\$x|\\<\\?|\\$x0e|\\+select\\+|\\+union\\+|1\\,\\1\\,1\\,|2icommerce| and this continues until the end of the list.

    What did I do wrong?

    When I use the 2G compressed list it works with no problems.
    I have not tried the 3G black list.
    Lastly, does the 4G blacklist do everything that 2 & 3G do?

  40. It seems that Google also uses the following user-agent “Google-Site-Verification/1.0”. So the solution is to remove “verif” from the blacklist.

    Another issue I’m encountering is that users using Windows Vista with Firefox 3.5.5 or Internet Explorer 8 are getting 403 Forbidden errors…

    P.S. I just realised that I accidentally posted the above message at the wrong post. Sorry about that..

  41. Jeff Starr says:

    @BK: not sure about the first issue, but the second problem I think has been addressed in a previous comment. The problem is the immense length of the expression, and the solution is to break it into two different lines, which has been graciously done for us by eezz. You can grab the dual-line version here:

    http://perishablepress.com/press/wp-content/online/code/user-agent-blacklist_revised.txt

    And, the 4G does most of what the previous versions do, plus a whole lot more. Many of the rules were changed to optimize the performance, precision, and effectiveness of the blacklist. You can read more about the theory behind the 4G Blacklist for more information.

  42. Jeff Starr says:

    @RapidShare to IP Address: Thanks for the heads up on the verif fix for Google.

    I don’t know what the issue could be with Vista Fx or IE8, but identifying the offending term should be relatively easy using the halving method of identifying problematic code.

  43. zae says:

    Hi,

    Thanks for providing this great list.

    I created a log file and found that “baidu.com” is caught by the term “spider”:

    ---
    HTTP_USER_AGENT: Baiduspider+(+http://www.baidu.com/search/spider.htm)
    ---

    FYI, baidu.com is a legitimate search engine.

    @zae

  44. zae says:

    This user agent has also been blocked, by the keyword “PHP”.

    HTTP_USER_AGENT == Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 FirePHP/0.3

    I think all visitors with the FirePHP add-on installed in their Mozilla browser will be blocked by the keyword “PHP”.

    Thanks,
    zae

  45. zae says:

    YahooMobile-CrawlerEngine has been blocked by the keyword “Link”. Here is the user agent:

    HTTP_USER_AGENT == "Nokia6682/2.0 (3.01.1) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 configuration/CLDC-1.1 UP.Link/6.3.0.0.0 (compatible;YahooSeeker/M1A1-R2D2; http://help.yahoo.com/help/us/ysearch/crawling/crawling-01.html)"

    For information, see http://help.yahoo.com/help/us/ysearch/crawling/crawling-01.html

    Thanks,
    zae

  46. zae says:

    Other keywords that block YahooMobile are “seeker” and “YahooSeeker”.

    If we want to block YahooMobile, we can just keep those keywords. Otherwise, we should delete the keywords Link, seeker, and YahooSeeker.
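
    An alternative to deleting keywords is to exempt the crawler by name ahead of the blacklist. A minimal sketch, assuming the blacklist is a chain of RewriteCond lines ending in a deny rule; the single blacklist line here is abbreviated to the three keywords mentioned above and only stands in for the full list:

    ```apache
    RewriteEngine On
    # Exemption: the rule is skipped when the agent contains "YahooSeeker".
    RewriteCond %{HTTP_USER_AGENT} !YahooSeeker [NC]
    # Stand-in for the blacklist conditions (the real list is much longer).
    RewriteCond %{HTTP_USER_AGENT} (Link|seeker|YahooSeeker) [NC]
    RewriteRule .* - [F,L]
    ```

    Conditions without [OR] are ANDed, so the exemption has to pass before any blacklist match can deny the request. Keep in mind that any bot can claim to be YahooSeeker, so an exemption like this also lets impostors through.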

    Thanks,
    zae

  47. zae says:

    Another report…

    - Keyword “98”: blocks visitors using Windows 98.

    - Keyword “MEGAUPLOAD”: blocks visitors who use the Firefox browser with the “megaupload” add-on installed.

    - Keyword “wire”: blocks some valid browsers, including Safari on Mac.

    - Keyword “agent”: blocks some visitors with Firefox on Linux, e.g. this user agent: “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.3; ips-agent) Gecko/20090824 Fedora/1.0.7-1.1.fc4 Firefox/3.5.3”. Note: that one comes from my friend’s computer, running Fedora Linux.

    I’ll add another report later.

    Thanks,
    zae

  48. david says:

    Thanks for the user agent list. I’m a novice at htaccess code.

    I’m using eezz’s revised code, but it blocks the Opera Mini user agent “Opera/9.50 (J2ME/MIDP; Opera Mini/4.0.9800/209; U; en)” and “BlackBerry9700/5.0.0.321 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/308”.

    How can I unblock these user agents?

  49. multiverse says:

    Hello Jeff,

    I have deployed the revised version of this list, which has had a very positive effect on my bandwidth and site performance. Kudos on an excellent article, and thanks for making it available for us all. I have received a number of reports of people not being allowed onto the site. I was successful at finding one of the problems and fixing it (AtHome is a Norwegian ISP), not a problem so much as an unwanted filter. Alas, there are two lingering issues that I am unable to solve, even after a great deal of analysis (links to test files, log diving, etc.). I want to show you one of the problems, hoping that you can give some advice on how to solve that type of issue as it relates to your filters.

    Access Log Hit
    http://pastebin.com/m7891b746

    This hit results in a redirection to another website, which is what I want it to do. The trouble is that I cannot find in your rewrite conditions what is being caught. Can you give me some clue?

    Many thanks,
    multiverse

  50. Jeff Starr says:

    @multiverse: the easiest way to isolate problem code is to remove half of it, check for proper functionality, and then remove another half, and so on until you discover the culprit. More info on this technique here.
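
    As a sketch of one halving step, assuming the two-line version of the list and using placeholder names rather than real entries: comment out one condition, retest with the blocked visitor, and repeat on whichever half still causes the 403.

    ```apache
    RewriteEngine On
    # Step 1: keep only the first half active. Its [OR] flag is dropped
    # because it is now the last active condition before the rule.
    RewriteCond %{HTTP_USER_AGENT} (examplebot-a|examplebot-b|examplebot-c) [NC]
    # RewriteCond %{HTTP_USER_AGENT} (examplebot-x|examplebot-y|examplebot-z) [NC]
    RewriteRule .* - [F,L]
    ```

    If the visitor is still blocked, the culprit is in the active half; if not, it is in the commented half. The same split can then be applied to the terms inside the remaining alternation. Leaving [OR] on the final active condition is worth avoiding, since it can cause the rule to match more than intended.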

  51. multiverse says:

    @Jeff Thanks for taking the time to reply. That’s obviously an excellent strategy, and I’ll begin using it to troubleshoot this issue. Again, many thanks.

  52. Shane says:

    Thanks Jeff for the excellent list.
    My Joomla site is pumping 30gig of bandwidth/mth, averaging 12k unique visits. Wonder if that’s normal?

    Tried your first list, but got a 500 error. Used eezz’s list, made the other alterations suggested further down in the comments, and it works like a dream.

    I’ll watch bandwidth and see what happens.

  53. Derek says:

    hola.

    Like many said, you saved me hours and many concerns about security.

    Put up a donate button.

    Thanx a million

  54. multiverse says:

    @Jeff @everyone I wanted to follow up on my problem and what I’ve learned about it. After posting my question here, most complaints were from mobile users. The iPhone’s Safari works just fine on my 3GS. However our Android users and Windows CE users were the ones reporting continued issues. I decided that users who were being filtered out were in such a small population of users that it was more important to have the site security and that they would have to do without until we move to the next generation of filtering.

    That’s a terrible strategy, because we expect mobile device users to increase going forward. With that in mind, one of my co-admins suggested we deploy the Tapatalk plugin, which has enabled our mobile users to have access. That too is a bad strategy, which we hope to take care of when we move to the next generation of filtering, though it works as short term kludge.

  55. Jeff Starr says:

    @multiverse: Thanks for the information - much appreciated.

  56. john says:

    I have decided to allow only good bots and disallow or redirect all the others.
    It seems more secure, simpler, and easier to run on slow or shared servers.

    What do you think about this? Can you please share a list of good bots like Google and Yahoo? I think that list will be much shorter than the list of bad bots. Another advantage is that I won’t have to worry about updating the bad list whenever new bad bots appear. Google, Yahoo, Bing, and a few other search engines will be enough to allow for most of the webmasters here.

  57. Jeff Starr says:

    Hi john,

    I would do a good search for some current user-agent whitelists. I have collected a few and built some of my own, but it’s been awhile since doing so. I’m guessing there are better lists available around the Web. Don’t forget about the various user-agents for browsers.

    As for the whitelist approach itself, I think it depends on the site or project. Inevitably, there will be many legit user agents that won’t be on the list and thus denied access to the site. Thus, sites for which traffic is an important metric are better off without a whitelist. Conversely, “non-traffic-dependent” sites could benefit from the preservation of bandwidth and server resources.
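
    A minimal sketch of the whitelist idea, assuming a deliberately tiny and purely illustrative set of names (Googlebot, Slurp, and bingbot for the search engines, plus the generic Mozilla and Opera tokens sent by common browsers); this is not a vetted or current list:

    ```apache
    RewriteEngine On
    # Deny every request whose user agent matches none of the allowed names.
    RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|bingbot|mozilla|opera) [NC]
    RewriteRule .* - [F,L]
    ```

    Since most browsers and many bots send “Mozilla” in their user-agent string, a list this loose blocks relatively little. Tightening it saves more bandwidth but raises the odds of denying legitimate visitors, which is the trade-off described above.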

  58. eremit says:

    @multiverse @Jeff: I discovered yesterday that the Android Browser has been locked out of my site due to the user-agent list.

    The regex part that caused the problem was ‘Build’, as the Android Browser identifies itself with:

    Mozilla/5.0 (Linux; U; Android 2.2; en-us; sdk Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

    WebOS-based mobile phones don’t have the problem (at least my Pre works fine). So probably someone who owns a Windows CE phone should check the logs for the user agent string. I’m pretty sure that it’s again something small like ‘Build’.

  59. eremit says:

    Just a quick note on my former comment. The stated user agent string is of course from the SDK version, but the ‘Build’ problem also applies to regular Android systems:

    Mozilla/5.0 (Linux; U; Android 2.1-update1; en-us; Sprint APA9292KT Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17

    or

    Mozilla/5.0 (Linux; U; Android 2.2; en-gb; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

    Furthermore, I’ve found another user agent that gets blocked:

    Nokia6682/2.0 (3.01.1) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 configuration/CLDC-1.1 UP.Link/6.3.0.0.0 (compatible;YahooSeeker/M1A1-R2D2; http://help.yahoo.com/help/us/ysearch/crawling/crawling-01.html)

    I’m not familiar with the Nokia/SymbianOS stuff, but the YahooSeeker part matches the rule.

  60. Jeff Starr says:

    @eremit: Thank you for sharing this info and helping to improve the blacklist. I’ll address these issues during the next update. Cheers :)

  61. Darrell says:

    I am using the revised version from eezz in comment #11. It is blocking Google Webmasters verification: “Your verification file returns a status of 403 (Forbidden) instead of 200 (OK).” In my server logs I get this: “[error] [client 72.14.194.33] Request exceeded the limit of 10 internal redirects due to probable configuration error. Use ‘LimitInternalRecursion’ to increase the limit if necessary. Use ‘LogLevel debug’ to get a backtrace.” I tried removing these terms, with no luck:
    Google.Wireless.Transcoder
    Googlebot\-Image
    webmasters

  62. Darrell says:

    I have to post again after more checking. When I use nothing in my htaccess but the user-agent blocker, it works. So it’s an issue with my other rewrite rules, etc. False alarm.

  63. Jeff Starr says:

    @Darrell: Thanks for posting the followup :)

  64. Daniel says:

    Does the long list exist as a text file? All I see is a .gif which is of no use.

    I prefer the long version, easier to manage.

    Thanks

  65. Jeff Starr says:

    Hi Daniel, there is code provided in the article that you should be able to copy and paste into any file you wish, text or otherwise. It’s located just above the .gif image you mention.

  66. Johar says:

    Just wanted to learn about this; there are many bad bots coming to my site. Thanks for this information :)

  67. wva says:

    Wow, this is a huge list! I did not know that there are so many bad user agents. Thanks for sharing though :P

  68. Vini says:

    Is there a way for you to describe how to load this list in an NGINX server?

  69. Sjaak says:

    I use your 4G list on my website. Accessing it with the standard mobile web browser on a Samsung Ace gives a 403 Forbidden. If I use Opera Mini on the Samsung, there are no problems. Do you have any idea what is causing the blocking with the standard browser?

  70. Mike says:

    hi,

    After applying the user-agent blacklist in my htaccess, it triggered an error in Facebook Comments. The error is:

    Warning: http://www.mysite.com/my-post/ is unreachable.

    What do I need to remove in the list in order to still allow facebook comments?

  71. TJ says:

    I tried using the list as posted and even the revised list, but I get a 500 internal server error.

    Also, I want some bots, like Google and the wireless versions of Yahoo, Bing, MSN, and the like, to be able to reach the site. Does anyone have a good list for me to try? Our resources are under heavy use thanks to rogue bots and spiders.
