2010 User-Agent Blacklist

Posted on August 9, 2010 in Websites by

The 2010 User-Agent Blacklist blocks hundreds of bad bots while ensuring open access for the major search engines: Google, Bing, Ask, Yahoo, et al. Blocking bad user-agents is an effective addition to any security strategy. It works like this: your site is getting hammered by rogue bots that waste valuable server resources and bandwidth. So you grab a copy of the 2010 UA Blacklist from Perishable Press, include it in your site’s root .htaccess file, and enjoy a more secure and better performing website. It’s that easy.

Proven Security

The 2010 UA Blacklist has been carefully constructed based on rigorous server-log analyses. Obsessive daily log monitoring reveals bad bots scanning for exploits, spamming resources, and wasting bandwidth. As malicious behavior is analyzed, evil bots are identified and added to the UA Blacklist. Blocked user-agents are denied access to your site, increasing efficiency and providing safety for your visitors.

Better Performance, Better SEO

Search engines such as Google are placing more weight on speedy, fast-loading websites. If your site is plagued with resource-devouring, bandwidth-wasting bots, its performance is probably not as good as it should be. Even if your site looks fine on the surface, without proper protection bad bots can gobble your bandwidth and leech your server resources. A single malicious bot can make hundreds or thousands of requests in a very short period of time while scanning and probing for vulnerabilities. If Google visits while bad bots are hitting your site, your site’s SEO could suffer. Fortunately, the 2010 UA Blacklist protects your site against hundreds of nefarious bots, thereby fostering maximum performance for the search engines.

2010 User-Agent Blacklist

Here it is, presented as two sets of HTAccess directives:

RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost) [NC]
RewriteCond %{HTTP_USER_AGENT} .*(Firs|exac|Cloak|Detect|uchoo|beaut|ASPSeek|swish|ICS\)|MSIE\ 6\.0\;\ Windows\ NT\;\ DigExt\)|pt\-BR\;\ rv\:1\.9\.0\.3\)\ Firefox\/3\.0|pt\-BR\;\ rv\:1\.9\.0\.18\)\ Firefox\/3\.0|\!susie|\$x0e|\%0a|\%0d|\@\$x|\_irc|\_works|\+select\+|\+union\+|\<\?|1\,\1\,1\,|3gse|4all|4anything|5\.1\;\ xv6875\)|59\.64\.153\.|85\.17\.|88\.0\.106\.|98|a\_browser|a1\ site|abac|abach|abby|aberja|abilon|abont|abot|accept|access|accoo|accoon|aceftp|acme|active|address|adopt|adress|advisor|agent|ahead|aihit|aipbot|alarm|albert|alek|alexa\ toolbar\;\ \(r1\ 1\.5\)|alltop|alma|alot|alpha|america\ online\ browser\ 1\.1|amfi|amfibi|anal|andit|anon|ansearch|answer|answerbus|answerchase|antivirx|apollo|appie|arach|archive|arian|aboutoil|asps|aster|atari|atlocal|atom|atrax|atrop|attrib|autoh|autohot|av\ fetch|avsearch|axod|axon|baboom|baby|back|baid|bali|bandit|barry|basichttp|batch|bdfetch|beat|become|bee|beij|betabot|biglotron|bilgi|bison|bitacle|bitly|blaiz|blitz|blogl|blogscope|blogzice|bloob|blow|bord|boi|bond|boris|bost|bot\.ara|botje|botw|bpimage|brand|brok|broth|browseabit|browsex|bruin|bsalsa|bsdseek|built|bulls|bumble|bunny|busca|busi|buy|bwh3|cafek|cafi|camel|cand|captu|casper|catch|ccbot|ccubee|cd34|ceg|cfnetwork|cgichk|cha0s|chang|chaos|char|char\(|chase\ x|check\_http|checker|checkonly|chek|chill|chttpclient|cipinet|cisco|cita|citeseer|clam|claria|claw|clush|coast|code\.com|cogent|coldfusion|coll|collect|comb|combine|commentreader|common|compan|compatible\-|conc|conduc|contact|control|contype|conv|cool|copi|copy|coral|corn|cosmos|costa|cowbot|cr4nk|craft|cralwer|crank|crap|crawler0|crazy|cres|cs\-cz|cshttp|cuill|CURI|curl|curry|custo|cute|cyber|cz3|czx|daily|dalvik|daobot|dark|darwin|data|daten|dcbot|dcs|dds\ explorer|deep|deps|detect|dex|diam|diibot|dillo|ding|disc|disp|ditto|dlc|doco|dotbot|drag|drec|dsdl|dsok|dts|duck|dumb|eag|earn|earthcom|easydl|ebin|echo|edco|egoto|elnsb5|email|emer|empas|encyclo|enfi|enhan|enterprise\_search|envolk|erck|erocr|eventax|evere|evil|ewh|exploit|expre|extra|eyen|fang|fast|fastbug|faxo|fdse|feed24|feeddisc|feedhub|fetch|filan|fileboo|fimap|find|firebat|firedownload\/1\.2pre\ firefox\/3\.6|firefox\/0|firefox\/1|firefox\/2|firs|flam|flash|flexum|flip|fly|focus|fooky|forum|forv|fost|foto|foun|fount|foxy\/1\;|free|friend|frontpage|fuck|fuer|futile|fyber|gais|galbot|gbpl|gecko\/2001|gecko\/2002|gecko\/2006|gecko\/2009042316|gener|geni|geo|geona|geth|getr|getw|ggl|gira|gluc|gnome|go\!zilla|goforit|goldfire|gonzo|google\ wireless|googlebot\-image|gosearch|got\-it|gozilla|grab|graf|greg|grub|grup|gsa\-cra|gsearch|gt\:\:www|guidebot|guruji|gyps|haha|hailo|harv|hash|hatena|hax|head|helm|herit|heritrix|hgre|hippo|hloader|hmse|hmview|holm|holy|hotbar\ 4\.4\.5\.0|hpprint|httpclient|httpconnect|httplib|human|huron|hverify|hybrid|hyper|iaskspi|ibm\ evv|iccra|ichiro|icopy|ida|ie\/5\.0|ieauto|iempt|iexplore\.exe|ilium|ilse|iltrov|indexer|indy|ineturl|infonav|innerpr|inspect|insuran|intellig|interget|internet\_explorer|internet\x|intraf|ip2|ipsel|irlbot|isc\_sys|isilo|isrccrawler|isspi|jady|jaka|jam|jenn|jet|jiro|jobo|joc|jupit|just|jyx|jyxo|kash|kazo|kbee|kenjin|kernel|keywo|kfsw|kkma|kmc|know|kosmix|krae|krug|ksibot|ktxn|kum|labs|lanshan|lapo|larbin|leech|lets|lexi|lexxe|libby|libcrawl|libcurl|libfetch|libweb|libwww|light|linc|lingue|linkcheck|linklint|linkman|lint|list|litefeeds|livedoor|livejournal|liveup|lmq|locu|london|lone|loop|lork|lth\_|lwp|mac\_f|magi|magp|mail\.ru|main|majest|mam|mama|mana|marketwire|masc|mass|mata|mvi|mcbot|mecha|mechanize|mediapartners|metadata|metalogger|metaspin|metauri|mete|mib\/2\.2|microsoft\.url|microsoft\_internet\_explorer|mido|miggi|miix|mindjet|mindman|mips|mira|mire|miss|mist|mizz|mj12|mlbot|mlm|mnog|moge|moje|mooz|more|mouse|mozdex) [NC]
RewriteRule ^.*$ - [G]

RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost) [NC]
RewriteCond %{HTTP_USER_AGENT} .*(Windows\ NT\ 6\.1\;\ tr\;\ rv\:1\.9\.2\.6\)|mozilla\/0|mozilla\/1|mozilla\/2|mozilla\/3|mozilla\/4\.61\ \[en\]|mozilla\/firefox|mpf|msie\ 1|msie\ 2|msie\ 3|msie\ 4|msie\ 5|msie\ 6\.0\-|msie\ 6\.0b|msie\ 7\.0a1\;|msie\ 7\.0b\;|msie6xpv1|msiecrawler|msnbot\-media|msnbot\-products|msnptc|msproxy|msrbot|musc|mvac|mwm|my\_age|myapp|mydog|myeng|myie2|mysearch|myurl|nag|name|naver|navr|near|netants|netcach|netcrawl|netfront|netinfo|netmech|netsp|netx|netz|neural|neut|newsbreak|newsgatorinbox|newsrob|newt|next|ng\-s|ng\/2|nice|nikto|nimb|ninja|ninte|nog|noko|nomad|norb|note|npbot|nuse|nutch|nutex|nwsp|obje|ocel|octo|odi3|oegp|offby|offline|omea|omg|omhttp|onfo|onyx|openf|openssl|openu|opera\ 2|opera\ 3|opera\ 4|opera\ 5|opera\ 6|opera\ 7|orac|orbit|oreg|osis|our|outf|owl|p3p\_|page2rss|pagefet|pansci|parser|patw|pavu|pb2pb|pcbrow|pear|peer|pepe|perfect|perl|petit|phoenix\/0\.|php|phras|picalo|piff|pig|pingd|pipe|pirs|plag|planet|plant|platform|playstation|plesk|pluck|plukkie|poe\-com|poirot|pomp|post|postrank|powerset|preload|press|privoxy|probe|program\_shareware|protect|protocol|prowl|proxie|proxy|psbot|pubsub|puf|pulse|punit|purebot|purity|pyq|pyth|query|quest|qweer|radian|rambler|ramp|rapid|rawdog|rawgrunt|reap|reeder|refresh|reget|relevare|repo|requ|request|rese|retrieve|rip|rix|rma|roboz|rocket|rogue|rpt\-http|rsscache|ruby|ruff|rufus|rv\:0\.9\.7\)|salt|sample|sauger|savvy|sbcyds|sbider|sblog|sbp|scagent|scanner|scej\_|sched|schizo|schlong|schmo|scorp|scott|scout|scrawl|screen|screenshot|script|seamonkey\/1\.5a|search17|searchbot|searchme|sega|semto|sensis|seop|seopro|sept|sezn|seznam|share|sharp|shaz|shell|shelo|sherl|shim|shopwiki|silurian|simple|simplepie|siph|sitekiosk|sitescan|sitevigil|sitex|skam|skimp|sledink|sleip|slide|sly|smag|smurf|snag|snapbot|snapshot|snif|snip|snoop|sock|socsci|sogou|sohu|solr|some|soso|spad|span|spbot|speed|sphere|spin|sproose|spurl|sputnik|spyder|squi|sqwid|sqworm|ssm\_ag|stack|stamp|statbot|state|steel|stilo|strateg|stress|strip|style|subot|such|suck|sume|sunos\ 5\.7|sunrise|superbot|superbro|supervi|surf4me|surfbot|survey|susi|suza|suzu|sweep|sygol|synapse|sync2it|systems|szukacz|tagger|tagoo|tagyu|take|talkro|tamu|tandem|tarantula|tbot|tcf|tcs\/1|teamsoft|tecomi|teesoft|teleport|telesoft|tencent|terrawiz|test|texnut|thomas|tiehttp|timebot|timely|tipp|tiscali|titan|tmcrawler|tmhtload|tocrawl|todobr|tongco|toolbar\;\ \(r1|topic|topyx|torrent|track|translate|traveler|treeview|tricus|trivia|trivial|true|tunnel|turing|turnitin|tutorgig|twat|tweak|twice|tygo|ubee|ultraseek|unavail|unf|universal|unknown|upg1|uptime|urlbase|urllib|urly|user\-agent\:|useragent|usyd|vagabo|valet|vamp|vci|veri\~li|verif|versus|via|virtual|visual|void|voyager|vsyn|w0000t|w3search|walhello|walker|wand|waol|watch|wavefire|wbdbot|weather|web\.ima|web2mal|webarchive|webbot|webcat|webcor|webcorp|webcrawl|webdat|webdup|webgo|webind|webis|webitpr|weblea|webmin|webmoney|webp|webql|webrobot|webster|websurf|webtre|webvac|webzip|wells|wep\_s|wget|whiz|widow|win67|windows\-rss|windows\ 2000|windows\ 3|windows\ 95|windows\ 98|windows\ ce|windows\ me|winht|winodws|wish|wizz|wordp|worio|works|world|worth|wwwc|wwwo|wwwster|xaldon|xbot|xenu|xirq|y\!tunnel|yacy|yahoo\-mmaudvid|yahooseeker|yahooysmcm|yamm|yand|yandex|yang|yoono|yori|yotta|yplus\ |ytunnel|zade|zagre|zeal|zebot|zerx|zeus|zhuaxia|zipcode|zixy|zmao) [NC]
RewriteRule ^.*$ - [G]


To implement the UA Blacklist, simply paste it into your site’s root .htaccess file (or even better, the Apache configuration file). Upload, test, and stay current with updates and news.
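As a rough sketch of where the directives sit, the excerpt below assumes mod_rewrite is available and uses a deliberately shortened three-entry pattern as a stand-in for the full list. `RewriteEngine On` must appear once before any rewrite rules and is not part of the blacklist listing itself; the `<IfModule>` wrapper is an optional safeguard, not something the blacklist requires.

```apache
# Root .htaccess (illustrative placement only, not the full blacklist)
<IfModule mod_rewrite.c>
    # Needed once before any RewriteCond/RewriteRule in this file
    RewriteEngine On

    # Skip requests whose Host header is the loopback address or "localhost"
    RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost) [NC]
    # Shortened stand-in for the full user-agent pattern (three entries taken from the list)
    RewriteCond %{HTTP_USER_AGENT} .*(libwww|nikto|webzip) [NC]
    # Answer matching requests with "410 Gone"; no substitution is performed
    RewriteRule ^.*$ - [G]
</IfModule>
```

The two conditions are ANDed, so a request is refused only when it is not addressed to the local host and its user-agent matches the pattern.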

Important Note

The UA Blacklist uses hundreds of regular expressions to block bad bots based on their user-agent. Each of these regular expressions can match many different user-agents. Care has been taken to ensure that only bad bots are blocked, but false positives are inevitable. If you know of a user-agent that should be removed from the list, please let me know. I will do my best to update things asap.

Bottom line: Only use this code if you know what you are doing. It’s not a “fix-it-and-forget” situation, especially for production sites. It’s more like a “fix-it-and-keep-an-eye-on-it” kind of thing, meant for those who understand how it works. As mentioned in the comments, the 2010 User-Agent Blacklist is a work in progress. Please use the UA Blacklist with caution and at your own risk.
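Because each fragment is matched anywhere in the user-agent string, a short entry can catch clients it was never aimed at; the `ida` entry, for instance, also matches inside `W3C_Validator`. One possible way to handle a known false positive without editing the long pattern is to add a negated condition ahead of it. The sketch below is only an assumption about how that might look: the exempted name is the user-agent reported in the comments, and the shortened pattern again stands in for the full list.

```apache
# Conditions are ANDed by default: not local, not exempted, and on the blacklist
RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost) [NC]
# Exempt a client you trust (example value; substitute what your own logs show)
RewriteCond %{HTTP_USER_AGENT} !W3C_Validator [NC]
# Shortened stand-in for the full pattern; "ida" alone would match "W3C_Validator"
RewriteCond %{HTTP_USER_AGENT} .*(ida|libwww|nikto) [NC]
RewriteRule ^.*$ - [G]
```

With both blocks of the blacklist in use, the exemption line would need to appear in each block, since every `RewriteRule` evaluates only the conditions directly above it.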

So much more..

For those new to Perishable Press, please check out some of my other security resources:

Security is an important part of what I do around here, so please chime in with any suggestions, ideas, and comments. Thank you for visiting Perishable Press.


105 Responses

  1. Crazyb says:

    I am not that good at SEO. My question is… do small search engines help at all?

  2. Jeff Starr says:

    It’s all relative.. for very low-traffic sites, they may add a few hits, but for anything larger, they do more harm than good. Generally speaking.

  3. Crazyb says:

    When I add the code, the website stops working; I get error 500. Any ideas?

  4. Crazyb says:

    Cool, thanks Jeff. I have a fairly new site and I realise that a lot of dubious blogs are stealing my content and linking back to my site. I wonder if this is good or not.

    Thanks for your blogs, I’m a fan.

  5. Jeff Starr says:

    Let me look into it.. something may have been munged in the process of putting the article together. I’ll report back here as soon as possible.

  6. Jeff Starr says:

    Got it! The first wildcard dot (“.”) was missing from the first rule. I have updated the article and text file so they should be working now.

  7. Eric Curtis says:

    I also get a 500 error. Your IP list works perfectly when I paste that in.

    Thanks.

  8. René says:

    Wouldn’t it be better to whitelist the good ones?

  9. For those of us that like other servers, do you have a source list of user-agents?

  10. Jeff Starr says:

    @Eric: There was an update to the UA Blacklist almost immediately after posting. Have you tried the most recent version?

    @René: That is also an option, but there are many “good” bots that would need to be included. Of course, some people prefer to whitelist only the major search engines (Google et al).

    @Timothy Warren: Not sure what you mean by “source list”.. The user-agents blocked by this blacklist are matched using regular expressions, so there are many more bots that are blocked than there are entries in the list.

  11. Eric Curtis says:

    Working now, thank you!

  12. RS says:

    Seeing a weird issue where I’m getting “Request Exceeded the limit of 10 internal redirects” when I’ve put this code in my httpd.conf.

    Any thoughts? I should dig into it more.

  13. Ade H says:

    This new version looks better than ever, Jeff. Your work here is frequently brilliant and always appreciated.

  14. Eric Curtis says:

    These rules prevent the CSS validator from working:

    http://jigsaw.w3.org/css-validator/

    Taking them out allowed it to work again

  15. Jeff Starr says:

    @RS: Some hosts restrict what you can do with htaccess files. It sounds like they may be limiting the number and/or length of rewrite conditions. It could also be a placement issue. Consult your host for more information.

    @Ade: Thank you kindly :)

    @Eric: I’ll check it out.. what is the UA used for the validator?

  16. RS says:

    I’m the host… so, I guess I can talk to myself. =P

    It sounds like it’s doing a redirect loop. Guess I’ll have to look into it more. But I’m guessing upping the rewrite limit will just make it loop that much more.

    This is with apache 2.2. =/

  17. Jeff Starr says:

    @RS: And the list is located in the root htaccess file? Perhaps the Apache rewrite log will help sort it out..

  18. RS says:

    I put it in my vhost file… but I’ll try htaccess.

    and yeah, I’ll turn on the rewrite log and take a look.

    Will post back.

  19. Thanks a lot Jeff for this! I feel safer now because I’m getting a lot of spam and the load on the server goes up.

  20. Dean says:

    With the line starting with:

    RewriteCond

    your previous blacklist had [NC,OR] at the end, it is now [NC].

    What does OR mean and why is it not being used now?

    Really cool blacklist :D :D Nice work

  21. Julian says:

    Sorry, I am a beginner with the “spam” problem.

    which one is more effective?

    IP blocking or UA blocking?

  22. Jeff Starr says:

    @Dean: The [OR] means just that: “or”. A RewriteCond flagged with [OR] is combined with the next condition so that the RewriteRule fires if either one matches, instead of requiring both. I’ll check out the previous blacklist and see what’s up, but for our purposes here, no [OR] flags are required.

    @Julian: It depends on how you are using either method. It’s a lot easier to fake a user-agent than it is an IP address, so I usually reserve IP-blocking for specific cases and user-agent blocking for known bad bots. If you need to block an entire country, blocking via IP is the way to go. If you want to keep the evil “diavol” bot away, blocking the user-agent is the way to go. Great question!

  23. Adrian says:

    Jeff, do you know the difference between an evil bot and the regular check for feed updates by an RSS reader?

    It seems no, or your “carefully constructed” blacklist would not include liferea…

  24. Jeff Starr says:

    @Adrian: All I know is what my server logs tell me. It’s impossible to research every malicious request, so if a bot is acting like a twerp, it’s added to the list.

    Even so, I am more than happy to edit the blacklist to account for legitimate bots that may otherwise look suspicious. Just provide a reference – no need to be insulting.

  25. Louis says:

    Hey, Jeff, I just wanted to point out something that happened on my site (running on WP) that seems to be triggered by your user agent blacklist.

    I copied the blacklist into my .htaccess file yesterday, then last night I scheduled a post to be published at around 11am this morning. The post never got published and it was listed in my dashboard as “Missed”. This is the first time I’ve ever had a post delayed or missed when using WP’s scheduling.

    So, I ran a few tests on my local WP install, and each time I had the user agent blacklist in my .htaccess file, I could not get a scheduled post to go live using the automated system. If I remove the blacklist, the scheduling works fine.

    This link has some more info on the scheduling issue, but I just thought I’d let you know about it to see if maybe you can figure out why your blacklist seems to be causing this problem.

    Thanks!

  26. julian384 says:

    Hi,

    Thanks for the list, really great help. Although I’m having difficulty with it. I’m trying to use it on a site I have hosted (with streamline.net) but when I add this or your IP blacklist to the htaccess I got a 500 error.

    The logs give the error “…..htaccess: RewriteCond: cannot compile regular expression ‘.*(Firs|exac|Cl…” and similar with the IP blacklist.

    Is this a host problem that I cannot solve (unfortunately budget is limited to shared hosting with a budget host).

    Thanks again,
    Julian.

  27. julian384 says:

    @myself

    To follow up, just tried only using each half of your rules. Works fine with “RewriteCond %{HTTP_USER_AGENT} .*(Windows\ NT\…” but not the first rule.

    Will try more tomorrow and share anything I find.

    Thanks,
    Julian

  28. Adrian says:

    Jeff, are you seriously saying you didn’t manually check your list for false positives before telling people to use it?

    Or why did you include strings like “firefox\/3\.0”, “w3m” or “windows\ 98” in your list?

    Looking at your list, it seems to be full of false positives, BLOCKING LEGITIMATE USERS.

  29. Jeff Starr says:

    @Adrian: lol - not sure how “seriously” I’m saying anything. I simply do my best to analyze, test, and research things as much as possible. As a full-time dad working 2.5 full-time jobs to keep up, my “blacklist” time is limited. I simply enjoy the process and try to share as much information as possible.

    Blacklists – especially user-agent blacklists – are a work in progress and are never perfect. The idea is to help improve the code by offering constructive input.

  30. Steve says:

    @Adrian . . “Looking at your list, it seems to be full of false positives, BLOCKING LEGITIMATE USERS”

    Maybe you can help us out. So you’re saying an average person surfing the web, with little technical expertise other than surfing favorite pages, would be blocked from a site using this list? Nothing personal, just trying to figure out what you mean by legitimate users.

  31. Jeff Starr says:

    Some good points here in the comments have resulted in a new “Important Note” section in the article. It basically informs people that false positives are inevitable and only use the list if you know what you’re doing. Thanks to Adrian and Steve for the perspective.

  32. Eric Curtis says:

    I just wanted to share that I read your blog regularly and enjoy it quite a bit. Some of these comments lack the tact / gratefulness for the quality writing and thoughts that you share with us.

    I hope that some of this feedback did not dampen your enthusiasm for blogging here. Cheers.

  33. Crazyb says:

    @ Eric Curtis, you are def right, i call them haters. They can’t think for themselves, they only know how to criticise.

  34. Adrian says:

    Steve, an average person using a device running Windows CE will be blocked from surfing his favorite pages on a webserver using these .htaccess directives due to the “windows\ ce” entry.

    The same goes for anyone using Firefox 3.0 due to the “firefox\/3\.0” entry.

    I do not care what other people write about me; the fact is that using these .htaccess directives as recommended in this blog post is a bad thing.

    My constructive input is that each entry in such a list has to be checked manually before using it or even publicly recommending it to other people.

  35. Steve says:

    @Adrian that’s cool.

    Firefox 3 was released on June 17, 2008. I certainly hope average users have upgraded by now. Duh!

    This release is only a couple days old so it will all get sorted out. There is something that appeals to me about blocking hundreds of bad bots. And any temporary inconvenience for a lone surfer here and there won’t be major. If someone is really into a web site anyway they would contact administration and say “hey I got blocked from the site whats up?”

  36. Adrian says:

    Steve, people are copying your current list, and might not update it.

    And people will not necessarily contact the administrator of a site. They might just give up. Or write a post in their blog that Google Reader is working with some feed but buggy Liferea isn’t. Or complain to us Liferea developers about some non-working feed. And then I might be the one spending time on debugging why the feed works with Google Reader (not blocked) but not with Liferea (blocked).

    When doing blacklisting each false positive is one too many. And there are *many* in your list.

    Firefox 3.0 still has a market share around 2%, so your blacklisting blocks 2% of all site visitors/customers.

    On the other hand, e.g. w3m is a fine albeit exotic browser with a userbase so small that I really wonder how it made it into your blacklist - and an administrator might not get user reports for that false positive for months.

  37. RS says:

    Adrian: Here’s an idea… Why don’t you go through the blacklist and modify it how you see fit, such that it minimizes/eliminates FPs, and then post it back here. So, you actually produce something constructive rather than telling Steve, who’s already said he’s a bit overworked, that he should fix it himself.

    Help out… don’t just point out the problems, help fix them. If your blacklist works better, people will use it. That’s how this works. Steve offered up his list in the hopes it will help someone, not so he can be told it sucks.

  38. RS says:

    Doh, I said Steve, I meant Jeff. Ugh… it’s only noon, and it already feels like a long day.

  39. Adrian says:

    Doh, I had the same Jeff/Steve mixup. Sorry!

  40. Jeff Starr says:

    Update: The following items have been removed from the UA Blacklist:

    • w3c
    • w3m
    • lifearea
    • firefox/3.0
    • firefox/3.0.10

    Here are the remaining Firefox-related user-agent strings that appear in the list (with escape characters removed):

    • firefox/0
    • firefox/1
    • firefox/2
    • mozilla/firefox
    • pt-BR; rv:1.9.0.3) Firefox/3.0
    • pt-BR; rv:1.9.0.18) Firefox/3.0
    • firedownload/1.2pre firefox/3.6

    The last four of these aren’t legit, and the first three block very old versions, which according to my stats is a very small percentage of visitors. We’re blocking many more bad bots with these than actual browsers.

    Some notes:

    • @Eric Curtis: The removal of the “w3c” should resolve any W3C Validator issues.
    • The Windows-related user-agents remain in the list until I can find an authoritative user-agent reference for Windows stuff.

    That’s it for now.. keep any suggestions coming, and I’ll do my best to keep things current. Thanks :)

  41. RS says:

    Thanks Jeff!

    The only one I think I might modify would be to add a RewriteCond that basically says:

    RewriteCond %{HTTP_HOST} !^127\.0\.0\.0

    Between the user strings, and the actual block rule. A fair number of people will use some php, curl, wget, etc on their local machine to trigger some event on their site, and the useragent list you have blocks many of these (in my case, I use curl to dump some data into a MySQL DB every minute that’s run by cron).

    Obviously YMMV, but it wouldn’t be a bad idea. Might also need one that says:

    RewriteCond %{HTTP_HOST} !^localhost

    But I’m pretty sure you can combine those. =)

  42. Jeff Starr says:

    @Louis: Hey just saw your comment – somehow missed it initially.

    I’m not sure why a user-agent blacklist would stop WordPress from executing PHP on the server. If there is some sort of WP UA string involved in the process, we could remove the regex match from the list.

    I’ll keep my eyes on it. Thanks for the feedback.

  43. RS says:

    Jeff,

    This may be the exact thing I just wrote about. I think wordpress’s cron script may use libcurl to trigger the publish… but I could be wrong, since I don’t think I’ve ever scheduled a post for the future. =/

  44. Jeff Starr says:

    @RS: Great timing – adding that condition is a great idea, and may indeed fix the issue with WordPress scheduling mentioned in previous comment. If so, sweet. :)

    I’ll be updating the list again as soon as I can think thru using an [OR] flag for the first condition in either list.

  45. RS says:

    Actually, yeah, good point.

    Might be best to do something like:

    RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost)
    RewriteCond %{HTTP_USER_AGENT} .....your_useragent_list...
    RewriteRule ^.*$ - [G]

    That way, the easy rule (is this coming from someone other than localhost) is done first, then the hard rule is done second.

    I don’t think an [OR] is needed. This would be an “AND”… so, NOT localhost, AND one of these useragents. If we use an [OR], it would still trigger if the localhost was connecting with curl.

    Or am I thinking about this wrong? Again, this day has been really long already.

  46. Adrian says:

    RS, I’d say the way the list gets constructed is wrong. I would say you cannot create a blacklist this way in an automated way and then try to find all false positives.

    Jeff still claims the list was “carefully constructed based on rigorous server-log analyses” although the list is in reality full of false positives. I’d better not say publicly what I think about that.

    Another interesting question is whether people using “evil bots” really use identifiable user agent strings as the blog claims. If I were to write an evil bot I’d use the user agent string of the latest Firefox, and all this user agent blocking wouldn’t affect my evil bot. If Jeff has data on what percentage of *all* accesses this blacklist actually blocks, that would be interesting information.

  47. RS says:

    Sounds like a job for Splunk! =)

  48. Jeff Starr says:

    Yes, definitely no [OR] flags! – otherwise the list would block everything that isn’t localhost. In any case, the list has been updated with these helpful conditions. Thanks to RS for the suggestion! :)

  49. Paris Vega says:

    Is this included in the Block Bad Queries wordpress plugin?

  50. DG says:

    Jeff,

    Adding the above rules to the 4G list is now blocking “w3.org”, and probably others too. Haven’t checked other parts of WordPress. Will report later.

    Can you advise to unblock w3.org.

  51. DG says:

    Jeff,

    Adding the above rules to the existing (working) 4G list is now blocking “w3.org” with a “410” error code, and probably others too. Haven’t checked other parts of WordPress. Will report later.

    In the meantime, can you advise to unblock w3.org?

  52. Jeff Starr says:

    @Paris Vega: No, this is an entirely different set of rules. :)

    @DG: Is w3.org the name of a user-agent? This blacklist only blocks certain UA strings and I’m not seeing w3.org listed anywhere..

  53. Daniel15 says:

    Are you sure about this blacklist’s ability to speed up a website? With a blacklist like this, every single request has to be checked against the blacklist. The checking will reduce performance as every user agent has to be checked against a regular expression.

    Is the gain from blocking bad bots greater than the loss from the checks?

  54. DG says:

    @DG: Is w3.org the name of a user-agent? This blacklist only blocks certain UA strings and I’m not seeing w3.org listed anywhere..

    Jeff,

    I think, W3.org’s default user is “W3C_Validator/xx.xxxx” - not exactly sure.

  55. Jeff Starr says:

    @Daniel15: Good point. There are many factors that determine performance, including server resources, HTTP requests, and so on. When a site is deluged with spammers and crackers looking for exploits, things can really slow down. In such cases, the blacklist will help by eliminating the waste and opening up server resources for legitimate traffic. Conversely, running the blacklist on a site that doesn’t have a lot of bad traffic may reduce performance, if only slightly. Great question, and one of the reasons why the “only-use-if-you-know-what-you-are-doing” disclaimer is included in the article.

    @DG: It looks like the “valid” regex pattern was matching against the W3C Validation service. This has been corrected for both the in-post blacklist and the text-file version. Thanks for the help! :)

  56. DG says:

    Jeff,

    No, it didn’t work; it looks like the “valid” regex pattern is innocent. w3.org is still generating “410” on both the HTML and CSS validation services. It’s something else. Anyway, I’ve just added the rules to the existing “4G Blacklist”; let me check the other parts also.

    Will update you later.

    By the way, Jeff, your present theme and the color combination are superb. One of my favorite colors.

  57. crazyb says:

    Trying this today….

  58. Steve says:

    It has improved performance. Well, how ya like that!

  59. Jake Noble says:

    Hi, is there a list of UA that I can set my browser to so I can test this?

  60. Baber says:

    Can anyone get me this for Nginx?

  61. Micah says:

    On multiple Wordpress installations, after I have updated my .htaccess file with this blacklist code, I am not able to use the Flash uploader for the media library. It gives me an error that says “http error” in red letters. Once I removed this code from my .htaccess file, I didn’t have a problem anymore.

  62. Steve says:

    Problem:
    Hmm, I hit a problem by adding the 2010 UA Blacklist to htaccess - web site monitor basicstate.com tells website down. Tried html validation at w3c and it tells me error 410! so something is wrong.

    Removed the first/upper part of the code leaving..in place the lower part:

    RewriteCond %{HTTP_HOST} !^(127\.0\.0\.0|localhost) [NC]
    RewriteCond %{HTTP_USER_AGENT} .*(Windows\ …..

    and all was well again. ideas?

  63. Steve says:

    Correction!
    Had to remove all blacklist code to have the monitors pingdom/basicstate showing green again! Something seems wrong with the blacklist.

  64. RS says:

    Steve,

    You will need to look at your web logs and see what client pingdom and basicstate use to check the status of your site, and remove those from the blacklist.

    Looking on google, pingdom uses GIGRIB, but that doesn’t seem to be in the list.

    Basicstate reports itself as Mozilla 4.0, which also shouldn’t be blocked given the current blacklist. Have you tried the latest version?

  65. RS says:

    by the by… here are the appropriate pages with that info:

    http://basicstate.com/htm/hand.htm

    http://uptime.pingdom.com/general/faq#12

    Not sure why Pingdom is being blocked, but basicstate may be being blocked because it reports itself as:

    Mozilla/4.0 (compatible; MSIE 6.0 Windows 2000; http://basicstate.com/)

    which may be getting caught by the rule “msie\ 6\.0\-”… but I’m not sure.

    To effectively troubleshoot, you would want to comment out the first block of user agents, test, if it works, uncomment the first block, then comment out the second block, test. Then just comment/uncomment a chunk of each until you whittle it down to the culprit. Not sure if anyone else uses these services, so any info you can give, would be great.
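
    The comment-and-test routine described above could look roughly like the sketch below. The two user-agent patterns are shortened stand-ins, not the real blocks, and the closing RewriteRule with the [G] flag is an assumption based on the 410 responses reported in this thread.

    ```apache
    RewriteEngine On

    # Round 1: first block switched off by prefixing each line with "#".
    # RewriteCond %{HTTP_HOST} !^(127\.0\.0\.0|localhost) [NC]
    # RewriteCond %{HTTP_USER_AGENT} (abot|accept|access) [NC]
    # RewriteRule .* - [G]

    # Second block left active. If the monitor still receives a 410,
    # the offending term is somewhere in this block.
    # (Stand-in pattern: replace with the real second block.)
    RewriteCond %{HTTP_HOST} !^(127\.0\.0\.0|localhost) [NC]
    RewriteCond %{HTTP_USER_AGENT} (casper|curl|cyber) [NC]
    RewriteRule .* - [G]
    ```

    Once the failing block is known, the same idea applies inside it: split its list of terms into two shorter RewriteCond patterns, disable one, and retest until a single term remains.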

  66. [ Gravatar Icon ] Steve says:

    agents report as:

    Agent: Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)

    Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; http://basicstate.com/)

    However, not even w3c (html check), reporting as Agent: W3C_Validator/1.1 can find the web site any more and reports a 410 (gone) so there must be a bit more wrong than just the monitors. I have to remove both UA blocks to get it going again.

  67. [ Gravatar Icon ] Jeff Starr says:

    Just a friendly reminder to remove any confounding variables while testing. For example, other htaccess directives may be interfering with the results if present in the same file or even inherited from a parent directory. Also be mindful of possible interference from PHP and/or other sources.

  68. [ Gravatar Icon ] Steve says:

    Thanks Jeff, I appreciate your support! I will look into it when I have more time on hand. So far there is nothing obvious to me that could cause a conflict.

    thx again

  69. [ Gravatar Icon ] Jeff Starr says:

    My pleasure, and I will be going through the comments here and doing what I can to look into things and keep things updated and current. This week’s just been a little crazier than most, but I should have more time to investigate next week. Huge thanks to you and everyone else who helps improve the blacklist by contributing information and feedback.

  70. [ Gravatar Icon ] Steve says:

    found the culprit!

    The 2010 UA blacklist code along with your hotlink protection code make a web site inaccessible. Try it!

  71. [ Gravatar Icon ] bob says:

    I put it on and noticed a few days later that the Google AdSense bot couldn’t crawl my site. I took off everything relating to ‘google’, ‘media’, ‘partner’, ‘ads’ and everything else related to their user_agent, but I still couldn’t work out which one it was. When I took it all off, the bot went back to crawling okay again. If anyone knows which one is the culprit I’d like to hear, because I wouldn’t mind putting it all back on again.

  72. [ Gravatar Icon ] Steve says:

    This is excellent work Jeff thank you very much. I have two questions;

    1. Can we submit suspicious user_agents for addition to the list?

    2. How often do you think you will have time to update the list?

  73. [ Gravatar Icon ] Steve says:

    I meant to add that for months I’ve had a persistent user agent that just has this in the user_agent field;

    '

    That’s all, a simple apostrophe.

  74. [ Gravatar Icon ] Jeff Starr says:

    @Steve:

    1. Yes, please post a comment here or send an email to jeff at this domain. I try to stay current with updates, but sometimes it takes a while. Keeping everything (issues, ideas, edits) in the thread makes updating things a little easier. Whatever works best for you.

    2. I try to do it every week, but sometimes it can take several. Kids are back in school tomorrow, so I should see a dramatic increase of free time =)

    I’ll look into the “apostrophe-only” user-agent for the next update. Thanks.

  75. [ Gravatar Icon ] Alex says:

    Ah, this list is quite good. It would be essential for a blog that was under attack before, it covers all of the bad user agents in my list.

    .htaccess files are indeed annoying to implement if they’re not defaults, 500 errors are just the server complaining something is not right.

    You may want to add the following enclosure for the rules though, to stop non-compliant installs from complaining:

    (mod_rewrite code here)

    Good work Jeff, I love the articles greatly!

  76. [ Gravatar Icon ] Alex says:

    Whoops, make that:

    <IfModule mod_rewrite.c>
    (htaccess code here)
    </IfModule>

    replacing the obvious

  77. [ Gravatar Icon ] Jeff Starr says:

    Good call, Alex – Thanks for the reminder :)

  78. [ Gravatar Icon ] Nicolás says:

    Finally, after a while of testing, I found the bot that disallows W3C Markup Validation.

    Search “ida” bot. It’s located between “|icopy|” and “|ie\/5\.0”.

    So… “|icopy|”…..IDA…..“|ie\/5\.0”.

    I hope I’ve been helpful !
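
    Removing “ida” helps because the blacklist terms are unanchored substrings matched case-insensitively via [NC], so “ida” also matches inside “W3C_Validator” (Val-ida-tor). An alternative to deleting the term is to exempt the validator with a negated condition placed ahead of the blacklist condition. The sketch below uses a shortened stand-in for the real list, and the [G] rule is an assumption based on the 410 responses reported here.

    ```apache
    RewriteEngine On

    # Exemption: skip the block when the user-agent names the W3C validator.
    RewriteCond %{HTTP_USER_AGENT} !W3C_Validator [NC]
    # Stand-in for the full blacklist pattern; "ida" stays in the list.
    RewriteCond %{HTTP_USER_AGENT} (icopy|ida|ie\/5\.0) [NC]
    # Consecutive RewriteCond lines are ANDed, so both must hold to send 410.
    RewriteRule .* - [G]
    ```

    The trade-off is that any client can claim to be the validator, so the exemption is only as trustworthy as the user-agent header itself.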

  79. [ Gravatar Icon ] Ray says:

    I’m guessing RSS parsers were added to the blacklist to prevent page scraping, but the list shouldn’t blacklist RSS parsers because some sites rely on them for widget blocks.

    I had the blacklist active on a site I was working on and it broke the site because I had SimplePie running on it.

    Just an FYI.

  80. [ Gravatar Icon ] Adam says:

    Hi John, thanks for this amazing resource. Now that I have added the blacklist, it has killed 99% of the bot problem I had, but I am still getting 2 that are getting through.

    One of them has an empty user agent and the other is identified by ‘crawl’.

    I have added crawl to your blacklist, but how do I stop ones with an empty user-agent string?

    Also, is it okay to block ‘crawl’ as well?

  81. [ Gravatar Icon ] James says:

    Hi
    Thanks for the code.
    I would like to use your code at :
    http://perishablepress.com/press/wp-content/online/code/2010-user-agent-blacklist.txt

    I have seen the httpd.conf file but don’t understand where to paste it.
    Can you please tell me where to paste it in httpd.conf?

    Please give some clear instructions.

    Thanks.
    James

  82. [ Gravatar Icon ] Daniel says:

    I think I found another good site for such : http://www.spanishseo.org/block-spam-bots-scrapers

    Hopefully it will be discussed to improve the list here.

  83. [ Gravatar Icon ] stOrM! says:

    Hmm, just a question.
    Can the UA Block List in its current form be considered error-free yet?

    Just asking because I’m far from calling myself a pro when it comes to these things. I noticed that for a few days I have been getting visits from a bot called: MaMa CaSpEr

    I asked Google about it, and a few sites later I was told that this seems to be a kind of exploit scanner. Which brings me to your site, especially to the Block List here, where I noticed Casper is included. So I included your list inside my htaccess file (what should I say? Now it will be blocked for sure because of the Internal Error 500 I get, lol).

    I mean my site went down now; I can’t access it anymore because of that internal error, which is the reason I ask whether the script is error-free now.

    Any advice would be highly appreciated…
    Not sure, just a guess: could it be that I have too many rewrite conditions inside my htaccess?

    I’m also not sure whether that bot always comes named as: MaMa CaSpEr
    If so, could you maybe show me a way to block only this one instead of that large list?

    Kindest regards,
    stOrM!
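
    For blocking just this one agent, a minimal sketch follows. It assumes the bot always sends a user-agent containing “MaMa CaSpEr” (not verified here) and that mod_rewrite is available; the [F] flag, which returns 403 Forbidden, is an illustrative choice and not taken from the blacklist.

    ```apache
    <IfModule mod_rewrite.c>
    RewriteEngine On
    # [NC] makes the match case-insensitive, so "MaMa CaSpEr" is covered.
    # The space is escaped because an unescaped space would end the pattern.
    RewriteCond %{HTTP_USER_AGENT} mama\ casper [NC]
    # "-" leaves the URL untouched; [F] answers with 403 Forbidden.
    RewriteRule .* - [F]
    </IfModule>
    ```

    A rule this short is also a convenient way to confirm that rewrite rules work on the server at all before pasting in the full list.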

  84. [ Gravatar Icon ] Daris says:

    Just a quick heads up. This list does cause issues with WP Super Cache. As someone noted above, they had issues with the first half of the list; I had issues with the second half. Once I removed it, the 410 errors went away.

    I’m not savvy enough to know whether this is a bad thing for the cache, but the cache had seemed to be doing its job while just throwing a 410 error during its self test.

    So Caveat Emptor if you’re a WP Super Cache user and want to use this script.

  85. [ Gravatar Icon ] stOrM! says:

    You’re right!
    I disabled all the cache stuff since I do not have a heavy load, so caching isn’t required much… It’s working now and the casper stuff is blocked too…

    Happily, everything is working as expected now. I will be watching the Apache logs to sort things out, but I couldn’t find any serious errors for now anyway. Very nice script, thank you for sharing it!!!

    Kindest regards,
    s!

  86. [ Gravatar Icon ] Charles says:

    Well, this list has helped me block some persistent Chinese servers that have been bothering me for a year… thanks.

    Two problems have come up, though.

    195.47.199.229 - - [02/Jan/2011:09:50:48 -0800] "GET / HTTP/1.1" 410 270 "-" "WatchMouse/18990 (http://watchmouse.com/ ; it)"
    208.85.4.114 - - [02/Jan/2011:09:55:52 -0800] "GET / HTTP/1.1" 410 270 "-" "WatchMouse/18990 (http://watchmouse.com/ ; ny)"
    208.122.4.142 - - [02/Jan/2011:09:58:23 -0800] "GET / HTTP/1.1" 410 363 "-" "FreeWebMonitoring SiteChecker/0.1 (+http://www.freewebmonitoring.com)"

    Both of these monitoring services are blocked by the list. I contacted WatchMouse and they confirmed that:

    Sun, 02 Jan 2010 18:00 - [WM] Stan P. van de Burgt:
    The default agent is WatchMouse/xx.yy where xx.yy is the version number

    What else could be causing this?

  87. [ Gravatar Icon ] Charles says:

    OK, well, WatchMouse lets you change the user-agent string,

    so changing it in WatchMouse to Firefox 3.5 fixes problem 1.

  88. [ Gravatar Icon ] Ed says:

    Quick note for anyone who has blocked blank user agents - if you are using Paypal IPN (Instant Payment Notifications) - Paypal for some reason uses a blank user agent - so you will fail to receive notifications of payments (eg: if running subscribe/member plugins using Wordpress or any other platform that you have integrated with Paypal).

    Spent several days trying to diagnose why my Paypal integration wasn’t working - only to find it was due to Paypal’s (in my opinion) ridiculous practice of using a blank user agent.

    Hope that helps someone avoid the same aggravation that I went through!
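
    One way to keep a blank-user-agent block without losing IPN calls is to exempt the listener URL. This is only a sketch: the path /paypal-ipn.php is a hypothetical placeholder for wherever the IPN listener actually lives, and the empty-agent rule itself is an illustration, not part of the published blacklist.

    ```apache
    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Hypothetical listener path: replace with the real IPN endpoint.
    RewriteCond %{REQUEST_URI} !^/paypal-ipn\.php$
    # Matches a missing or empty user-agent, or one that is a lone hyphen.
    RewriteCond %{HTTP_USER_AGENT} ^-?$
    # Both conditions must hold; [F] answers with 403 Forbidden.
    RewriteRule .* - [F]
    </IfModule>
    ```

    Any other service that calls the site without a user-agent would need its own exemption in the same style.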

  89. [ Gravatar Icon ] Kelly says:

    I was able to get Pingdom to work by removing pingd from the list.

    However, Google site verification doesn’t work. I know the culprit is at least somewhere in the second batch of user agents.

  90. [ Gravatar Icon ] Daniel says:

    The “|frontpage” in the blacklist caused Apache and cPanel at HM to think that FrontPage extensions were loaded, disabling the ability to password protect directories using cPanel or the file manager.

    Removing the |frontpage from the list fixed the issue.

  91. [ Gravatar Icon ] Thompson says:

    Thanks to the commenters for the tips to remove ‘ida’ for w3c validator and ‘pingd’ for pingdom uptime checker.

    Google site verification doesn’t work though… Anyone know?

  92. [ Gravatar Icon ] Johar says:

    Thanks for your bad bot agent list. I will choose several bad bots, not all of them :)

  93. [ Gravatar Icon ] Kelly says:

    Okay, for the Google site verification bot, “verif” should be removed.

    Also, not sure why, but removing “98” fixed all my WP sites giving 410 errors today in Chrome.

  94. [ Gravatar Icon ] Ade says:

    I have the same problem as above. Chrome 9 dislikes the blacklist and generates a 410. Had to remove the blacklist code for all of my sites.

  95. [ Gravatar Icon ] Tatiana says:

    Hi, I am new to the whole blogging thing, and I am not very technical. But I have been getting these 404 errors that look very strange, and I think they are coming from user agent MSIE 7.0, which I am assuming is one of those bad bots you were talking about in this article. What exactly do I have to do to stop that?
    Thank you

  96. [ Gravatar Icon ] vale says:

    Hi Jeff,
    I’m still learning a lot from your work, thanks again for sharing your knowledge!
    Is it normal that googlebot scans for a /contac/ page?!

    66.249.68.88 - - [14/Apr/2011:13:26:38 -0500] "GET /contac/ HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

  97. [ Gravatar Icon ] Jeff Starr says:

    Hi vale, not sure about “normal”, as googlebot will seek out anything it thinks is there.. if someone/thing has linked to a nonexistent page such as /contac/, then googlebot will go after it.

    Also, lots of malicious scripts include variations on the misspelled /contac/ URL.. and often these scripts are run via fake user-agents. You may want to verify it’s actually googlebot by doing a forward/reverse IP/host lookup.

  98. [ Gravatar Icon ] vale says:

    Hi Jeff, thanks a lot for your reply. I became aware of the “contac” case on your blog, which is why I got worried about this request. The IP for that user agent should be Google; at least that’s what I got doing whois, dig, and host with that IP. So I don’t really know what to think. From my logs I’ve also seen something else that worried me a bit: this bot was also crawling for CSS and JavaScript.

  99. [ Gravatar Icon ] Tatianna says:

    Hi Jeff
    I got this weird string in my 404 log, referring from wp-content, with my website name and at least 1000 characters. I don’t really know what that is or what to do about it.

  100. [ Gravatar Icon ] Curious says:

    Why are underscores ‘_’ escaped?

    \_

  101. [ Gravatar Icon ] Jeff Starr says:

    Good question. I’ve always assumed they required it, but now that you mention it, I suppose there’s no need for it.. I’ll look into it and maybe break the habit if it’s not necessary. Thanks :)

  102. [ Gravatar Icon ] Curious says:

    Thanks for replying. I was just wondering, as I didn’t think there was a need for them to be escaped, but I wasn’t sure or maybe there was some other intention.

    By the way, thank you for this list.

  103. [ Gravatar Icon ] matt says:

    I see google, googlebot and verif in the user agent blacklist…

    Forgive me as I don’t understand the code, but by including them doesn’t this mean that we are blocking Google from indexing my website?

    Basically I’d like only Bing, Ask, Yahoo and Google to index my site. All others can be blocked.

    Thanks!

  104. [ Gravatar Icon ] Jeff Starr says:

    matt, the list blocks “google wireless” and “googlebot-image”, but not “google” or “googlebot”. Either of these may be removed if you want to allow access. “verif” is included, and may be removed as well.

    And to confirm: Bing, Ask, Yahoo and Google are not blocked by this blacklist. They will have access and may index your site.

    Cheers!

  105. [ Gravatar Icon ] matt says:

    Jeff… I can’t thank you enough for making these tools to help block all the junk traffic on my sites.

    THANKS!!!