2010 User-Agent Blacklist
Posted on August 9, 2010 in Websites by Jeff Starr
The 2010 User-Agent Blacklist blocks hundreds of bad bots while ensuring open-access for the major search engines: Google, Bing, Ask, Yahoo, et al. Blocking bad user-agents is an effective addition to any security strategy. It works like this: your site is getting hammered by rogue bots that waste valuable server resources and bandwidth. So you grab a copy of the 2010 UA Blacklist from Perishable Press, include it in your site’s root .htaccess file, and enjoy a more secure and better performing website. It’s that easy.
Proven Security
The 2010 UA Blacklist has been carefully constructed based on rigorous server-log analyses. Obsessive daily log monitoring reveals bad bots scanning for exploits, spamming resources, and wasting bandwidth. As that malicious behavior is analyzed, evil bots are identified and added to the UA Blacklist. Blocked user-agents are denied access to your site, increasing efficiency and providing safety for your visitors.
Better Performance, Better SEO
Search engines such as Google are placing more weight on speedy, fast-loading websites. If your site is plagued with resource-devouring, bandwidth-wasting bots, its performance is probably not as good as it should be. Even if your site looks fine on the surface, without proper protection bad bots can gobble your bandwidth and leech your server resources. A single malicious bot can make hundreds or even thousands of requests in a very short period of time while scanning and probing for vulnerabilities. If Google visits while bad bots are hitting your site, your site’s SEO could suffer. Fortunately, the 2010 UA Blacklist protects your site against hundreds of nefarious bots, thereby fostering maximum performance for the search engines.
2010 User-Agent Blacklist
Here it is, presented as two sets of HTAccess directives:
RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost) [NC]
RewriteCond %{HTTP_USER_AGENT} .*(Firs|exac|Cloak|Detect|uchoo|beaut|ASPSeek|swish|ICS\)|MSIE\ 6\.0\;\ Windows\ NT\;\ DigExt\)|pt\-BR\;\ rv\:1\.9\.0\.3\)\ Firefox\/3\.0|pt\-BR\;\ rv\:1\.9\.0\.18\)\ Firefox\/3\.0|\!susie|\$x0e|\%0a|\%0d|\@\$x|\_irc|\_works|\+select\+|\+union\+|\<\?|1\,\1\,1\,|3gse|4all|4anything|5\.1\;\ xv6875\)|59\.64\.153\.|85\.17\.|88\.0\.106\.|98|a\_browser|a1\ site|abac|abach|abby|aberja|abilon|abont|abot|accept|access|accoo|accoon|aceftp|acme|active|address|adopt|adress|advisor|agent|ahead|aihit|aipbot|alarm|albert|alek|alexa\ toolbar\;\ \(r1\ 1\.5\)|alltop|alma|alot|alpha|america\ online\ browser\ 1\.1|amfi|amfibi|anal|andit|anon|ansearch|answer|answerbus|answerchase|antivirx|apollo|appie|arach|archive|arian|aboutoil|asps|aster|atari|atlocal|atom|atrax|atrop|attrib|autoh|autohot|av\ fetch|avsearch|axod|axon|baboom|baby|back|baid|bali|bandit|barry|basichttp|batch|bdfetch|beat|become|bee|beij|betabot|biglotron|bilgi|bison|bitacle|bitly|blaiz|blitz|blogl|blogscope|blogzice|bloob|blow|bord|boi|bond|boris|bost|bot\.ara|botje|botw|bpimage|brand|brok|broth|browseabit|browsex|bruin|bsalsa|bsdseek|built|bulls|bumble|bunny|busca|busi|buy|bwh3|cafek|cafi|camel|cand|captu|casper|catch|ccbot|ccubee|cd34|ceg|cfnetwork|cgichk|cha0s|chang|chaos|char|char\(|chase\ x|check\_http|checker|checkonly|chek|chill|chttpclient|cipinet|cisco|cita|citeseer|clam|claria|claw|clush|coast|code\.com|cogent|coldfusion|coll|collect|comb|combine|commentreader|common|compan|compatible\-|conc|conduc|contact|control|contype|conv|cool|copi|copy|coral|corn|cosmos|costa|cowbot|cr4nk|craft|cralwer|crank|crap|crawler0|crazy|cres|cs\-cz|cshttp|cuill|CURI|curl|curry|custo|cute|cyber|cz3|czx|daily|dalvik|daobot|dark|darwin|data|daten|dcbot|dcs|dds\ explorer|deep|deps|detect|dex|diam|diibot|dillo|ding|disc|disp|ditto|dlc|doco|dotbot|drag|drec|dsdl|dsok|dts|duck|dumb|eag|earn|earthcom|easydl|ebin|echo|edco|egoto|elnsb5|email|emer|empas|encyclo|enfi|enhan|enterprise\_search|envolk|erck|erocr|eventax|evere|evil|ewh|exploit|expre|extra|eyen|fang|fast|fastbug|faxo|fdse|feed24|feeddisc|feedhub|fetch|filan|fileboo|fimap|find|firebat|firedownload\/1\.2pre\ firefox\/3\.6|firefox\/0|firefox\/1|firefox\/2|firs|flam|flash|flexum|flip|fly|focus|fooky|forum|forv|fost|foto|foun|fount|foxy\/1\;|free|friend|frontpage|fuck|fuer|futile|fyber|gais|galbot|gbpl|gecko\/2001|gecko\/2002|gecko\/2006|gecko\/2009042316|gener|geni|geo|geona|geth|getr|getw|ggl|gira|gluc|gnome|go\!zilla|goforit|goldfire|gonzo|google\ wireless|googlebot\-image|gosearch|got\-it|gozilla|grab|graf|greg|grub|grup|gsa\-cra|gsearch|gt\:\:www|guidebot|guruji|gyps|haha|hailo|harv|hash|hatena|hax|head|helm|herit|heritrix|hgre|hippo|hloader|hmse|hmview|holm|holy|hotbar\ 4\.4\.5\.0|hpprint|httpclient|httpconnect|httplib|human|huron|hverify|hybrid|hyper|iaskspi|ibm\ evv|iccra|ichiro|icopy|ida|ie\/5\.0|ieauto|iempt|iexplore\.exe|ilium|ilse|iltrov|indexer|indy|ineturl|infonav|innerpr|inspect|insuran|intellig|interget|internet\_explorer|internet\x|intraf|ip2|ipsel|irlbot|isc\_sys|isilo|isrccrawler|isspi|jady|jaka|jam|jenn|jet|jiro|jobo|joc|jupit|just|jyx|jyxo|kash|kazo|kbee|kenjin|kernel|keywo|kfsw|kkma|kmc|know|kosmix|krae|krug|ksibot|ktxn|kum|labs|lanshan|lapo|larbin|leech|lets|lexi|lexxe|libby|libcrawl|libcurl|libfetch|libweb|libwww|light|linc|lingue|linkcheck|linklint|linkman|lint|list|litefeeds|livedoor|livejournal|liveup|lmq|locu|london|lone|loop|lork|lth\_|lwp|mac\_f|magi|magp|mail\.ru|main|majest|mam|mama|mana|marketwire|masc|mass|mata|mvi|mcbot|mecha|mechanize|mediapartners|metadata|metalogger|metaspin|metauri|mete|mib\/2\.2|microsoft\.url|microsoft\_internet\_explorer|mido|miggi|miix|mindjet|mindman|mips|mira|mire|miss|mist|mizz|mj12|mlbot|mlm|mnog|moge|moje|mooz|more|mouse|mozdex) [NC]
RewriteRule ^.*$ - [G]
RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost) [NC]
RewriteCond %{HTTP_USER_AGENT} .*(Windows\ NT\ 6\.1\;\ tr\;\ rv\:1\.9\.2\.6\)|mozilla\/0|mozilla\/1|mozilla\/2|mozilla\/3|mozilla\/4\.61\ \[en\]|mozilla\/firefox|mpf|msie\ 1|msie\ 2|msie\ 3|msie\ 4|msie\ 5|msie\ 6\.0\-|msie\ 6\.0b|msie\ 7\.0a1\;|msie\ 7\.0b\;|msie6xpv1|msiecrawler|msnbot\-media|msnbot\-products|msnptc|msproxy|msrbot|musc|mvac|mwm|my\_age|myapp|mydog|myeng|myie2|mysearch|myurl|nag|name|naver|navr|near|netants|netcach|netcrawl|netfront|netinfo|netmech|netsp|netx|netz|neural|neut|newsbreak|newsgatorinbox|newsrob|newt|next|ng\-s|ng\/2|nice|nikto|nimb|ninja|ninte|nog|noko|nomad|norb|note|npbot|nuse|nutch|nutex|nwsp|obje|ocel|octo|odi3|oegp|offby|offline|omea|omg|omhttp|onfo|onyx|openf|openssl|openu|opera\ 2|opera\ 3|opera\ 4|opera\ 5|opera\ 6|opera\ 7|orac|orbit|oreg|osis|our|outf|owl|p3p\_|page2rss|pagefet|pansci|parser|patw|pavu|pb2pb|pcbrow|pear|peer|pepe|perfect|perl|petit|phoenix\/0\.|php|phras|picalo|piff|pig|pingd|pipe|pirs|plag|planet|plant|platform|playstation|plesk|pluck|plukkie|poe\-com|poirot|pomp|post|postrank|powerset|preload|press|privoxy|probe|program\_shareware|protect|protocol|prowl|proxie|proxy|psbot|pubsub|puf|pulse|punit|purebot|purity|pyq|pyth|query|quest|qweer|radian|rambler|ramp|rapid|rawdog|rawgrunt|reap|reeder|refresh|reget|relevare|repo|requ|request|rese|retrieve|rip|rix|rma|roboz|rocket|rogue|rpt\-http|rsscache|ruby|ruff|rufus|rv\:0\.9\.7\)|salt|sample|sauger|savvy|sbcyds|sbider|sblog|sbp|scagent|scanner|scej\_|sched|schizo|schlong|schmo|scorp|scott|scout|scrawl|screen|screenshot|script|seamonkey\/1\.5a|search17|searchbot|searchme|sega|semto|sensis|seop|seopro|sept|sezn|seznam|share|sharp|shaz|shell|shelo|sherl|shim|shopwiki|silurian|simple|simplepie|siph|sitekiosk|sitescan|sitevigil|sitex|skam|skimp|sledink|sleip|slide|sly|smag|smurf|snag|snapbot|snapshot|snif|snip|snoop|sock|socsci|sogou|sohu|solr|some|soso|spad|span|spbot|speed|sphere|spin|sproose|spurl|sputnik|spyder|squi|sqwid|sqworm|ssm\_ag|stack|stamp|statbot|state|steel|stilo|strateg|stress|strip|style|subot|such|suck|sume|sunos\ 5\.7|sunrise|superbot|superbro|supervi|surf4me|surfbot|survey|susi|suza|suzu|sweep|sygol|synapse|sync2it|systems|szukacz|tagger|tagoo|tagyu|take|talkro|tamu|tandem|tarantula|tbot|tcf|tcs\/1|teamsoft|tecomi|teesoft|teleport|telesoft|tencent|terrawiz|test|texnut|thomas|tiehttp|timebot|timely|tipp|tiscali|titan|tmcrawler|tmhtload|tocrawl|todobr|tongco|toolbar\;\ \(r1|topic|topyx|torrent|track|translate|traveler|treeview|tricus|trivia|trivial|true|tunnel|turing|turnitin|tutorgig|twat|tweak|twice|tygo|ubee|ultraseek|unavail|unf|universal|unknown|upg1|uptime|urlbase|urllib|urly|user\-agent\:|useragent|usyd|vagabo|valet|vamp|vci|veri\~li|verif|versus|via|virtual|visual|void|voyager|vsyn|w0000t|w3search|walhello|walker|wand|waol|watch|wavefire|wbdbot|weather|web\.ima|web2mal|webarchive|webbot|webcat|webcor|webcorp|webcrawl|webdat|webdup|webgo|webind|webis|webitpr|weblea|webmin|webmoney|webp|webql|webrobot|webster|websurf|webtre|webvac|webzip|wells|wep\_s|wget|whiz|widow|win67|windows\-rss|windows\ 2000|windows\ 3|windows\ 95|windows\ 98|windows\ ce|windows\ me|winht|winodws|wish|wizz|wordp|worio|works|world|worth|wwwc|wwwo|wwwster|xaldon|xbot|xenu|xirq|y\!tunnel|yacy|yahoo\-mmaudvid|yahooseeker|yahooysmcm|yamm|yand|yandex|yang|yoono|yori|yotta|yplus\ |ytunnel|zade|zagre|zeal|zebot|zerx|zeus|zhuaxia|zipcode|zixy|zmao) [NC]
RewriteRule ^.*$ - [G]
To implement the UA Blacklist, simply paste both sets of directives into your site’s root .htaccess file (or, even better, the Apache configuration file). Upload, test, and stay current with updates and news.
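The two sets of directives only take effect when mod_rewrite is loaded and the rewrite engine is switched on in the same context. A minimal sketch of a self-contained wrapper follows; the three agent names are placeholders borrowed from the list and stand in for the full conditions:

```apache
# Skip the whole block quietly if mod_rewrite is not loaded.
<IfModule mod_rewrite.c>
    # Rewrite rules are ignored unless the engine is on in this context.
    RewriteEngine On

    # Stand-in for the full blacklist conditions shown above.
    RewriteCond %{HTTP_USER_AGENT} (libwww|nikto|wget) [NC]
    # "-" leaves the URL untouched; [G] answers with 410 Gone.
    RewriteRule ^ - [G]
</IfModule>
```

The wrapper trades an immediate 500 error for silence: if the module is missing, nothing is blocked at all, so it is worth confirming that a listed agent really does receive a 410.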
Important Note
The UA Blacklist uses hundreds of regular expressions to block bad bots based on their user-agent. Each of these regular expressions can match many different user-agents. Care has been taken to ensure that only bad bots are blocked, but false positives are inevitable. If you know of a user-agent that should be removed from the list, please let me know. I will do my best to update things asap.
Bottom line: Only use this code if you know what you are doing. It’s not a “fix-it-and-forget” situation, especially for production sites. It’s more like a “fix-it-and-keep-an-eye-on-it” kind of thing, meant for those who understand how it works. As mentioned in the comments, the 2010 User-Agent Blacklist is a work in progress. Please use the UA Blacklist with caution and at your own risk.
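One way to keep an eye on false positives before blocking anything is a dry run that tags matching requests and writes them to their own log. The sketch below is an assumption-laden illustration rather than part of the blacklist: the variable name `bad_ua` and the log path are made up, the short pattern stands in for the full list, and the `combined` log format is assumed to be defined as in the stock Apache configuration.

```apache
# Server or virtual-host configuration only: CustomLog is not allowed in .htaccess.
RewriteEngine On

# Tag matching requests instead of refusing them.
RewriteCond %{HTTP_USER_AGENT} (libwww|nikto|wget) [NC]
RewriteRule ^ - [E=bad_ua:1]

# Log only the requests that carry the tag (hypothetical file name).
CustomLog "logs/bad_ua.log" combined env=bad_ua
```

After a few days, the log shows which agents would have been turned away, and any legitimate ones can be dropped from the pattern before the rule is switched to [G].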
So much more..
For those new to Perishable Press, please check out some of my other security resources:
- 2010 IP Blacklist
- The Perishable Press 4G Blacklist
- Blacklist Library
- HTAccess Library
- Security Library
Security is an important part of what I do around here, so please chime in with any suggestions, ideas, and comments. Thank you for visiting Perishable Press.
I am not that good at SEO. My question is: do small search engines help at all?
It’s all relative.. for very low-traffic sites, they may add a few hits, but for anything larger, they do more harm than good. Generally speaking.
When I add the code, the website stops working and I get a 500 error. Any ideas?
Cool, thanks Jeff. I have a fairly new site and I realise that a lot of dubious blogs are stealing my content and linking back to my site. I wonder if this is good or not.
Thanks for your blogs, I’m a fan
Let me look into it.. something may have been munged in the process of putting the article together. I’ll report back here as soon as possible.
Got it! The first wildcard dot (“.”) was missing from the first rule. I have updated the article and text file so they should be working now.
I also get a 500 error. Your IP list works perfectly when I paste that in.
Thanks.
Wouldn’t it be better to whitelist the good ones?
For those of us that like other servers, do you have a source list of user-agents?
@Eric: There was an update to the UA Blacklist almost immediately after posting. Have you tried the most recent version?
@René: That is also an option, but there are many “good” bots that would need to be included. Of course, some people prefer to whitelist only the major search engines (Google et al).
@Timothy Warren: Not sure what you mean by “source list”.. The user-agents blocked by this blacklist are matched using regular expressions, so there are many more bots that are blocked than there are entries in the list.
Working now, thank you!
Seeing a weird issue where I’m getting “Request Exceeded the limit of 10 internal redirects” when I’ve put this code in my httpd.conf.
Any thoughts? I should dig into it more.
This new version looks better than ever, Jeff. Your work here is frequently brilliant and always appreciated.
These rules prevent the CSS validator from working:
http://jigsaw.w3.org/css-validator/
Taking them out allowed it to work again
@RS: Some hosts restrict what you can do with htaccess files. It sounds like they may be limiting the number and/or length of rewrite conditions. It could also be a placement issue. Consult your host for more information.
@Ade: Thank you kindly :)
@Eric: I’ll check it out.. what is the UA used for the validator?
I’m the host… so, I guess I can talk to myself. =P
It sounds like it’s doing a redirect loop. Guess I’ll have to look into it more. But I’m guessing upping the rewrite limit will just make it loop that much more.
This is with apache 2.2. =/
@RS: And the list is located in the root htaccess file? Perhaps the Apache rewrite log will help sort it out..
I put it in my vhost file… but I’ll try htaccess.
and yeah, I’ll turn on the rewrite log and take a look.
Will post back.
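For anyone following the rewrite-log suggestion on Apache 2.2, a minimal sketch looks like the following. The log path is a placeholder, and both directives are only accepted in the server or virtual-host configuration, not in .htaccess:

```apache
# Apache 2.2, server or virtual-host context only.
RewriteLog "logs/rewrite.log"
# 0 disables logging; higher levels are more verbose and slower.
RewriteLogLevel 3

# Apache 2.4 dropped both directives; the equivalent there is:
# LogLevel alert rewrite:trace3
```

The log records each condition and rule as it is evaluated, which should show whether a request is really looping or is simply being answered by the [G] rule.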
Thanks a lot Jeff for this! I feel safer now, because I’m getting a lot of spam and the load on the server goes up.
On the lines starting with RewriteCond, your previous blacklist had [NC,OR] at the end; it is now [NC]. What does OR mean and why is it not being used now?
Really cool blacklist :D :D Nice work
Sorry, I am a beginner with the “spam” problem.
Which one is more effective:
IP blocking or UA blocking?
@Dean: The [OR] means just that – “or”, such that multiple rewrite conditions may be evaluated for any given RewriteRule. I’ll check out the previous blacklist and see what’s up, but for our purposes here, no [OR] flags are required.
@Julian: It depends on how you are using either method. It’s a lot easier to fake a user-agent than it is an IP address, so I usually reserve IP-blocking for specific cases and user-agent blocking for known bad bots. If you need to block an entire country, blocking via IP is the way to go. If you want to keep the evil “diavol” bot away, blocking the user-agent is the way to go. Great question!
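On the [OR] question: consecutive RewriteCond lines are joined by an implicit AND unless a line carries the [OR] flag, which joins it to the condition that follows. The sketch below shows how one long alternation could be split into shorter conditions chained with [NC,OR]; the agent names are a small sample from the list, and the grouping into three lines is arbitrary:

```apache
RewriteEngine On

# Each [OR] joins a condition to the one that follows it.
RewriteCond %{HTTP_USER_AGENT} (aceftp|acme|aihit) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (larbin|libwww|lwp) [NC,OR]
# The last condition in the chain carries no [OR].
RewriteCond %{HTTP_USER_AGENT} (nikto|nutch|wget) [NC]
RewriteRule ^ - [G]
```

The single-pattern form in the article needs no [OR] because the alternation inside the parentheses already does the same job. Shorter lines may be easier to bisect when one entry turns out to be a problem.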
Jeff, do you know the difference between an evil bot and the regular check for feed updates by an RSS reader?
It seems not, or your “carefully constructed” blacklist would not include liferea…
@Adrian: All I know is what my server logs tell me. It’s impossible to research every malicious request, so if a bot is acting like a twerp, it’s added to the list.
Even so, I am more than happy to edit the blacklist to account for legitimate bots that may otherwise look suspicious. Just provide a reference – no need to be insulting.
Hey, Jeff, I just wanted to point out something that happened on my site (running on WP) that seems to be triggered by your user agent blacklist.
I copied the blacklist into my .htaccess file yesterday, then last night I scheduled a post to be published at around 11am this morning. The post never got published and it was listed in my dashboard as “Missed”. This is the first time I’ve ever had a post delayed or missed when using WP’s scheduling.
So, I ran a few tests on my local WP install, and each time I had the user agent blacklist in my .htaccess file, I could not get a scheduled post to go live using the automated system. If I remove the blacklist, the scheduling works fine.
This link has some more info on the scheduling issue, but I just thought I’d let you know about it to see if maybe you can figure out why your blacklist seems to be causing this problem.
Thanks!
Hi,
Thanks for the list, really great help. Although I’m having difficulty with it. I’m trying to use it on a site I have hosted (with streamline.net) but when I add this or your IP blacklist to the htaccess I get a 500 error.
The logs give the error “…..htaccess: RewriteCond: cannot compile regular expression ‘.*(Firs|exac|Cl…” and similar with the IP blacklist.
Is this a host problem that I cannot solve? (Unfortunately budget is limited to shared hosting with a budget host.)
Thanks again,
Julian.
@myself
To follow up, just tried only using each half of your rules. Works fine with “RewriteCond %{HTTP_USER_AGENT} .*(Windows\ NT\…” but not the first rule.
Will try more tomorrow and share anything I find.
Thanks,
Julian
Jeff, are you seriously saying you didn’t manually check your list for false positives before telling people to use it?
Or why did you include strings like “firefox\/3\.0”, “w3m” or “windows\ 98” in your list?
Looking at your list, it seems to be full of false positives, BLOCKING LEGITIMATE USERS.
@Adrian: lol - not sure how “seriously” I’m saying anything. I simply do my best to analyze, test, and research things as much as possible. As a full-time dad working 2.5 full-time jobs to keep up, my “blacklist” time is limited. I simply enjoy the process and try to share as much information as possible.
Blacklists – especially user-agent blacklists – are a work in progress and are never perfect. The idea is to help improve the code by offering constructive input.
@Adrian . . “Looking at your list, it seems to be full of false positives, BLOCKING LEGITIMATE USERS”
Maybe you can help us out. So you’re saying an average person surfing the web, with little technical expertise other than surfing favorite pages, would be blocked from a site using this list? Nothing personal, just trying to figure out what you mean by legitimate users.
Some good points here in the comments have resulted in a new “Important Note” section in the article. It basically informs people that false positives are inevitable and only use the list if you know what you’re doing. Thanks to Adrian and Steve for the perspective.
I just wanted to share that I read your blog regularly and enjoy it quite a bit. Some of these comments lack the tact / gratefulness for the quality writing and thoughts that you share with us.
I hope that some of this feedback did not dampen your enthusiasm for blogging here. Cheers.
@ Eric Curtis, you are def right, i call them haters. They can’t think for themselves, they only know how to criticise.
Steve, an average person using a device running Windows CE will be blocked from surfing his favorite pages on a webserver using these .htaccess directives due to the “windows\ ce” entry.
The same goes for anyone using Firefox 3.0 due to the “firefox\/3\.0” entry.
I do not care what other people write about me; the fact is that using these .htaccess directives as recommended in this blog post is a bad thing.
My constructive input is that each entry in such a list has to be checked manually before using it or even publicly recommending it to other people.
@Adrian that’s cool.
Firefox 3 was released on June 17, 2008. I certainly hope average users have upgraded by now. Duh!
This release is only a couple days old so it will all get sorted out. There is something that appeals to me about blocking hundreds of bad bots. And any temporary inconvenience for a lone surfer here and there won’t be major. If someone is really into a web site anyway they would contact administration and say “hey I got blocked from the site whats up?”
Steve, people are copying your current list, and might not update it.
And people will not necessarily contact the administrator of a site. They might just give up. Or write a post in their blog that Google Reader is working with some feed but buggy Liferea isn’t. Or complain to us Liferea developers about some non-working feed. And then I might be the one spending time on debugging why the feed works with Google Reader (not blocked) but not with Liferea (blocked).
When doing blacklisting each false positive is one too many. And there are *many* in your list.
Firefox 3.0 still has a market share around 2%, so your blacklisting blocks 2% of all site visitors/customers.
On the other hand, e.g. w3m is a fine albeit exotic browser with a userbase so small that I really wonder how it made it into your blacklist - and an administrator might not get user reports for that false positive for months.
Adrian: Here’s an idea… Why don’t you go through the blacklist and modify it how you see fit, such that it minimizes/eliminates FPs, and then post it back here. So, you actually produce something constructive rather than telling Steve, who’s already said he’s a bit overworked, that he should fix it himself.
Help out… don’t just point out the problems, help fix them. If your blacklist works better, people will use it. That’s how this works. Steve offered up his list in the hopes it will help someone, not so he can be told it sucks.
Doh, I said Steve, I meant Jeff. Ugh… it’s only noon, and it already feels like a long day.
Doh, I had the same Jeff/Steve mixup. Sorry!
Update: The following items have been removed from the UA Blacklist:
- w3c
- w3m
- lifearea
- firefox/3.0
- firefox/3.0.10
Here are the remaining Firefox-related user-agent strings that appear in the list (with escape characters removed):
- firefox/0
- firefox/1
- firefox/2
- mozilla/firefox
- pt-BR; rv:1.9.0.3) Firefox/3.0
- pt-BR; rv:1.9.0.18) Firefox/3.0
- firedownload/1.2pre firefox/3.6
The last four of these aren’t legit, and the first three block very old versions, which according to my stats account for a very small percentage of visitors. We’re blocking many more bad bots with these than actual browsers.
Some notes:
- Removing “w3c” should resolve any W3C Validator issues.
That’s it for now.. keep any suggestions coming, and I’ll do my best to keep things current. Thanks :)
Thanks Jeff!
The only one I think I might modify would be to add a RewriteCond that basically says:
RewriteCond %{HTTP_HOST} !^127\.0\.0\.1
between the user-agent strings and the actual block rule. A fair number of people will use some php, curl, wget, etc. on their local machine to trigger some event on their site, and the useragent list you have blocks many of these (in my case, I use curl to dump some data into a MySQL DB every minute that’s run by cron).
Obviously YMMV, but it wouldn’t be a bad idea. Might also need one that says:
RewriteCond %{HTTP_HOST} !^localhost
But I’m pretty sure you can combine those. =)
@Louis: Hey just saw your comment – somehow missed it initially.
I’m not sure why a user-agent blacklist would stop WordPress from executing PHP on the server. If there is some sort of WP UA string involved in the process, we could remove the regex match from the list.
I’ll keep my eyes on it. Thanks for the feedback.
Jeff,
This may be the exact thing I just wrote about. I think wordpress’s cron script may use libcurl to trigger the publish… but I could be wrong, since I don’t think I’ve ever scheduled a post for the future. =/
@RS: Great timing – adding that condition is a great idea, and may indeed fix the issue with WordPress scheduling mentioned in previous comment. If so, sweet. :)
I’ll be updating the list again as soon as I can think thru using an [OR] flag for the first condition in either list.
Actually, yeah, good point.
Might be best to do something like:
RewriteCond %{HTTP_HOST} !^(127\.0\.0\.1|localhost)
RewriteCond %{HTTP_USER_AGENT} .....your_useragent_list...
RewriteRule ^.*$ - [G]
That way, the easy rule (is this coming from someone other than localhost) is done first, then the hard rule is done second.
I don’t think an [OR] is needed. This would be an “AND”… so, NOT localhost, AND one of these useragents. If we use an [OR], it would still trigger if the localhost was connecting with curl.
Or am I thinking about this wrong? Again, this day has been really long already.
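The AND/OR reasoning above can be sketched as follows. In mod_rewrite an [OR] groups a condition with the one that follows it, so the chain below reads as "not local AND (first pattern OR second pattern)". Note one substitution that is not in the thread: `%{HTTP_HOST}` holds the host name the client asked for, whereas `%{REMOTE_ADDR}` holds the address the request came from, so this sketch tests the latter. The agent names are a small sample standing in for the full list.

```apache
RewriteEngine On

# Implicit AND: the request must NOT come from the loopback address...
RewriteCond %{REMOTE_ADDR} !^(127\.0\.0\.1|::1)$
# ...AND the user-agent must match one of the two ORed conditions below.
RewriteCond %{HTTP_USER_AGENT} (curl|libwww|wget) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (nikto|nutch) [NC]
RewriteRule ^ - [G]
```

A cron job that calls the site by its public host name may arrive from the server's public address rather than from loopback, in which case that address would need to be added to the first condition.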
RS, I’d say the way the list gets constructed is wrong. I would say you cannot create a blacklist this way in an automated way and then try to find all false positives.
Jeff still claims the list was “carefully constructed based on rigorous server-log analyses” although the list is in reality full of false positives - I’d better not say publicly what I think about that.
Another interesting question is whether people using “evil bots” really use identifiable user agent strings as the blog claims. If I wrote an evil bot, I’d use the user agent string of the latest Firefox and all this user agent blocking wouldn’t affect my evil bot. If Jeff has data on what percentage of *all* accesses this blacklist actually blocks, that would be interesting information.
Sounds like a job for Splunk! =)
Yes, definitely no [OR] flags! – otherwise the list would block everything that isn’t localhost. In any case, the list has been updated with these helpful conditions. Thanks to RS for the suggestion! :)
Is this included in the Block Bad Queries WordPress plugin?
Jeff,
Adding the above rules to the 4G list is now blocking “w3.org”, and probably others too. Haven’t checked other parts of WordPress. Will report later.
Can you advise how to unblock w3.org?
Jeff,
Adding the above rules to the existing (working) 4G list is now blocking “w3.org” with a “410” error code, and probably others too. Haven’t checked other parts of WordPress. Will report later.
In the meantime, can you advise how to unblock w3.org?
@Paris Vega: No, this is an entirely different set of rules. :)
@DG: Is w3.org the name of a user-agent? This blacklist only blocks certain UA strings and I’m not seeing w3.org listed anywhere..
Are you sure about this blacklist’s ability to speed up a website? With a blacklist like this, every single request has to be checked against the blacklist. The checking will reduce performance as every user agent has to be checked against a regular expression.
Is the gain from blocking bad bots greater than the loss from the checks?
Jeff,
I think W3.org’s default user-agent is “W3C_Validator/xx.xxxx” - not exactly sure.
@Daniel15: Good point. There are many factors that determine performance, including server resources, HTTP requests, and so on. When a site is deluged with spammers and crackers looking for exploits, things can really slow down. In such cases, the blacklist will help by eliminating the waste and opening up server resources for legitimate traffic. Conversely, running the blacklist on a site that doesn’t have a lot of bad traffic may reduce performance, if only slightly. Great question, and one of the reasons why the “only-use-if-you-know-what-you-are-doing” disclaimer is included in the article.
@DG: It looks like the “valid” regex pattern was matching against the W3C Validation service. This has been corrected for both the in-post blacklist and the text-file version. Thanks for the help! :)
Jeff,
No, it didn’t work; it looks like the “valid” regex pattern is innocent. Still, w3.org is generating “410” on both the HTML and CSS validation services. It’s something else. Anyway, I’ve just added the rules to the existing “4G Blacklist”; let me check other parts also.
Will update you later.
By the way, Jeff, your present theme and the color combination are superb. One of my favorite colors.
Trying this today….
It has improved performance. Well, how ya like that!
Hi, is there a list of UA that I can set my browser to so I can test this?
Can anyone get me this for Nginx?
On multiple WordPress installations, after I have updated my .htaccess file with this blacklist code, I am not able to use the Flash uploader for the media library. It gives me an error that says “http error” in red letters. Once I removed this code from my .htaccess file, I didn’t have a problem anymore.
Problem:
Hmm, I hit a problem after adding the 2010 UA Blacklist to htaccess - the web site monitor basicstate.com reports the website as down. Tried HTML validation at w3c and it gives me error 410, so something is wrong.
Removed the first/upper part of the code, leaving the lower part in place:
RewriteCond %{HTTP_HOST} !^(127\.0\.0\.0|localhost) [NC]
RewriteCond %{HTTP_USER_AGENT} .*(Windows\ …..
and all was well again. ideas?
correction!
had to remove all blacklist code to have the monitors pingdom/basicstate showing green again! something seems wrong with the blacklist.
Steve,
You will need to look at your web logs and see what client pingdom and basicstate use to check the status of your site, and remove those from the blacklist.
Looking on google, pingdom uses GIGRIB, but that doesn’t seem to be in the list.
Basicstate reports itself as Mozilla 4.0, which also shouldn’t be blocked given the current blacklist. Have you tried the latest version?
by the by… here are the appropriate pages with that info:
http://basicstate.com/htm/hand.htm
http://uptime.pingdom.com/general/faq#12
Not sure why Pingdom is being blocked, but basicstate may be being blocked because it reports itself as:
Mozilla/4.0 (compatible; MSIE 6.0 Windows 2000; http://basicstate.com/)
Which may be catching the rule “msie\ 6\.0\-”… but not sure.
To effectively troubleshoot, you would want to comment out the first block of user agents, test, if it works, uncomment the first block, then comment out the second block, test. Then just comment/uncomment a chunk of each until you whittle it down to the culprit. Not sure if anyone else uses these services, so any info you can give, would be great.
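To see exactly which user-agent string a monitor or validator sends, and whether it was answered with a 410, a dedicated access log can help. This is only a sketch: the format nickname `ua_status` and the file name are made up, and the directives belong in the server or virtual-host configuration rather than in .htaccess.

```apache
# Server or virtual-host configuration only.
# %h = client address, %>s = final status, then the User-Agent header and the request line.
LogFormat "%h %>s \"%{User-Agent}i\" \"%r\"" ua_status
CustomLog "logs/ua_status.log" ua_status
```

Searching that file for lines with status 410 lists the agents the blacklist is turning away, which narrows down the comment-and-test cycle described above.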
agents report as:
Agent: Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; http://basicstate.com/)
However, not even w3c (html check), reporting as Agent: W3C_Validator/1.1 can find the web site any more and reports a 410 (gone) so there must be a bit more wrong than just the monitors. I have to remove both UA blocks to get it going again.
Just a friendly reminder to remove any confounding variables while testing. For example, other htaccess directives may be interfering with the results if present in the same file or even inherited from a parent directory. Also be mindful of possible interference from PHP and/or other sources.
Thanks Jeff, I appreciate your support! I will look into it when I have more time on hand. So far there is nothing obvious to me that could cause a conflict.
thx again
My pleasure, and I will be going through the comments here and doing what I can to look into things and keep things updated and current. This week’s just been a little crazier than most, but I should have more time to investigate next week. Huge thanks to you and everyone else who helps improve the blacklist by contributing information and feedback.
found the culprit!
The 2010 UA blacklist code along with your hotlink protection code make a web site inaccessible. Try it!
I put it on and noticed a few days later that the Google AdSense bot couldn’t crawl my site. I took off everything relating to ‘google’, ‘media’, ‘partner’, ‘ads’ and everything else related to their user_agent, but I still couldn’t work out which one it was. When I took it all off, it went back to crawling okay again. If anyone knows which one is the culprit I’d like to hear, because I wouldn’t mind putting it all back on again.
This is excellent work Jeff thank you very much. I have two questions;
1. Can we submit suspicious user_agents for addition to the list?
2. How often do you think you will have time to update the list?
I meant to add that for months I’ve had a persistent user agent that just has this in the user_agent field:
'
That’s all, a simple apostrophe.
@Steve:
1. Yes, please post a comment here or send an email to jeff at this domain. I try to stay current with updates, but sometimes it takes awhile. By keeping everything (issues, ideas, edits) in the thread, it makes updating things a little easier. Whatever works best for you.
2. I try to do it every week, but sometimes it can take several. Kids are back in school tomorrow, so I should see a dramatic increase of free time =)
I’ll look into the “apostrophe-only” user-agent for the next update. Thanks.
Ah, this list is quite good. It would be essential for a blog that has been under attack before, and it covers all of the bad user agents in my list.
.htaccess files are indeed annoying to implement if they’re not defaults; 500 errors are just the server complaining that something is not right.
You may want to add the following enclosure around the rules, though, so that installs without mod_rewrite don’t complain:
(mod_rewrite code here)
Good work Jeff, I love the articles greatly!
Whoops, make that:
<IfModule mod_rewrite.c>(htaccess code here)</IfModule>
replacing the obvious
Good call, Alex – Thanks for the reminder :)
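For anyone unsure what that wrapper looks like in practice, here is a minimal sketch. The two-token pattern is a shortened stand-in for the full blacklist, and the [G] flag (410 Gone) is an assumption based on the 410 responses reported in this thread; keep whatever rule the actual blacklist file ends with.

```apache
# Sketch only: skip the rewrite rules entirely when mod_rewrite is not loaded,
# so the server ignores them instead of returning a 500 error.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Shortened stand-in for the full blacklist pattern
    RewriteCond %{HTTP_USER_AGENT} (aspseek|casper) [NC]
    # [G] answers with 410 Gone (assumed here; [F] would answer 403 Forbidden)
    RewriteRule .* - [G]
</IfModule>
```

One trade-off worth knowing: on a server without mod_rewrite the wrapper hides the problem, so the blacklist silently does nothing rather than failing loudly.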
Finally, after a while of testing, I found the bot that disallows W3C Markup Validation.
Search for the “ida” bot. It’s located between “|icopy|” and “|ie\/5\.0”.
So… “|icopy|”…..IDA…..“|ie\/5\.0”.
I hope I’ve been helpful!
I’m guessing RSS parsers were added to the blacklist to prevent page scraping, but the list shouldn’t blacklist RSS parsers because some sites rely on them for widget blocks.
I had the blacklist active on a site I was working on and it broke the site because I had SimplePie running on it.
Just an FYI.
Hi John, thanks for this amazing resource. Now that I have added the blacklist it has killed 99% of the bot problem I had, but I am still getting 2 that are getting through.
One of them has an empty user agent and the other is identified by ‘crawl’.
I have added crawl to your blacklist, but how do I stop the ones with an empty user agent string?
Also, is it okay to block ‘crawl’ as well?
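A minimal sketch of one way to handle both cases with mod_rewrite, kept separate from the main blacklist. It assumes the requests really arrive with no User-Agent value at all, and the 403 response via [F] is just one choice. Note that `crawl` is unanchored, so it matches any agent whose string contains it; check your logs before enabling that line.

```apache
RewriteEngine On
# Match requests whose User-Agent header is missing or empty
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
# Match any agent string containing "crawl", case-insensitively
RewriteCond %{HTTP_USER_AGENT} crawl [NC]
# Refuse matching requests with 403 Forbidden
RewriteRule .* - [F]
```

Blocking empty user agents can also block legitimate services; a later comment in this thread reports that PayPal IPN stopped working for exactly that reason.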
Hi
Thanks for the code.
I would like to use your code at :
http://perishablepress.com/press/wp-content/online/code/2010-user-agent-blacklist.txt
I have seen the httpd.conf file but don’t understand where to paste it.
Can you please tell me where to paste it in httpd.conf?
Please give some clear instructions.
Thanks.
James
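One common arrangement, sketched here under assumptions: mod_rewrite rules placed inside a `<Directory>` block in httpd.conf behave like the same rules in that directory's .htaccess file. The directory path below is hypothetical and the pattern is a shortened stand-in for the full blacklist; substitute your own document root and the real rules.

```apache
# httpd.conf (or a file it includes). The directory path is hypothetical.
<Directory "/var/www/example.com/public_html">
    <IfModule mod_rewrite.c>
        RewriteEngine On
        # Shortened stand-in for the full blacklist pattern
        RewriteCond %{HTTP_USER_AGENT} (aspseek|casper) [NC]
        # [G] (410 Gone) is assumed from the 410 responses reported in this thread
        RewriteRule .* - [G]
    </IfModule>
</Directory>
```

Unlike .htaccess, changes to httpd.conf take effect only after Apache is reloaded or restarted.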
I think I found another good site for this: http://www.spanishseo.org/block-spam-bots-scrapers
Hopefully it will be discussed to improve the list here.
Hmm just a question.
Can the UA Block List in its current form be considered error free yet?
I’m just asking because I’m far from calling myself a pro when it comes to these things. I noticed that for a few days I’ve been getting visits from a bot called: MaMa CaSpEr
I asked Google about it, and a few sites later I was told that it seems to be some kind of exploit scanner. That brought me to your site, especially to the Block List here, where I noticed Casper is included. So I included your list in my htaccess file (what should I say? Now it will be blocked for sure because of the Internal Error 500 I get, lol).
I mean my site went down now and I can’t access it anymore because of that internal error, which is the reason I ask whether the script is error free now.
Any advice would be highly appreciated…
Not sure, just a guess: could it be that I have too many rewrite conditions in my htaccess?
I’m also not sure whether that bot always comes named as: MaMa CaSpEr
If so, could you show me a way to block only this one instead of the large list?
Kindest regards,
stOrM!
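For blocking just that one bot, a minimal sketch follows. It assumes the bot always sends a user-agent string containing "MaMa CaSpEr" in some capitalization, and the 403 response via [F] is one choice among several.

```apache
RewriteEngine On
# [NC] makes the match case-insensitive, so "MaMa CaSpEr" and "mama casper" both match.
# The space is escaped because an unescaped space would end the pattern.
RewriteCond %{HTTP_USER_AGENT} mama\ casper [NC]
# Refuse matching requests with 403 Forbidden
RewriteRule .* - [F]
```

A rule this short is also a handy way to test whether mod_rewrite works on the server at all before pasting in the full list.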
Just a quick heads up. This list does cause issues with WP Super Cache. As someone noted above, they had issues with the first half of the list; I had issues with the second half. Once I removed it, the 410 errors went away.
I’m not savvy enough to know whether this is a bad thing for the cache, but the cache had seemed to be doing its job, just throwing a 410 error when doing a self test on the cache.
So Caveat Emptor if you’re a WP Super Cache user and want to use this script.
You’re right!
I disabled all the cache stuff since I do not have a heavy load, so caching isn’t required much… It’s working now and the casper stuff is blocked too…
Happily everything is working as expected now. I will be watching the Apache logs to sort things out, but I couldn’t find any serious errors for now anyway. Very nice script, thank you for sharing it!!!
Kindest regards,
s!
Well, this list has helped me block some persistent Chinese servers that have been bothering me for a year….thanks.
two problems have come up though.
195.47.199.229 - - [02/Jan/2011:09:50:48 -0800] "GET / HTTP/1.1" 410 270 "-" "WatchMouse/18990 (http://watchmouse.com/ ; it)"
208.85.4.114 - - [02/Jan/2011:09:55:52 -0800] "GET / HTTP/1.1" 410 270 "-" "WatchMouse/18990 (http://watchmouse.com/ ; ny)"
208.122.4.142 - - [02/Jan/2011:09:58:23 -0800] "GET / HTTP/1.1" 410 363 "-" "FreeWebMonitoring SiteChecker/0.1 (+http://www.freewebmonitoring.com)"
Both of these watch services are blocked by this list. I contacted WatchMouse and they confirmed that:
Sun, 02 Jan 2010 18:00 - [WM] Stan P. van de Burgt:
The default agent is WatchMouse/xx.yy where xx.yy is the version number
What else could be causing this?
ok well watchmouse lets you change the user agent string,
so changing it in watchmouse to firefox 3.5, fixes problem 1.
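An alternative to changing the monitor's user agent is to exempt it on the server side. The sketch below is one way to do that, not part of the published blacklist: the two exempted names are taken from the log lines above, and the second pattern is a shortened stand-in for the full list. It avoids having to find which blacklist token is matching the monitor.

```apache
RewriteEngine On
# Exempt the monitors first; RewriteCond lines are ANDed unless [OR] is given,
# so the rule fires only when this negated condition also holds.
RewriteCond %{HTTP_USER_AGENT} !(watchmouse|freewebmonitoring) [NC]
# Shortened stand-in for the full blacklist pattern
RewriteCond %{HTTP_USER_AGENT} (aspseek|casper) [NC]
# [G] (410 Gone) is assumed from the 410 responses in the log above
RewriteRule .* - [G]
```

Because the blacklist is presented as two sets of directives, the exemption line would need to be added to each set. Keep in mind that user-agent strings are easy to fake, so every exemption is also a small hole in the list.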
Quick note for anyone who has blocked blank user agents - if you are using Paypal IPN (Instant Payment Notifications) - Paypal for some reason uses a blank user agent - so you will fail to receive notifications of payments (eg: if running subscribe/member plugins using Wordpress or any other platform that you have integrated with Paypal).
Spent several days trying to diagnose why my Paypal integration wasn’t working - only to find it was due to Paypal’s (in my opinion) ridiculous practice of using a blank user agent.
Hope that helps someone avoid the same aggravation that I went through!
I was able to get Pingdom to work by removing pingd from the list.
However google site verification doesn’t work. I know at least it’s in the second batch of user agents.
The “|frontpage” entry in the blacklist caused Apache and cPanel at HM to think that FrontPage extensions were loaded, disabling the ability to password protect directories using cPanel or the file manager.
Removing |frontpage from the list fixed the issue.
Thanks to the commenters for the tips to remove ‘ida’ for the W3C validator and ‘pingd’ for the Pingdom uptime checker.
Google site verification doesn’t work though… Anyone know?
Thanks for your bad bot agent list. I will pick several of the bad bots, not all of them :)
Okay google site verification bot “verif” should be removed.
Also, not sure why, but removing “98” fixed all my WP sites giving 410 errors today in Chrome.
I have the same problem as above. Chrome 9 dislikes the blacklist and generates a 410. Had to remove the blacklist code for all of my sites.
Hi, I am new to the whole blogging thing, and I am not very technical. But I have been getting these 404 errors that look very strange, and I think they are coming from user agent MSIE 7.0. I am assuming it is one of those bad bots you were talking about in this article. What exactly do I have to do to stop that?
Thank you
Hi Jeff,
I’m still learning a lot from your work, thanks again for sharing your knowledge!
Is it normal that googlebot scans for a /contac/ page?!
66.249.68.88 - - [14/Apr/2011:13:26:38 -0500] "GET /contac/ HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Hi vale, not sure about “normal”, as googlebot will seek out anything it thinks is there.. if someone/thing has linked to a nonexistent page such as /contac/, then googlebot will go after it.
Also, lots of malicious scripts include variations on the misspelled /contac/ URL.. and often these scripts are run via fake user-agents. You may want to verify it’s actually googlebot by doing a forward/reverse IP/host lookup.
Hi Jeff, thanks a lot for your reply. I became aware of the “contac” case on your blog, which is why I got worried about this request. The IP for that user agent should be Google; at least that’s what I got from whois, dig, and host with that IP. So I don’t really know what to think. In my logs I’ve also seen something else that worried me a bit: this bot was also crawling for CSS and JavaScript.
Hi Jeff
I got this weird string in my 404 log, referring from wp-content, with my website name and at least 1000 characters. I don’t really know what that is or what to do about it.
Why are underscores ‘_’ escaped?
\_
Good question. I’ve always assumed they required it, but now that you mention it, I suppose there’s no need for it.. I’ll look into it and maybe break the habit if it’s not necessary. Thanks :)
Thanks for replying. I was just wondering, as I didn’t think there was a need for them to be escaped, but I wasn’t sure or maybe there was some other intention.
By the way, thank you for this list.
I see google, googlebot and verif in the user agent blacklist…
Forgive me as I don’t understand the code, but by including them doesn’t this mean that we are blocking Google from indexing my website?
Basically I’d like only Bing, Ask, Yahoo and Google to index my site. All others can be blocked.
Thanks!
matt, the list blocks “google wireless” and “googlebot-image”, but not “google” or “googlebot”. Either of these may be removed if you want to allow access. “verif” is included, and may be removed as well.
And to confirm: Bing, Ask, Yahoo and Google are not blocked by this blacklist. They will have access and may index your site.
Cheers!
Jeff… I can’t thank you enough for making these tools to help block all the junk traffic on my sites.
THANKS!!!