How to Block Baidu Bot
A user of my 6G Firewall recently asked how to block the “baidu” bot from accessing their site. This post explains why Baidu is not blocked in 6G and provides a quick .htaccess technique to deny it (or anything claiming to be it) access to your site.
A cry for help
Recently one of my users sent an urgent message:
My aim to avoid mainly “baidu” eating all my bandwidth!! Also I learned that many other bots also using the name ‘baidu’. I found more than 50 different IPs & 10,000++ entries, either referer or requests, named ‘baidu’ within 7 days! … Can you please suggest somethings on this or if you already posted any blog for above issue, please tell me the links.
After providing this person a quick, easy-to-implement solution, I thought it would be useful to share the technique here at Perishable Press. Read on to learn more about the Baidu search bot and how to block it from accessing your site.
What is Baidu and why it’s not blocked by 6G et al
Here is the big blurb on Baidu:
Baidu, Inc., incorporated on January 18, 2000, is a Chinese web services company headquartered at the Baidu Campus in Beijing’s Haidian District. Baidu offers many services, including a Chinese search engine for websites, audio files and images.
Basically, Baidu is not blocked in 6G Firewall, BBQ Pro, or Blackhole for Bad Bots because it’s a search engine used by the majority of the Chinese population. It’s like the Chinese version of Google, if you will. This is also why I include Baidu in my list of user agents for the top search engines: under normal circumstances it may not be prudent to ban Baidu by default, especially if you’re going for as much traffic as you can get.
Why block Baidu?
On the other hand, there are numerous reasons why someone would want to block Baidu, for example:
- Too many bad requests reporting “baidu” as user agent (whether legit or not)
- Fine-tuning traffic to a particular geographic region
- Personal strategy, political reasons, etc.
Of course, the reason is ultimately irrelevant: if you own your website, you are free to block or allow whomever you wish. Further, keep in mind that “blocking baidu” does not target only the actual search engine bot. Rather, it blocks any bot that reports itself as being “baidu”. In my experience, there are legions of bad bots that transmit counterfeit identities. I don’t know about you, but my personal security strategy leans towards blocking any bot that is pretending to be someone else. Dishonest little bots.
How to block Baidu
Fortunately, blocking “baidu” is dead simple using a slice of .htaccess. Perhaps the easiest, most effective way of doing the job is to add the following directives to your site’s root .htaccess file:
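# block baidu bot
<IfModule mod_rewrite.c>
# enable the rewrite engine (harmless if already enabled elsewhere in this file)
RewriteEngine On
# match "baidu" anywhere in the user agent, case-insensitive
RewriteCond %{HTTP_USER_AGENT} baidu [NC]
# respond with 403 Forbidden and stop processing rules
RewriteRule .* - [F,L]
</IfModule>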
Once implemented, that snippet provides extra-strong protection against anything claiming to be “baidu”: it returns a 403 Forbidden response to any request that includes the term “baidu” anywhere in the reported user agent, regardless of letter case. It is very effective, so only add it to your site if you know what you are doing and are positive that you want to say goodbye to Baidu.
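The message quoted earlier also mentions “baidu” showing up in referrer entries, which the user-agent rule alone does not cover. As a rough, untested sketch (a hypothetical variation, not part of 6G), the same rule could be extended with a second condition that checks the referrer. Be aware that this assumes you are willing to turn away real visitors who click through from Baidu search results, because their requests carry a Baidu referrer too:

```apache
# block "baidu" in user agent OR referrer (hypothetical variation, test before use)
<IfModule mod_rewrite.c>
RewriteEngine On
# [OR] joins this condition with the next one; [NC] ignores case
RewriteCond %{HTTP_USER_AGENT} baidu [NC,OR]
RewriteCond %{HTTP_REFERER} baidu [NC]
# respond with 403 Forbidden for any matching request
RewriteRule .* - [F,L]
</IfModule>
```

Use one version or the other, not both. One way to check the result is to send a request with a fake user agent, for example `curl -I -A "baidu" https://example.com/` (with your own domain in place of example.com), and confirm that the server answers with 403 Forbidden.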
5 responses to “How to Block Baidu Bot”
Baidu says that they follow robots.txt directives.
http://help.baidu.com/question?prod_en=master&class=498&id=1000973
However as they only index simplified Chinese language pages there’s usually no point letting them crawl the site at all.
Good point, thanks John.
John, why would you think that Baidu only indexes Chinese language pages? I don’t believe that is true at all. On the page you referred to, a bit further down there are instructions to see what Baidu has indexed of any given site. I just added perishablepress to it and there are many pages that in fact have been indexed, total amount of results is actually 3,410 pages!
link to see what is indexed: http://www.baidu.com/s?wd=site%3Aperishablepress.com
That is one reason I personally do not block Baidu. I need any traffic I can get. Very interesting indeed.
That’s news to me, in the past they were crawling my site at a crazy speed and sending zero traffic. The research I did at the time told me that they only indexed simplified Chinese.
Looks like things have changed, or the information I read at the time was wrong.