Web developers trying to control comment-spam, bandwidth-theft, and content-scraping must choose between two fundamentally different approaches: selectively deny target offenders (the "blacklist" method) or selectively allow desirable agents (the "opt-in", or "whitelist" method).
Currently popular according to various online forums and discussion boards is the blacklist method. The blacklist method requires the webmaster to create and maintain a working list of undesirable agents, usually blocking their access via htaccess or php. The downside of "blacklisting" is that it requires considerable effort to stay current with the exponential number of ever-evolving threats, which require exceedingly long lists for an effective response. Although time-consuming and potentially work-intensive (there are automated methods of blacklisting bad bots), blacklisting optimizes hits by allowing site access to anyone not on the blacklist. Unfortunately for blacklisters, it has become relatively trivial to disguise bots by using standard user-agent strings. So the bad guys bypass the blacklist and slip into your site incognito. Besides, nobody wants to waste valuable time digging through endless access logs. Whereas blacklisting is reactive, whitelisting is proactive..
Growing in popularity now is the exclusive, invitation-only "opt-in", or "whitelist" method. The upside of the whitelist method is that it delivers a devastating blow to online scum trying to spam your site, steal your bandwidth, and scrape your content. Whitelisting is also gaining favor because it eliminates the time-consuming task of maintaining and updating rapidly growing, incredibly long blacklists, which, incidentally, may also affect server performance. With the opt-in method, you simply set a simple block of htaccess rules and you’re done. Nonetheless, the argument against whitelisting is that it is still possible for spambots to disguise themselves and gain access to your site. Not to mention the inevitable consequences of denying access to legitimate users, browsers, and other agents.
It all comes down to choosing between the lesser of two evils: false negatives or false positives. If security is an issue, and you aren’t terribly concerned about allowing absolutely everyone into your site, then whitelisting is probably the best option. Not only will you end up saving massive amounts of time, but you will also eliminate a great deal of spam, bandwidth theft, and stolen content. On the other hand, if you are running a business or operating a public-service oriented site, government site, etc., unintentionally denying access to legitimate users is an absolutely horrifying prospect.
Here is an example of a extremely restrictive whitelist that blocks everyone except for the major search engines (Google, Yahoo, MSN, Ask) and popular browsers (Firefox, Internet Explorer, Opera, Safari). Everyone — and we mean everyone (e.g., Lynx, cell phones, PDA’s, anonymous agents, etc.) — is denied access. Also be advised that the following whitelist, although common, is far from perfect — search engines are constantly evolving, multiplying, and experimenting. Further research is always advisable;)
# Exclusive whitelist for search engines and browsers # Google agents BrowserMatchNoCase Googlebot good_pass BrowserMatchNoCase Mediapartners-Google good_pass # Yahoo agents BrowserMatchNoCase Slurp good_pass BrowserMatchNoCase Yahoo-MMCrawler good_pass # MSN agents BrowserMatchNoCase ^msnbot good_pass BrowserMatchNoCase SandCrawler good_pass # Ask agents BrowserMatchNoCase Teoma good_pass BrowserMatchNoCase Jeeves good_pass # Popular browsers BrowserMatchNoCase ^Mozilla good_pass BrowserMatchNoCase ^Opera good_pass BrowserMatchNoCase ^MSIE IE good_pass BrowserMatchNoCase ^Safari good_pass # The bouncer <Limit GET POST PUT HEAD> order deny,allow deny from all allow from env=good_pass </Limit>