Tag: security

How to Deal with Content Scrapers

Posted on September 24, 2010 in Websites by Jeff Starr

Chris Coyier of CSS-Tricks recently declared that people should do “nothing” in response to other sites scraping their content. I totally get what Chris is saying here. He is basically saying that the original source of content is better than scrapers because:

  • it’s on a domain with more trust.
  • you published that article first.
  • it’s coded better for SEO than theirs.
  • it’s better designed than theirs.
  • it isn’t at risk for serious penalization from search engines.

If these things are all true, then I agree, you have nothing to worry about. Unfortunately, that’s a tall order for many sites on the Web today. Although most scraping sites are pure and utter crap, the software available for automating the production of decent-quality websites is getting better everyday. More and more I’ve been seeing scraper sites that look and feel authentic because they are using some sweet WordPress theme and a few magical plugins. In the past, it was easy to spot a scraper site, but these days it’s getting harder to distinguish between scraped and original content. Not just for visitors, but for search engines too.

Continue Reading

2010 User-Agent Blacklist

Posted on August 9, 2010 in Websites by Jeff Starr

[ 2010 User-Agent Blacklist ] The 2010 User-Agent Blacklist blocks hundreds of bad bots while ensuring open-access for the major search engines: Google, Bing, Ask, Yahoo, et al. Blocking bad user-agents is an effective addition to any security strategy. It works like this: your site is getting hammered by rogue bots that waste valuable server resources and bandwidth. So you grab a copy of the 2010 UA Blacklist from Perishable Press, include it in your site’s root .htaccess file, and enjoy a more secure and better performing website. It’s that easy.

Proven Security

The 2010 UA Blacklist has been carefully constructed based on rigorous server-log analyses. Obsessive daily log monitoring reveals bad bots scanning for exploits, spamming resources, and wasting bandwidth. While analyzing malicious behavior, evil bots are identified and added to the UA Blacklist. Blocked user-agents are denied access to your site, increasing efficiency and providing safety for your visitors.

Continue Reading

Best Method for Email Obfuscation?

Posted on August 1, 2010 in Accessibility, Function by Jeff Starr

Awhile ago, Silvan Mühlemann conducted a 1.5 year experiment whereby different approaches to email obfuscation were tested for effectiveness. Nine different methods were implemented, with each test account receiving anywhere from 1800 to zero spam emails. Here is an excerpt from the article:

When displaying an e-mail address on a website you obviously want to obfuscate it to avoid it getting harvested by spammers. But which obfuscation method is the best one? I drove a test to find out.

After reading through the article and its many findings, here are what seem to be the best methods for obfuscating email addresses displayed publicly on web pages..

Continue Reading

Protect Your Site with a Blackhole for Bad Bots

Posted on July 14, 2010 in Websites by Jeff Starr

[ Black Hole ] One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots immediately are denied access to your site. I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.

In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

[ Blackhole Directory with Files ] The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. The blackhole script combines heavily modified versions of the Kloth.net script (for the bot trap) and the Network Query Tool (for the whois lookups). Refined over the years and completely revamped for this tutorial, the Blackhole consists of a single plug-&-play directory that contains the following four files:

Continue Reading

2010 IP Blacklist

Posted on July 6, 2010 in Websites by Jeff Starr

Over the course of each year, I blacklist a considerable number of individual IP addresses. Every day, Perishable Press is hit with countless numbers of spammers, scrapers, crackers and all sorts of other hapless turds. Weekly examinations of my site’s error logs enable me to filter through the chaff and cherry-pick only the most heinous, nefarious attackers for blacklisting. Minor offenses are generally dismissed, but the evil bastards that insist on wasting resources running redundant automated scripts are immediately investigated via IP lookup and denied access via simple htaccess directive:

<Limit GET POST PUT>
 Order Allow,Deny
 Allow from all
 Deny from 123.456.789
</LIMIT>

Although many of the worst attacks happen in randomized, zombie-like fashion, I have found that individual IPs that are not blacklisted will return repeatedly until finally blocked. Yet, despite the short-term success enjoyed by denying access to the most malicious IPs, the long-term futility of such blacklisting reflects the temporary nature of this solution.

In other words, I have found that blocking individual IPs is useful only for limited periods of time. Thus, every year, I gather my code and flush the blacklist of all individually blocked IP addresses. I then start fresh, adding the worst villains to the list, blocking entire IP ranges if necessary, and referring to previous versions of my htaccess files to cross-check suspiciously familiar entities. Eventually, a new blacklist emerges and I share it at Perishable Press. Here is the current version for 2010..

Continue Reading

Fixing WordPress Infinite Duplicate Content Issue

Posted on April 6, 2010 in WordPress by Jeff Starr

Jeff Morris recently demonstrated a potential issue with the way WordPress handles multipaged posts and comments. The issue involves WordPress’ inability to discern between multipaged posts and comments that actually exist and those that do not. By redirecting requests for nonexistent numbered pages to the original post, WordPress creates an infinite amount of duplicate content for your site. In this article, we explain the issue, discuss the implications, and provide an easy, working solution.

Understanding the “infinite duplicate content” issue

Using the <!--nextpage--> tag, WordPress makes it easy to split your post content into multiple pages, and also makes it easy to paginate the display of your comment threads. For both paged posts and paged comments, WordPress appends the page number to the permalink. So for example, if we have a post split into 3 pages, WordPress will generate the following set of completely valid permalinks (based on name-only permalink structure):

Continue Reading

Is it Secret? Is it Safe?

Posted on March 17, 2010 in Function by Jeff Starr

[ Enjoying the Evening ] Whenever I find myself working with PHP or messing around with server settings, I nearly always create a phpinfo.php file and place it in the root directory of whatever domain I happen to be working on. These types of informational files employ PHP’s handy phpinfo() function to display a concise summary of all of your server’s variables, which may then be referenced for debugging purposes, bragging rights, and so on.

While this sort of thing is normally okay, I frequently forget to remove the file and just leave it sitting there for the entire world to look at. This of course is a big “no-no” for site security, because the phpinfo.php file contains a hefty amount of information about my server, including stuff like:

  • The web server version
  • The IP address of the host
  • The version of the operating system
  • The root directory of the web server
  • Configuration information about the remote PHP installation
  • The username of the user who installed php and if they are a SUDO user

Continue Reading

Protect WordPress Against Malicious URL Requests

Posted on December 22, 2009 in WordPress by Jeff Starr

A few months ago, many WordPress sites were attacked with some extremely malicious code. While searching for a good solution, I discovered the following gem of a plugin in the pastebin repository:

<?php /* Plugin Name: Block Bad Queries */

if (strlen($_SERVER['REQUEST_URI']) > 255 || 
	strpos($_SERVER['REQUEST_URI'], "eval(") || 
	strpos($_SERVER['REQUEST_URI'], "base64")) {
		@header("HTTP/1.1 414 Request-URI Too Long");
		@header("Status: 414 Request-URI Too Long");
		@header("Connection: Close");
		@exit;
} ?>

This script checks for excessively long request strings (i.e., greater than 255 characters), as well as the presence of either “eval(” or “base64” in the request URI. These sorts of nefarious requests were implicated in the September 2009 WordPress attacks.

Continue Reading

How to Protect Your Site Against Content Thieves

Posted on September 23, 2009 in Blogging by Jeff Starr

Stolen content is the bane of every blogger who provides a publicly available RSS feed. By delivering your content via feed, you make it easy for scrapers to assimilate and re-purpose your material on their crap Adsense sites. It’s bad enough that someone would re-post your entire feed without credit, but to use it for cheap money-making schemes is about as pathetic as it gets. If you’re lucky, the bastards may leave all the links intact, so at least you will get a few back-links (if you have been linking internally) and get notified of the stolen content as well (via pingback or Google Alert). Lately, however, many of the scraper sites that I have seen are completely removing all links within the stolen content. Incidentally, there are some tell-tale signs that the site you are visiting is a scraper site:

  • No RSS feed available
  • Many quality posts that contain no links
  • Many quality posts but very low subscriber count
  • Great content but with zero comments on any posts
  • Lots of good content but with lots of Adsense or other ads
  • No “About” page or business information
  • And the number one brain-dead giveaway: no contact form or email address

If you pay attention as you surf around, you may want to keep an eye out for some of these dead giveaways. If it looks like the site is profiting from stolen content, it is advisable to leave immediately and locate an original source of information (you could even be cool and report the scraper site to the original author). I.e., help strengthen the legit blogging community and don’t support scrapers in any way. But avoiding scraper sites is merely an afterthought. The real challenge is to have a solid strategy in place that will help you identify, eliminate and prevent stolen content. Unfortunately, there is no “magic cure” that will stop the scrapers from stealing your hard work — apart from running a private site or not providing a feed — but there are many great tools that have proven quite effective in fighting the war against stolen content. While not completely exhaustive, here are some powerful tips and tricks that have served me well over the years:

Continue Reading

Disable Trace and Track for Better Security

Posted on September 6, 2009 in Function by Jeff Starr

The shared server on which I host Perishable Press was recently scanned by security software that revealed a significant security risk. Namely, the HTTP request methods TRACE and TRACK were found to be enabled on my webserver. The TRACE and TRACK protocols are HTTP methods used in the debugging of webserver connections.

Although these methods are useful for legitimate purposes, they may compromise the security of your server by enabling cross-site scripting attacks (XST). By exploiting certain browser vulnerabilities, an attacker may manipulate the TRACE and TRACK methods to intercept your visitors’ sensitive data. The solution, of course, is disable these methods on your webserver.

Continue Reading

HTAccess Password-Protection Tricks

Posted on July 13, 2009 in Function by Jeff Starr

Recently a reader asked about how to password-protect a directory for every specified IP while allowing open access to everyone else. In my article, Stupid htaccess Tricks, I show how to password-protect a directory for every IP except the one specified, but not for the reverse case. In this article, I will demonstrate this technique along with a wide variety of other useful password-protection tricks, including a few from my Stupid htaccess Tricks article. Before getting into the juicy stuff, we’ll review a few basics of HTAccess password protection.

Continue Reading

Secure Visitor Posting for WordPress

Posted on June 1, 2009 in WordPress by Jeff Starr

[ ~{*}~ ] Normally, when visitors post a comment to your site, specific types of client data are associated with the request. Commonly, a client will provide a user agent, a referrer, and a host header. When any of these variables is absent, there is good reason to suspect foul play. For example, virtually all browsers provide some sort of user-agent name to identify themselves. Conversely, malicious scripts directly posting spam and other payloads to your site frequently operate without specifying a user agent. In the Ultimate User-Agent Blacklist, we account for the “no-user-agent” case in the very first directive, preventing a host of anonymous visitors from hitting the site.

In addition to empty user-agent strings, malicious requests for site content frequently fail to provide any referrer information. Unless special privacy software is being used, the web page from which a visitor has arrived at your site will be specified in the header information for that request. Likewise, when a visitor posts a comment at your site, the referrer string for that post request will be the URL of that particular page. Thus, as with blank user-agent requests, no-referrer requests are frequently indicative of spam and other malicious behavior.

Another important piece of information provided by all legitimate clients is the host request header. The host header specifies the Internet host and port number of the requested resource. This information is required for all clients making HTTP/1.1 requests. Thus, requiring the host request-header field for all posts to your site safely eliminates illicit requests from hitting your server.

Continue Reading

4G Series: The Ultimate Referrer Blacklist, Featuring Over 8000 Banned Referrers

Posted on April 21, 2009 in Websites by Jeff Starr

You have seen user-agent blacklists, IP blacklists, 4G Blacklists, and everything in between. Now, in this article, for your sheer and utter amusement, I present a collection of over 8000 blacklisted referrers.

For the uninitiated, in teh language of teh Web, a referrer is the online resource from whence a visitor happened to arrive at your site. For example, if Johnny the Wonder Parrot was visiting the Mainstream Media website and happened to follow a link to your site (of all places), you would look at your access logs, notice Johnny’s visit, and speak out loud (slowly): “hmmm.. it looks like the Mainstream Media website referred my good pal Johnny to my Alka-Seltzer sales page.” In such a bizarre case, the Mainstream Media website — or specific page — is referred to as (no pun intended) the referrer.

Continue Reading

4G Series: The Ultimate User-Agent Blacklist, Featuring Over 1200 Bad Bots

Posted on March 29, 2009 in Websites by Jeff Starr

[ Image: Inverted Eclipse ] As discussed in my recent article, Eight Ways to Blacklist with Apache’s mod_rewrite, one method of stopping spammers, scrapers, email harvesters, and malicious bots is to blacklist their associated user agents. Apache enables us to target bad user agents by testing the user-agent string against a predefined blacklist of unwanted visitors. Any bot identifying itself as one of the blacklisted agents is immediately and quietly denied access. While this certainly isn’t the most effective method of securing your site against malicious behavior, it may certainly provide another layer of protection.

Even so, there are several things to consider before choosing to implement an extensive user-agent blacklist on your site. First and most importantly is the transient nature of the user agent itself. On most systems, the user-agent variable is easy to change, making it possible for bot owners to use any user-agent name they wish. Once a bad bot makes the rounds, becomes known, and is blacklisted, the bot owner need only modify or change its declared user agent and they’re back in business. User-agent names are constantly invented, spoofed, or otherwise altered in order to operate beneath — or above — the virtual radar. Thus, a user-agent blacklist is a high-maintenance affair, requiring continuous cultivation in order to maintain relevancy and effectiveness.

Continue Reading

The Perishable Press 4G Blacklist

Posted on March 16, 2009 in Websites by Jeff Starr

[ 4G Stormtrooper ] At last! After many months of collecting data, crafting directives, and testing results, I am thrilled to announce the release of the 4G Blacklist! The 4G Blacklist is a next-generation protective firewall that secures your website against a wide range of malicious activity. Like its 3G predecessor, the 4G Blacklist is designed for use on Apache servers and is easily implemented via HTAccess or the httpd.conf configuration file. In order to function properly, the 4G Blacklist requires two specific Apache modules, mod_rewrite and mod_alias. As with the third generation of the blacklist, the 4G Blacklist consists of multiple parts:

Update Feb 22, 2011: The 5G version of the blacklist is available now in beta.

Continue Reading

Yahoo! Slurp too Stupid to be a Robot

Posted on March 15, 2009 in Nonsense, Websites by Jeff Starr

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our websites every minute of the day.

For the most part, there are effective methods available enabling us to protect our sites against the endless hordes of irrelevant and mischievous bots. Such evil is easily blocked with virtually zero side-effects because their presence is simply irrelevant.

But what about bad bots that aren’t exactly irrelevant, such as Yahoo’s mindless Slurp crawler? By disobeying the robots.txt protocol as promised, Yahoo’s Slurp clearly falls into the “bad-bot” category. Unlike typical “nonsense” bots, Slurp is not exactly irrelevant (yet), so simply blocking them is not a reasonable solution.

Continue Reading

Building the Perishable Press 4G Blacklist

Posted on March 8, 2009 in Websites by Jeff Starr

[ Building the Hoover Dam, Part 1 ]

Last year, after much research and discussion, I built a concise, lightweight security strategy for Apache-powered websites. Prior to the development of this strategy, I relied on several extensive blacklists to protect my sites against malicious user agents and IP addresses. Unfortunately, these mega-lists eventually became unmanageable and ineffective. As increasing numbers of attacks hit my server, I began developing new techniques for defending against external threats. This work soon culminated in the release of a “next-generation” blacklist that works by targeting common elements of decentralized server attacks. Consisting of a mere 37 lines, this “2G” Blacklist provided enough protection to enable me to completely eliminate over 350 blacklisting directives from my site’s root htaccess file. This improvement increased site performance and decreased attack rates, however many bad hits were still getting through. More work was needed..

Continue Reading

Controlling Proxy Access with HTAccess

Posted on February 22, 2009 in Function by Jeff Starr

In my recent article on blocking proxy servers, I explain how to use HTAccess to deny site access to a wide range of proxy servers. The method works great, but some readers want to know how to allow access for specific proxy servers while denying access to as many other proxies as possible.

Fortunately, the solution is as simple as adding a few lines to my original proxy-blocking method. Specifically, we may allow any requests coming from our whitelist of proxy servers by testing Apache’s HTTP_REFERER variable, like so:

RewriteCond %{HTTP_REFERER} !(.*)allowed-proxy-01.domain.tld(.*)
RewriteCond %{HTTP_REFERER} !(.*)allowed-proxy-02.domain.tld(.*)
RewriteCond %{HTTP_REFERER} !(.*)allowed-proxy-03.domain.tld(.*)

Continue Reading

Eight Ways to Blacklist with Apache’s mod_rewrite

Posted on February 3, 2009 in Function by Jeff Starr

With the imminent release of the next series of (4G) blacklist articles here at Perishable Press, now is the perfect time to examine eight of the most commonly employed blacklisting methods achieved with Apache’s incredible rewrite module, mod_rewrite. In addition to facilitating site security, the techniques presented in this article will improve your understanding of the different rewrite methods available with mod_rewrite.

Blacklist via Request Method

[ #1 ] This first blacklisting method evaluates the client’s request method. Every time a client attempts to connect to your server, it sends a message indicating the type of connection it wishes to make. There are many different types of request methods recognized by Apache. The two most common methods are GET and POST requests, which are required for “getting” and “posting” data to and from the server. In most cases, these are the only request methods required to operate a dynamic website. Allowing more request methods than are necessary increases your site’s vulnerability. Thus, to restrict the types of request methods available to clients, we use this block of Apache directives:

Continue Reading

PHP Short Open Tag: Convenient Shortcut or Short Changing Security?

Posted on January 12, 2009 in Function by Bill Brown

[ Echo Shortcut Code ] Most of us learned how to use “echo()” in one of our very first PHP tutorials. That was certainly the case for me. As a consequence, I never really had a need to visit PHP’s documentation page for echo(). On a recent visit to Perishable Press, I saw a Tumblr post from Jeff about the use of PHP’s shortcut syntax for echo() but somewhere deep in my memory, there lurked a warning about its use. I decided to investigate.

Continue Reading

Yahoo! Lies about Obeying Robots.txt Directives

Posted on November 16, 2008 in Websites by Jeff Starr

There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Either case isn’t good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:

Continue Reading

Blacklist Candidate 2008-10-19

Posted on October 19, 2008 in Function by Jeff Starr

Welcome to the Perishable Press “Blacklist Candidate” series. In this post, we continue our new tradition of exposing, humiliating and banishing spammers, crackers and other worthless scumbags..

[ Photo: Television Flashback ] From time to time on the show, a contestant places a bid that is so absurd and so asinine that you literally laugh out loud, point at the monitor, and openly ridicule the pathetic loser. On such occasions, even the host of the show will laugh and mock the idiocy. Of course, this same situation happens frequently here at Perishable Press, where the scumbags that manage to escape the 3G Blacklist are proving themselves to be increasingly desperate and pathetic. Such is the case with this month’s official Blacklist Candidate Number 2008-10-19:

Come on down — you’re the next cracker whore to get banished from the site!

Continue Reading

Evil Incarnate, but Easily Blocked

Posted on September 15, 2008 in Websites by Jeff Starr

As my readers know, I spend a lot of time digging through error logs, preventing attacks, and reporting results. Occasionally, some moron will pull a stunt that deserves exposure, public humiliation, and banishment. There is certainly no lack of this type of nonsense, as many of you are well-aware.

3G Blacklist

Even so, I have to admit that I am very happy with my latest strategy against crackers, spammers, and other scumbags, namely, the 3G Blacklist. Since implementing this effective HTAccess security method, I have seen a dramatic decrease in the overall volume of malicious activity recorded in my Apache, PHP, and 404 error logs. Thankfully, the 3G Blacklist has kept things pretty quiet around here, at least as far as malicious activity is concerned. In fact, it has been awhile since I’ve seen anything worthy of sharing with you — until now.

Continue Reading