Tag: robots

Protect Your Site with a Blackhole for Bad Bots

Posted on July 14, 2010 in Websites by Jeff Starr

[ Black Hole ] One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots are immediately denied access to your site. I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.
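The one-strike rule can be sketched in a few lines. The following Python is only an illustration of the logic described above, not the Blackhole's actual PHP code: the trap path `/blackhole/`, the in-memory set, and the function name are all hypothetical, and the WHOIS lookup and data file are left out.

```python
# Illustrative sketch only: the real Blackhole is a PHP script that also
# performs a WHOIS lookup and records each event in a data file.
TRAP_PREFIX = "/blackhole/"  # hypothetical directory, disallowed in robots.txt


def handle_request(ip: str, path: str, blacklist: set) -> int:
    """Return an HTTP status code for a request under the one-strike rule."""
    if ip in blacklist:
        return 403  # already trapped: deny everything
    if path.startswith(TRAP_PREFIX):
        blacklist.add(ip)  # first and only strike
        return 403
    return 200  # normal visitor or well-behaved bot


blacklist = set()
print(handle_request("203.0.113.7", "/about/", blacklist))      # 200
print(handle_request("203.0.113.7", "/blackhole/", blacklist))  # 403
print(handle_request("203.0.113.7", "/about/", blacklist))      # 403
```

The third request is denied even though it targets an ordinary page, which is the "immediate banishment" part of the rule.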

In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

[ Blackhole Directory with Files ] The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. The blackhole script combines heavily modified versions of the Kloth.net script (for the bot trap) and the Network Query Tool (for the whois lookups). Refined over the years and completely revamped for this tutorial, the Blackhole consists of a single plug-&-play directory that contains the following four files:

Continue Reading

Stop 404 Requests for Mobile Versions of Your Site

Posted on April 26, 2010 in Function by Jeff Starr

If you’ve been keeping an eye on your 404 errors recently, you will have noticed an increase in requests for nonexistent mobile files and directories, especially over the past year or so. The scripts and bots requesting these files from your server seem to be looking for a mobile version of your site. Unfortunately, they are wasting bandwidth and resources in the process. It has become common to see the following 404 errors constantly repeated in your log files:

  • http://domain.tld/apple-touch-icon.png
  • http://domain.tld/iphone
  • http://domain.tld/mobile
  • http://domain.tld/mobi
  • http://domain.tld/m

So some bot comes along, assumes that your site includes a mobile version, and then tries its hand at guessing the location. In the common request-set listed above, we see the bot looking first for an “apple-touch icon,” and then for mobile content in various directories. If this only happens once in a while, it’s no big deal. But these days I’ve been seeing many different bots requesting these nonexistent resources.

Even worse, these mobile-hungry bots can’t seem to remember where they’ve been – they typically request the same resources repeatedly, and in multiple locations within the directory structure. I frequently see hundreds of these types of requests in my weekly error-log analyses. Needless to say, this is an incredible waste of time, bandwidth, and server resources.
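To gauge how much of this noise a site receives, the requested paths can be matched against the names listed above. Here is a minimal Python sketch, assuming the requested paths have already been pulled out of the log into a list; the sample paths and the function name are made up for the example.

```python
import re
from collections import Counter

# Names taken from the list of common 404 requests above. The pattern matches
# them as the final path segment at any depth in the directory structure.
MOBILE_PROBE = re.compile(r"/(apple-touch-icon\.png|iphone|mobile|mobi|m)/?$")


def count_mobile_probes(paths):
    """Count requests whose last path segment is one of the probed names."""
    counts = Counter()
    for path in paths:
        match = MOBILE_PROBE.search(path)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = ["/m", "/press/mobile/", "/apple-touch-icon.png", "/press/my-post/", "/m"]
print(count_mobile_probes(sample))
```

Because the pattern is anchored to the end of the path, an ordinary URL such as `/press/my-post/` is not counted even though a segment begins with "m".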

Continue Reading

Tell Google to Not Index Certain Parts of Your Page

Posted on August 23, 2009 in Websites by Jeff Starr

There are several ways to instruct Google to stay away from various pages in your site:

..and so on. These directives all function in different ways, but they all serve the same basic purpose: control how Google crawls the various pages on your site. For example, you can use meta noindex to instruct Google not to index your sitemap, RSS feed, or any other page you wish. This level of control over which pages are crawled and indexed is helpful, but what if you need to control how Google crawls the contents of a specific page? Easy. Google enables us to do this with a set of googleon/googleoff tags.

Continue Reading

Yahoo! Slurp too Stupid to be a Robot

Posted on March 15, 2009 in Nonsense, Websites by Jeff Starr

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our websites every minute of the day.

For the most part, there are effective methods available enabling us to protect our sites against the endless hordes of irrelevant and mischievous bots. Such evil is easily blocked with virtually zero side-effects because their presence is simply irrelevant.

But what about bad bots that aren’t exactly irrelevant, such as Yahoo’s mindless Slurp crawler? By failing to obey the robots.txt protocol as promised, Yahoo’s Slurp clearly falls into the “bad-bot” category. Unlike typical “nonsense” bots, however, Slurp is not exactly irrelevant (yet), so simply blocking it is not a reasonable solution.

Continue Reading

Yahoo! Lies about Obeying Robots.txt Directives

Posted on November 16, 2008 in Websites by Jeff Starr

There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Either case isn’t good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:

Continue Reading

Unexplained Crawl Behavior Involving Tagged Query Strings

Posted on June 4, 2008 in Websites by Jeff Starr

I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 Errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:

  • http://perishablepress.com/press/page/2/?tag=spam
  • http://perishablepress.com/press/page/3/?tag=code
  • http://perishablepress.com/press/page/2/?tag=email
  • http://perishablepress.com/press/page/2/?tag=xhtml
  • http://perishablepress.com/press/page/4/?tag=notes
  • http://perishablepress.com/press/page/2/?tag=flash
  • http://perishablepress.com/press/page/2/?tag=links
  • http://perishablepress.com/press/page/3/?tag=theme
  • http://perishablepress.com/press/page/2/?tag=press

..plus hundreds and hundreds more. The URL pattern is always the same: a different page number followed by a query string containing one of the tags used here at Perishable Press, for example: “/?tag=something”. The problem is that there are no such links anywhere on the site. The site employs permalink format for all WordPress-generated links (e.g., post/page links, search queries, tag queries, etc.). In an effort to locate the source of these URLs, I have performed multiple, thorough searches through every aspect of my site (files, database, code, examples, etc.), and the results are always the same: nothing. UGH! What is the source of these misguided URLs? Hopefully, you will be able to help shed some light on this unexplained crawl behavior (please, I beg you!!).

Continue Reading

Taking Advantage of the X-Robots Tag

Posted on June 3, 2008 in Function by Jeff Starr

Controlling the spidering, indexing and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:

<meta name="googlebot" content="index,archive,follow,noodp">
<meta name="robots" content="all,index,follow">
<meta name="msnbot" content="all,index,follow">

I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots” crawl and represent my (X)HTML-based content in search results.

For other, non-(X)HTML types of content, however, using meta robots directives to control indexing and caching is not an option. An excellent example of this involves directing Google to index and cache PDF documents. The last time I checked, meta tags can’t be added to PDFs, Word documents, Excel documents, text files, and other non-(X)HTML-based content. The solution, of course, is to take advantage of the relatively new HTTP header, X-Robots-Tag.
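As a rough illustration of the idea, the header value can be chosen by file type when a response is built. The sketch below is Python rather than server configuration, and the extension list, the directive values, and the function name are assumptions made for the example; it only shows which extra response header a non-(X)HTML file might receive.

```python
import os

# Hypothetical policy: directives to send for a few non-(X)HTML file types.
NON_HTML_POLICY = {
    ".pdf": "noindex, noarchive",
    ".doc": "noindex, noarchive",
    ".xls": "noindex, noarchive",
    ".txt": "noindex",
}


def robots_headers(filename: str) -> dict:
    """Return extra response headers for a file, keyed by header name."""
    ext = os.path.splitext(filename)[1].lower()
    value = NON_HTML_POLICY.get(ext)
    return {"X-Robots-Tag": value} if value else {}


print(robots_headers("report.PDF"))  # {'X-Robots-Tag': 'noindex, noarchive'}
print(robots_headers("index.html"))  # {}
```

(X)HTML pages get no extra header here because they can carry meta robots directives of their own.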

Continue Reading

Yahoo! Slurp in My Blackhole (Yet Again)

Posted on December 16, 2007 in Websites by Jeff Starr

Yup, ol’ Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs over the course of four different days. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible):

Continue Reading

Yahoo! in my Blackhole

Posted on November 25, 2007 in Websites by Jeff Starr

Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? Does your Slurp spider obey robots.txt directives or not? I have never seen Google crawling around that side of town, and neither MSN nor even Ask has ventured into the forbidden realms. Has anyone else experienced such unexpected behavior from one of the four major search engines? Hmmm.. let’s dig a little further..

Here is the carefully formulated, highly specific, properly placed robots.txt rule that explicitly and strictly forbids all agents from accessing my blackhole bot trap:

Continue Reading

Comprehensive Reference for WordPress NoNofollow/Dofollow Plugins

Posted on September 5, 2007 in WordPress by Jeff Starr

Recently, while deliberating an optimal method for eliminating nofollow link attributes from Perishable Press, I collected, installed, tested and reviewed every WordPress no-nofollow/dofollow plugin that I could find. As of the writing of this post, I have evaluated 15 dofollow plugins, all of which are freely available on the Internet.

In this article, I present a concise, current, and comprehensive reference for WordPress no-nofollow and dofollow plugins. Every attempt has been made to provide accurate, useful, and complete information for each of the plugins represented below. Further, as this subject is a newfound interest of mine, it is my intention to keep this post updated with fresh information, so please bookmark it for future reference. Finally, please help expand/enhance this list by dropping any relevant information via comment area below. Thanks & enjoy!

Continue Reading

Stop WordPress from Leaking PageRank to Admin Pages

Posted on August 29, 2007 in Function, WordPress by Jeff Starr

During the most recent Perishable Press redesign, I noticed that several of my WordPress admin pages had been assigned significant levels of PageRank. Not good. After some investigation, I realized that my ancient robots.txt rules were insufficient in preventing Google from indexing various WordPress admin pages. Specifically, the following pages have been indexed and subsequently assigned PageRank:

  • WP Admin Login Page
  • WP Lost Password Page
  • WP Registration Page
  • WP Admin Dashboard

Needless to say, it is important to stop WordPress from leaking PageRank to admin pages. Instead of wasting our hard-earned link-equity on non-ranking pages, let’s redirect it to more important pages and posts. In order to accomplish this, we will attack the problem on three different fronts: admin links, robots.txt rules, and meta tags. Let’s take a look at each of these methods..

Continue Reading

Eliminate 404 Errors for PHP Functions

Posted on August 27, 2007 in Function by Jeff Starr

Recently, I discussed the suspicious behavior exhibited by the Yahoo! Slurp crawler. As revealed by the site’s closely watched 404-error logs, Yahoo! had been requesting a series of nonexistent resources. Although a majority of the 404 errors were exclusive to the Slurp crawler, there were several instances of requests that were also coming from Google, Live, and even Ask. Initially, these distinct errors were misdiagnosed as existing URLs appended with various JavaScript functions. Here are a few typical examples of these frequently observed log entries:

http://perishablepress.com/press/category/websites/feed/function.opendir
http://perishablepress.com/press/category/websites/feed/function.array-rand
http://perishablepress.com/press/category/websites/feed/function.mkdir
http://perishablepress.com/press/category/websites/feed/ref.outcontrol

Fortunately, an insightful reader named Bas pointed out that the errors were actually PHP functions. Bas explains:

The two functions (array_rand and opendir) you define as javascript functions are PHP functions. Some servers generate clickable links to the php manual (which uses function.NAMEOFFUNCTION in their URL’s) in php scripting error messages. Maybe that’s also the cause of these problems.

Continue Reading

Suspicious Behavior from Yahoo! Slurp Crawler

Posted on August 13, 2007 in Websites by Jeff Starr

[ Image: Black and white illustration of the upper half of a man's suspicious, paranoid face ] Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces.

Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming the door on Yahoo! would be unwise, but if their Slurp crawler continues behaving suspiciously, I may have no choice. Check out the following records, pulled directly from one of my error logs, where Yahoo! exhibits some extremely questionable behavior.

Continue Reading

Invite Only: Visitor Exclusivity via the Opt-In Method

Posted on January 22, 2007 in Function by Jeff Starr

Web developers trying to control comment-spam, bandwidth-theft, and content-scraping must choose between two fundamentally different approaches: selectively deny target offenders (the "blacklist" method) or selectively allow desirable agents (the "opt-in", or "whitelist" method).

Judging by various online forums and discussion boards, the blacklist method is currently the more popular of the two. It requires the webmaster to create and maintain a working list of undesirable agents, usually blocking their access via htaccess or PHP. The downside of "blacklisting" is that it takes considerable effort to stay current with the ever-growing number of ever-evolving threats, which require exceedingly long lists for an effective response. Although time-consuming and potentially work-intensive (there are automated methods of blacklisting bad bots), blacklisting optimizes hits by allowing site access to anyone not on the blacklist. Unfortunately for blacklisters, it has become relatively trivial to disguise bots by using standard user-agent strings, so the bad guys bypass the blacklist and slip into your site incognito. Besides, nobody wants to waste valuable time digging through endless access logs. Whereas blacklisting is reactive, whitelisting is proactive..
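The difference between the two approaches is easiest to see side by side. The Python sketch below is a toy comparison, not a recommended rule set: the tokens in both lists and the function names are hypothetical, and real lists would be far longer.

```python
# Hypothetical lists for illustration only.
BLACKLIST = ("bitacle", "emailsiphon")
WHITELIST = ("mozilla", "googlebot", "slurp", "msnbot")


def allowed_by_blacklist(user_agent: str) -> bool:
    """Reactive: allow everyone except agents matching a known-bad token."""
    ua = user_agent.lower()
    return not any(token in ua for token in BLACKLIST)


def allowed_by_whitelist(user_agent: str) -> bool:
    """Proactive: deny everyone except agents matching a known-good token."""
    ua = user_agent.lower()
    return any(token in ua for token in WHITELIST)


unknown_bot = "TotallyNewScraper/0.1"
print(allowed_by_blacklist(unknown_bot))  # True: not on the list yet
print(allowed_by_whitelist(unknown_bot))  # False: never opted in
```

A brand-new bot gets through the blacklist until someone adds it, while the whitelist turns it away by default. Note that a bot spoofing a browser user-agent string would pass both checks, so user-agent matching alone is only part of either approach.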

Continue Reading

Disobedient Robots and Company

Posted on January 1, 2007 in Perishable, Websites by Jeff Starr

In our never-ending battle against spammers, leeches, scrapers, and other online undesirables, we have implemented several powerful security measures to improve the operational integrity of our perpetual virtual existence. Here is a rundown of the new behind-the-scenes security features of Perishable Press:

  • Automated spambot trap, designed to identify bots (and/or stupid people) that disobey rules specified in the site’s robots.txt file.
  • Automated disobedient-robot identification (via reverse IP lookup), admin-notification (via email) and blacklist inclusion (via htaccess).
  • Automated inclusion of disobedient robot identification on our now public "Disobedient Robots" page.
  • Improved htaccess rules, designed to eliminate scum-sucking worms and other useless vermin.
  • Automated tracking tools, designed to keep a close eye on any suspicious or questionable activity.
  • Automated 404-error statistics, designed to optimize the elimination of 404 errors.
  • Plus a few other secret-agent tricks that we are not at liberty to discuss ;)

As you can see, we have been pretty busy around here — fortunately, the new security features have been working flawlessly, reducing stolen bandwidth, potential spam, disobedient robots, and 404 errors. Hopefully, the end result of these new features will involve smoother site functionality and better browsing for everyone.

Stop Bitacle from Stealing Content

Posted on November 8, 2006 in Websites, WordPress by Jeff Starr

If you have yet to encounter the content-scraping site, bitacle.org, consider yourself lucky. The scum-sucking worm-holes at bitacle.org are well-known for literally, blatantly, and piggishly stealing blog content and using it for financial gains through advertising. While I am not here to discuss the legal, philosophical, or technical ramifications of illegal bitacle behavior, I am here to provide a few critical tools that will help stop bitacle from stealing your content.

The htaccess Finger

Perhaps the most straightforward and effective method for keeping the bitacle thieves away from your site is to use htaccess rules that block bitacle’s IP address and user agent and return a 403 Forbidden message (for more information on htaccess files, see our article, Stupid htaccess Tricks, referenced below). Add this to your site’s root htaccess file:

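# Enable mod_rewrite (required for the rules below to take effect)
RewriteEngine On
RewriteBase /
# Deny bitacle by IP address or by user-agent string
RewriteCond %{REMOTE_ADDR} ^212\.22\.59\.251$ [OR]
RewriteCond %{HTTP_USER_AGENT} Bitacle
RewriteRule .? - [F]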
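For readers less familiar with rewrite conditions, the matching logic of these rules can be mirrored in Python. This is only a model of the two conditions for experimenting with sample values, not something Apache runs; the function name and the sample addresses are made up.

```python
import re

BLOCKED_IP = re.compile(r"^212\.22\.59\.251$")  # same pattern as the first RewriteCond
BLOCKED_UA = re.compile(r"Bitacle")  # unanchored and case-sensitive, like the second


def is_forbidden(remote_addr: str, user_agent: str) -> bool:
    """True if either condition matches, mirroring the [OR] flag."""
    return bool(BLOCKED_IP.search(remote_addr) or BLOCKED_UA.search(user_agent))


print(is_forbidden("212.22.59.251", "Mozilla/5.0"))     # True (IP match)
print(is_forbidden("198.51.100.9", "Bitacle bot/1.1"))  # True (user-agent match)
print(is_forbidden("198.51.100.9", "bitacle"))          # False (case differs)
```

The last line shows that the user-agent condition is case-sensitive as written; in htaccess, appending the [NC] flag to that condition would make it match regardless of case.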

The robots.txt Slap

Next up, another effective anti-bitacle method that instructs the bitacle bots to stay away from your site. This method uses a robots.txt file in your site’s root directory and literally denies bitacle agents crawl-access to all site contents. Simply add the following lines to your site’s root robots.txt file (for more information on robots.txt, see our article, Robots Notes Plus, referenced below):

User-agent: Bitacle bot/1.1
Disallow: /

User-agent: Bitacle bot
Disallow: /

User-agent: Bitacle *
Disallow: /

User-agent: Bitacle*
Disallow: /

User-agent: Bitacle
Disallow: /
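One way to sanity-check rules like these before uploading them is to run them through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser` with a subset of the records above. It shows how that particular parser interprets the rules, which says nothing about whether bitacle's bots actually honor them; the `/press/` path is just a sample.

```python
from urllib.robotparser import RobotFileParser

# A subset of the Bitacle records, one blank line between records.
RULES = """\
User-agent: Bitacle bot
Disallow: /

User-agent: Bitacle
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("Bitacle bot/1.1", "/press/"))  # False
print(parser.can_fetch("Googlebot", "/press/"))        # True
```

Agents that match none of the records remain unaffected, which is the intended outcome here.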

Related WordPress Plugins

For more help on the anti-plagiarism front, check out Redalt’s Antileech Plugin and MaxPower’s Digital Fingerprint Plugin. These fine WordPress plugins come highly recommended and are definitely worth checking out.

Other Essential Tools

Beyond the essential preventative methods discussed above, there are many other resources and tools now available for dealing with site scrapers, content thieves, and other worthless garbage. A worthwhile website is Copyscape, which provides an excellent tool that enables users to search the web for stolen content. If you find that your content has indeed been plagiarized, read up on how to respond properly and effectively. Finally, try searching for various search terms, such as "plagiarism tools", "content scraping", "copyright protection", "syndication theft", etc. Good Luck!

References & Resources

Robots Notes Plus

Posted on April 3, 2006 in Function by Jeff Starr

About the Robots Exclusion Standard 1:

The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.

Notes on the robots.txt Rules:

Rules of specificity apply, not inheritance. Always include a blank line between rules. Note also that not all robots obey the robots rules — even Google has been reported to ignore certain robots rules. Also, comments are allowed (and recommended) within any robots.txt file when written on a per-line basis. Simply begin each line of comments with a pound sign “#”.

Prevent Robots from Indexing the Entire Site:

User-agent: *
Disallow: /

Prevent a Specific Robot from Indexing the Entire Site:

User-agent: Googlebot-Image
Disallow: /

Prevent all Robots from Indexing Specific Pages/Directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.html

A Specific Example:

In this example, no robots are allowed to index anything except for Google, which is allowed to index everything except the specified pages/directories. Note the required blank line between the rules.

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
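The "specificity, not inheritance" behavior of this example can be checked with Python's standard `urllib.robotparser`. This is a sketch of how one parser reads the rules, and individual robots may behave differently; the agent name `SomeOtherBot` and the sample paths are made up.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot follows only its own record; everyone else falls back to "*".
for agent, path in [
    ("Googlebot", "/tutorials/"),
    ("Googlebot", "/cgi-bin/script.cgi"),
    ("SomeOtherBot", "/tutorials/"),
]:
    print(agent, path, parser.can_fetch(agent, path))
```

Only the first combination is allowed: Googlebot is not bound by the blanket "Disallow: /" because its own, more specific record takes precedence.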

Another Specific Example:

In this example, no agents are allowed to index anything except for Alexa, which is allowed to index anything. Note that the Disallow value for Alexa is left empty: an empty value means that nothing is disallowed for that agent.

User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:

Prevent all Agents Except for Google:

Here is Google’s preferred way to disallow everything for all agents except Google, which is allowed everything. Note that “Allow” is not part of the original robots exclusion standard, so not every robot may honor it; for that reason it is not recommended.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

Notes on the “meta robots” Tag:

Certain robots rules may also be included in the head section of a web document. Examine the following examples:

<meta name="robots" content="noindex,nofollow,noarchive" />
<meta name="robots" content="noindex,nofollow" />
<meta name="googlebot" content="none" />
<meta name="alexa" content="all" />

Here is a general list of values available for the “content” attribute of the “meta robots” tag:

noindex, index — Determines indexing of site/pages.
nofollow, follow — Determines following of links.
nosnippet — Do not display excerpts or cached content.
noarchive — Do not display or collect cached content.

Additionally, Altavista supports:

noimageindex — Index text but not images.
noimageclick — Link to pages but not images.
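To see which of these values a given page actually declares, the meta tags can be read back with a small parser. Here is a minimal sketch using Python's standard `html.parser`; the class name and the default list of tag names are assumptions for the example, and the sample markup is taken from the examples above.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the content values of robots-style meta tags, keyed by name."""

    def __init__(self, names=("robots", "googlebot")):
        super().__init__()
        self.names = names
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in self.names:
            values = (attr.get("content") or "").split(",")
            self.directives[name] = {v.strip().lower() for v in values if v.strip()}


page = '<head><meta name="robots" content="noindex,nofollow" /></head>'
parser = MetaRobotsParser()
parser.feed(page)
print(parser.directives)
```

Each tag name maps to a set of directives, so a check such as `"noindex" in parser.directives["robots"]` is enough to test a page.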

References