About the Robots Exclusion Standard:
The robots exclusion standard, or robots.txt protocol, is a convention that asks cooperating web spiders and other web robots not to access all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.
Notes on the robots.txt Rules:
Rules of specificity apply, not inheritance: a robot obeys the rule that most specifically names it and does not also inherit the rules written for other agents. Always include a blank line between rules. Note also that not all robots obey the robots rules; even Google has been reported to ignore certain robots rules (404 link removed 2016/12/31). Comments are also allowed (and recommended) within any robots.txt file when written on a per-line basis. Simply begin each line of comments with a pound sign “#”.
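As a rough illustration of the comment and blank-line conventions described above, a file might look like the following sketch. The agent name "ExampleBot" and the path /private/ are placeholders chosen for this example, not values from any real site.

```text
# Each comment sits on its own line and begins with "#"
# (ExampleBot and /private/ are hypothetical placeholders)
User-agent: *
Disallow: /private/

# A blank line separates the rule above from this one
User-agent: ExampleBot
Disallow: /
```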
Prevent Robots from Indexing the Entire Site:
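A minimal sketch of such a rule: the wildcard user-agent addresses every cooperating robot, and a lone slash covers the whole site.

```text
User-agent: *
Disallow: /
```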
Prevent a Specific Robot from Indexing the Entire Site:
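A sketch of the same idea aimed at a single robot. "ExampleBot" is a hypothetical name used only for illustration; in practice it would be replaced with the user-agent token that the robot in question announces.

```text
# "ExampleBot" is a placeholder, not a real crawler name
User-agent: ExampleBot
Disallow: /
```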
Prevent all Robots from Indexing Specific Pages/Directories:
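A sketch with one Disallow line per excluded location. The directory and file names here are assumptions made up for the example.

```text
# All paths below are hypothetical
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /drafts/old-page.html
```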
A Specific Example:
In this example, no robot except Google is allowed to index anything, and Google is allowed to index everything except the specified pages/directories. Note the required blank line between the rules.
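A file along these lines might be written as follows. This is only a sketch: the excluded paths are hypothetical, and "Googlebot" is assumed to be the user-agent token that Google's main crawler uses.

```text
User-agent: *
Disallow: /

# Paths below are placeholders for the pages/directories to exclude
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /private/
```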
Another Specific Example:
In this example, no agent except Alexa is allowed to index anything, and Alexa is allowed to index everything. Note that the Disallow value after the colon is left blank for Alexa, which is what enables this rule to work: an empty Disallow excludes nothing.
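One way this could be written is sketched below, assuming "ia_archiver" is the user-agent token announced by Alexa's crawler.

```text
User-agent: *
Disallow: /

# Assumes Alexa's crawler identifies itself as ia_archiver
User-agent: ia_archiver
Disallow:
```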
Prevent all Agents Except for Google:
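Here is Google’s preferred way to disallow everything for all agents except Google, which is allowed everything. Note that “Allow” was not part of the original robots exclusion standard, so robots that implement only the original rules may ignore it; for that reason it is not recommended here.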
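A sketch of the “Allow” form, again assuming "Googlebot" is the token for Google's crawler.

```text
User-agent: *
Disallow: /

# "Allow" is understood by Google, but not necessarily by every robot
User-agent: Googlebot
Allow: /
```

For robots that do not recognize “Allow”, an empty "Disallow:" line in the Googlebot rule expresses the same intent using only the original directives.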
Notes on the “meta robots” Tag:
Certain robots rules may also be included in the head section of a web document. Examine the following examples:
<meta name="robots" content="noindex,nofollow,noarchive" />
<meta name="robots" content="noindex,nofollow" />
<meta name="googlebot" content="none" />
<meta name="alexa" content="all" />
Here is a general list of values available for the “content” attribute of the “meta robots” tag:
index — Determines indexing of site/pages.
follow — Determines following of links.
nosnippet — Do not display excerpts or cached content.
noarchive — Do not display or collect cached content.
Additionally, AltaVista supports:
noimageindex — Index text but not images.
noimageclick — Link to pages but not images.