Robots Notes Plus
About the Robots Exclusion Standard:
The robots exclusion standard, also known as the robots.txt protocol, is a convention that asks cooperating web spiders and other web robots not to access all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.
Notes on the robots.txt Rules:
Rules of specificity apply, not inheritance: a robot follows the most specific group of rules that matches its name and does not also inherit the rules given for “*”. Always include a blank line between rules. Note also that not all robots obey the robots rules; even Google has been reported to ignore certain robots rules. Also, comments are allowed (and recommended) within any robots.txt file when written on a per-line basis. Simply begin each line of comments with a pound sign “#”.
Prevent Robots from Indexing the Entire Site:
User-agent: *
Disallow: /
Prevent a Specific Robot from Indexing the Entire Site:
User-agent: Googlebot-Image
Disallow: /
Prevent all Robots from Indexing Specific Pages/Directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.html
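Rules like these can be sanity-checked before they are deployed. The following is a minimal sketch using Python's standard urllib.robotparser module; the agent name ExampleBot, the helper is_allowed, and the test paths are made up for illustration, and real robots may interpret the file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, as they would appear in robots.txt.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.html
"""


def is_allowed(rules: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # "ExampleBot" is a hypothetical robot name; it falls under "User-agent: *".
    for path in ("/cgi-bin/script.pl", "/tutorials/blank.html", "/tutorials/index.html"):
        print(path, is_allowed(RULES, "ExampleBot", path))
```

Disallow values are matched as path prefixes, so /tutorials/index.html stays reachable while everything under /cgi-bin/ is blocked.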
A Specific Example:
In this example, no robots are allowed to index anything except for Google, which is allowed to index everything except the specified pages/directories. Note the required blank line between the rules.
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
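To see specificity at work, the sketch below loads this two-group file into Python's standard urllib.robotparser module and compares what Googlebot and some other robot may fetch. The name ExampleBot, the helper may_fetch, and the test paths are assumptions for illustration only; this shows how one parser reads the file, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Two groups separated by a blank line: one for everyone, one for Googlebot.
PER_AGENT_RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""


def may_fetch(agent: str, path: str, rules: str = PER_AGENT_RULES) -> bool:
    """Parse the rules and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Googlebot matches its own group, so the blanket "Disallow: /" is not applied to it.
    print("Googlebot /index.html:", may_fetch("Googlebot", "/index.html"))
    print("Googlebot /cgi-bin/a:", may_fetch("Googlebot", "/cgi-bin/a"))
    # ExampleBot (hypothetical) only matches the "*" group.
    print("ExampleBot /index.html:", may_fetch("ExampleBot", "/index.html"))
```

The Googlebot group replaces the “*” group for that robot rather than adding to it, which is why Googlebot can still reach /index.html.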
Another Specific Example:
In this example, no agents are allowed to index anything except for Alexa, which is allowed to index anything. Note that the Disallow value for Alexa is left empty (nothing follows the colon), which means that nothing is disallowed for that agent.
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
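The effect of an empty Disallow value can be checked with Python's standard urllib.robotparser module. This is only a sketch: the helper allowed_for, the agent name ExampleBot, and the test path are hypothetical, and the result describes this parser rather than any specific crawler.

```python
from urllib.robotparser import RobotFileParser

# Everyone is shut out except ia_archiver, whose Disallow value is empty.
EMPTY_DISALLOW_RULES = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""


def allowed_for(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under EMPTY_DISALLOW_RULES."""
    parser = RobotFileParser()
    parser.parse(EMPTY_DISALLOW_RULES.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # An empty Disallow value disallows nothing for the named agent.
    print("ia_archiver:", allowed_for("ia_archiver", "/privatedir/page.html"))
    # ExampleBot (hypothetical) falls back to the "*" group and is blocked.
    print("ExampleBot:", allowed_for("ExampleBot", "/privatedir/page.html"))
```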
Prevent all Agents Except for Google:
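Here is Google’s preferred way to disallow all agents except Google, which is allowed everything. Note that “Allow” was not part of the original robots exclusion standard, so some robots may not recognize it; for that reason it is not recommended when an empty “Disallow:” achieves the same result.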
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
Notes on the “meta robots” Tag:
Certain robots rules may also be included in the head section of a web document. Examine the following examples:
<meta name="robots" content="noindex,nofollow,noarchive" />
<meta name="robots" content="noindex,nofollow" />
<meta name="googlebot" content="none" />
<meta name="alexa" content="all" />
Here is a general list of values available for the “content” attribute of the “meta robots” tag:
noindex, index: Determines indexing of site/pages.
nofollow, follow: Determines following of links.
nosnippet: Do not display excerpts or cached content.
noarchive: Do not display or collect cached content.
Additionally, Altavista supports:
noimageindex: Index text but not images.
noimageclick: Link to pages but not images.
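To check which “meta robots” values a page actually carries, the tags can be read back out of the HTML. The sketch below uses Python's standard html.parser module; the class MetaRobotsParser, the function robots_directives, and the sample page are hypothetical names introduced for illustration, and the code only collects the values without interpreting them.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the content values of <meta> tags whose name is in `names`."""

    def __init__(self, names=("robots",)):
        super().__init__()
        self.names = {n.lower() for n in names}
        self.directives = {}  # meta name -> set of lowercase values

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name not in self.names:
            return
        content = attr.get("content") or ""
        values = {v.strip().lower() for v in content.split(",") if v.strip()}
        self.directives.setdefault(name, set()).update(values)


def robots_directives(html: str, names=("robots",)) -> dict:
    """Return a dict mapping each matching meta name to its set of values."""
    parser = MetaRobotsParser(names)
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = (
        "<html><head>"
        '<meta name="robots" content="noindex,nofollow,noarchive" />'
        '<meta name="googlebot" content="none" />'
        "</head><body></body></html>"
    )
    print(robots_directives(page, names=("robots", "googlebot")))
```

Passing extra names such as googlebot picks up robot-specific tags alongside the general robots tag.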