Plugin Sale! Save 15% on pro plugins with discount code: FALL2020
Web Dev + WordPress + Security

Robots Notes Plus

About the Robots Exclusion Standard:

The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.

Notes on the robots.txt Rules:

Rules of specificity apply, not inheritance. Always include a blank line between rules. Note also that not all robots obey the robots rules — even Google has been reported to ignore certain robots rules. Also, comments are allowed (and recommended) within any robots.txt file when written on a per-line basis. Simply begin each line of comments with a pound sign “#”.

Prevent Robots from Indexing the Entire Site:

User-agent: *
Disallow: /

Prevent a Specific Robot from Indexing the Entire Site:

User-agent: Googlebot-Image
Disallow: /

Prevent all Robots from Indexing Specific Pages/Directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.html

A Specific Example:

In this example, no robots are allowed to index anything except for Google, which is allowed to index everything except the specified pages/directories. Note the required blank line between the rules.

User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

Another Specific Example:

In this example, no agents are allowed to index anything except for Alexa, which is allowed to index anything. Note that there is a blank space after the colon, which enables this rule to work.

User-agent: *
Disallow: /
User-agent: ia_archiver
Disallow: 

Prevent all Agents Except for Google:

Here is Google’s preferred way to disallow all agents anything except Google, which is allowed everything. Note that “Allow” is not a standard parameter and therefore is not recommended.

User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /

Notes on the “meta robots” Tag:

Certain robots rules may also be included in the head section of a web document. Examine the following examples:

<meta name="robots" content="noindex,nofollow,noarchive" />
<meta name="robots" content="noindex,nofollow" />
<meta name="googlebot" content="none" />
<meta name="alexa" content="all" />

Here is a general list of values available for the “content” attribute of the “meta robots” tag:

  • noindex, index — Determines indexing of site/pages.
  • nofollow, follow — Determines following of links.
  • nosnippet — Do not display excerpts or cached content.
  • noarchive — Do not display or collect cached content.

Additionally, Altavista supports:

  • noimageindex — Index text but not images.
  • noimageclick — Link to pages but not images.

Jeff Starr
About the Author
Jeff Starr = Creative thinker. Passionate about free and open Web.
WP Themes In Depth: Build and sell awesome WordPress themes.
Welcome
Perishable Press is operated by Jeff Starr, a professional web developer and book author with two decades of experience. Here you will find posts about web development, WordPress, security, and more »
Blackhole Pro: Trap bad bots in a virtual black hole.
Thoughts
Stoked! Had a great interview with Eric over at Speckyboy.com :)
Air finally clearing here in WA. Feeling grateful to breathe again. #oxygenmatters
Past week here in WA state has been hellish. So much smoke, like living in a chimney.
Now in September, I’m where I wanted to be in March.
Spent some time updating my article on unsafe characters, once again current with latest IETF specification.
Just realized that “Neo” is an anagram for “One”. As in, “he is the One” (The Matrix).
To get VLC app to load all songs (including subfolders), go to Preferences ▸ Show All ▸ Playlist ▸ Subdirectory behavior ▸ Expand.
Newsletter
Get news, updates, deals & tips via email.
Email kept private. Easy unsubscribe anytime.