Taking Advantage of the X-Robots Tag

Controlling the spidering, indexing and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:

<meta name="googlebot" content="index,archive,follow,noodp">
<meta name="robots" content="all,index,follow">
<meta name="msnbot" content="all,index,follow">

I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots” 1 crawl and represent my (X)HTML-based content in search results.

For other, non-(X)HTML types of content, however, using meta robots directives to control indexing and caching is not an option. An excellent example of this involves directing Google to index and cache PDF documents. The last time I checked, meta tags can’t be added to PDFs, Word documents, Excel documents, text files, and other non-(X)HTML-based content. The solution, of course, is to take advantage of the relatively new 2 HTTP header, X-Robots-Tag.

The X-Robots-Tag header takes the same parameters as used by meta robots tags. For example:

  • index — index the page
  • noindex — don’t index the page
  • follow — follow links from the page
  • nosnippet — don’t display descriptions or cached links
  • nofollow — don’t follow links from the page
  • noarchive — don’t cache/archive the page
  • none — don’t index the page or follow its links (same as noindex, nofollow)
  • all — do whatever you want, default behavior

..and so on. Within meta tags, these directives make it possible to control exactly how search engines represent your (X)HTML-based web pages. And now, setting these same directives via the X-Robots-Tag header, it is possible to extend SEO-related control over virtually every other type of content as well — PDFs, Word documents, Flash, audio, and video files — you name it!

Implementation

Implementing X-Robots-Tag functionality for your own files is easy. For dynamically generated content, such as PHP files, place the following code at the very top of your page, before any output is sent:

<?php
// instruct supportive search engines to index and cache the page
header('X-Robots-Tag: index,archive');
?>

Of course, the actual robots parameters will vary, depending on whether or not the content should be indexed, archived, etc. Here is a good summary of available robots directives.

To implement X-Robots-Tag directives for non-PHP files, such as PDF, Flash, and Word documents, it is possible to set the headers via HTAccess. Customize one of the following HTAccess scripts according to your indexing needs and add it to your site’s root HTAccess file or Apache configuration file:

Include and cache all PDF, Word, and Flash files in search results:

# index and archive specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "index,archive"
 </FilesMatch>
</IfModule>

Do not index/include any PDF, Word, or Flash files in search results:

# do not index specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "noindex"
 </FilesMatch>
</IfModule>
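
If only a single document needs to stay out of the index, the same header can be scoped to one file with a Files container rather than a FilesMatch pattern. This is a minimal sketch; the filename private-report.pdf is hypothetical:

```apache
# do not index one specific file (hypothetical filename)
<IfModule mod_headers.c>
 <Files "private-report.pdf">
  Header set X-Robots-Tag "noindex"
 </Files>
</IfModule>
```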

Index/include all PDF, Word, and Flash files in search results, but do not cache or include a snippet:

# no cache no snippet for specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "noarchive, nosnippet"
 </FilesMatch>
</IfModule>

To customize any of these X-Robots-Tag directives, edit the file extensions to match those of your target files and edit the various robots parameters to suit your needs. Note that Google will also obey the unavailable_after directive when delivered via X-Robots-Tag header:

# specify expiration date
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
 </FilesMatch>
</IfModule>

..and likewise, you may also combine X-Robots-Tag directives:

# expiration date with no cache and no snippet
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
  # use "add" for the second header; a second "set" would replace the first
  Header add X-Robots-Tag "noarchive, nosnippet"
 </FilesMatch>
</IfModule>

And one more for the road.. Here is an alternate method of implementing your own set of custom X-Robots-Tag directives via HTAccess:

# index with no cache for pdf word and flash documents
<IfModule mod_headers.c>
 <IfModule mod_setenvif.c>
  SetEnvIf Request_URI "\.pdf$" x_tag=yes
  SetEnvIf Request_URI "\.doc$" x_tag=yes
  SetEnvIf Request_URI "\.swf$" x_tag=yes
  Header set X-Robots-Tag "index, noarchive" env=x_tag
 </IfModule>
</IfModule>
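
As a rough sketch of the customization described earlier, the block below swaps in a different set of file extensions and robots parameters. The extensions (xls, txt) and the chosen directives are illustrative assumptions only, and the (?i) prefix, which makes the match case-insensitive, is an addition not used in the other examples here:

```apache
# keep spreadsheets and text files out of the index (illustrative extensions)
<IfModule mod_headers.c>
 <FilesMatch "(?i)\.(xls|txt)$">
  Header set X-Robots-Tag "noindex, nofollow"
 </FilesMatch>
</IfModule>
```

With the case-insensitive flag, a file named REPORT.XLS would receive the same header as report.xls.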

Closing thoughts..

Meta robots tags work great for controlling search-engine interaction with your (X)HTML-based web content. For other types of content, such as PDF, text, and multimedia files, use X-Robots-Tag headers instead. Currently supported by both Google and Yahoo, X-Robots-Tag headers are easily specified for any collection of file types with a few straightforward HTAccess directives. Once in place, these directives provide granular control over search-engine representation of your non-(X)HTML-based web content.

Footnotes

  • 1 Currently, the four major search engines are Google, Yahoo!, MSN/Live, and Ask.
  • 2 Currently, two of the four (three?) major search engines, Google and Yahoo, understand and follow X-Robots-Tag directives.