
Taking Advantage of the X-Robots Tag

Controlling the spidering, indexing and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:

<meta name="googlebot" content="index,archive,follow,noodp">
<meta name="robots" content="all,index,follow">
<meta name="msnbot" content="all,index,follow">

I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots”1 crawl and represent my (X)HTML-based content in search results.

For other, non-(X)HTML types of content, however, using meta robots directives to control indexing and caching is not an option. An excellent example of this involves directing Google to index and cache PDF documents. The last time I checked, meta tags can’t be added to PDFs, Word documents, Excel documents, text files, and other non-(X)HTML-based content. The solution, of course, is to take advantage of the relatively new2 HTTP header, X-Robots-Tag.

How it works

The X-Robots-Tag header accepts the same parameters used by the meta robots tag. For example:

  • index — index the page
  • noindex — don’t index the page
  • follow — follow links from the page
  • nosnippet — don’t display descriptions or cached links
  • nofollow — don’t follow links from the page
  • noarchive — don’t cache/archive the page
  • all — no restrictions; equivalent to index,follow (the default behavior)
  • none — equivalent to noindex,nofollow; don’t index the page or follow its links

..and so on. Within meta tags, these directives make it possible to control exactly how search engines represent your (X)HTML-based web pages. And now, by setting these same directives via the X-Robots-Tag header, it is possible to extend SEO-related control to virtually every other type of content as well — PDFs, Word documents, Flash, audio, and video files — you name it!
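To make the header format concrete, here is a minimal Python sketch (illustrative only, not how any search engine actually parses the header) that splits an X-Robots-Tag value into its directives, including the optional user-agent prefix form (e.g. "googlebot: noindex") that Google has documented:

```python
def parse_x_robots_tag(value):
    """Split an X-Robots-Tag header value into (user_agent, directives).

    Illustrative sketch only. Handles an optional user-agent prefix
    such as "googlebot: noindex", while leaving the colon inside an
    "unavailable_after: <date>" directive alone.
    """
    agent = None
    head = value.split(",", 1)[0]
    if ":" in head and not head.strip().lower().startswith("unavailable_after"):
        agent, value = value.split(":", 1)
        agent = agent.strip().lower()
    directives = [d.strip().lower() for d in value.split(",") if d.strip()]
    return agent, directives

print(parse_x_robots_tag("index,archive"))       # (None, ['index', 'archive'])
print(parse_x_robots_tag("googlebot: noindex"))  # ('googlebot', ['noindex'])
```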

Implementation

Implementing X-Robots-Tag functionality for your own files is easy. For dynamically generated content, such as PHP files, place the following code at the very top of your page, before any output is sent:

<?php
// instruct supportive search engines to index and cache the page
header('X-Robots-Tag: index,archive');
?>

Of course, the actual robots parameters will vary, depending on whether or not the content should be indexed, archived, etc. Here is a good summary of available robots directives.
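The key constraint is that headers must be sent before any response body, which is why the PHP snippet sits at the very top of the page. As an illustration of the same idea in another environment (my own sketch, not part of the article's setup), a minimal Python WSGI app sends X-Robots-Tag along with the other headers:

```python
def application(environ, start_response):
    # Emit X-Robots-Tag with the other response headers, before any
    # body output -- the same constraint as PHP's header() call.
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("X-Robots-Tag", "index,archive"),
    ])
    return [b"<html><body>Indexed and cached.</body></html>"]
```

Any server-side environment can set the header this way; search engines that support X-Robots-Tag read it from the HTTP response regardless of what generated the page.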

To implement X-Robots-Tag directives for non-PHP files, such as PDF, Flash, and Word documents, it is possible to set the headers via HTAccess. Customize one of the following HTAccess scripts according to your indexing needs and add it to your site’s root HTAccess file or Apache configuration file:

Include and cache all PDF, Word Documents, and/or Flash files in search results:

# index and archive specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "index,archive"
 </FilesMatch>
</IfModule>
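The FilesMatch argument is an ordinary regular expression. A quick Python check (using Python's re module as a stand-in for Apache's regex engine) confirms what the pattern does and does not match:

```python
import re

# the same pattern used in the FilesMatch directive above
pattern = re.compile(r"\.(doc|pdf|swf)$")

for name in ("report.pdf", "resume.doc", "intro.swf", "notes.txt", "backup.pdf.bak"):
    result = "matched" if pattern.search(name) else "not matched"
    print(f"{name}: {result}")
```

Note that the pattern is case-sensitive, so a file named report.PDF would slip through; broaden the pattern if that matters for your files.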

Do not index/include any PDF, Word Docs, and/or Flash files in search results:

# do not index specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "noindex"
 </FilesMatch>
</IfModule>

Index/include all PDF, Word, and Flash files in search results, but do not cache or include a snippet:

# no cache no snippet for specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "noarchive, nosnippet"
 </FilesMatch>
</IfModule>

To customize any of these X-Robots-Tag directives, edit the file extensions to match those of your target files and edit the various robots parameters to suit your needs. Note that Google will also obey the unavailable_after directive when delivered via X-Robots-Tag header:

# specify expiration date
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
 </FilesMatch>
</IfModule>
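If the date is malformed, search engines will most likely just ignore the directive, so it can be worth sanity-checking the timestamp before deploying. This small Python check (my own convenience sketch, not anything Apache or Google requires) parses the date string used above:

```python
from datetime import datetime

# the date string from the unavailable_after directive above
stamp = "4 Jul 2050 11:11:11 GMT"

# strptime discards the "GMT" zone name, yielding a naive datetime
expires = datetime.strptime(stamp, "%d %b %Y %H:%M:%S %Z")
print(expires.isoformat())  # 2050-07-04T11:11:11
```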

..and likewise, you may also combine X-Robots-Tag directives. Note that a second Header set would overwrite the first, so either merge everything into a single comma-separated value (as shown here) or use Header append:

# expiration date with no cache and no snippet
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT, noarchive, nosnippet"
 </FilesMatch>
</IfModule>

And one more for the road.. Here is an alternate method of implementing your own set of custom X-Robots-Tag directives via HTAccess:

# index with no cache for pdf word and flash documents
<IfModule mod_headers.c>
 <IfModule mod_setenvif.c>
  SetEnvIf Request_URI "\.pdf$" x_tag=yes
  SetEnvIf Request_URI "\.doc$" x_tag=yes
  SetEnvIf Request_URI "\.swf$" x_tag=yes
  Header set X-Robots-Tag "index, noarchive" env=x_tag
 </IfModule>
</IfModule>
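The logic of this environment-variable approach is easy to mirror in plain code. Here is a Python sketch (an analogy, not how Apache is implemented) of the per-request decision: flag the request when any extension pattern matches the URI, then attach the header only when the flag is set:

```python
import re

# anchored patterns matching the target file extensions
PATTERNS = (r"\.pdf$", r"\.doc$", r"\.swf$")

def x_robots_header_for(request_uri):
    """Return the extra response header for this URI, if any."""
    x_tag = any(re.search(p, request_uri) for p in PATTERNS)  # the SetEnvIf step
    if x_tag:                                                 # the env=x_tag step
        return {"X-Robots-Tag": "index, noarchive"}
    return {}

print(x_robots_header_for("/files/report.pdf"))  # {'X-Robots-Tag': 'index, noarchive'}
print(x_robots_header_for("/about.html"))        # {}
```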

Closing thoughts..

Meta robots tags work great for controlling search-engine interaction with your (X)HTML-based web content. For other types of content, such as PDF, text, and multimedia files, use X-Robots-Tag headers instead. Currently supported by both Google and Yahoo, X-Robots-Tag headers are easily specified for any collection of file types with a few straightforward HTAccess directives. Once in place, these directives provide granular control over search-engine representation of your non-(X)HTML-based web content.

Footnotes

  • 1 Currently, the four major search engines are Google, Yahoo!, MSN/Live/Bing, and Ask.
  • 2 Currently, two of the four (three?) major search engines, Google and Yahoo, understand and follow X-Robots-Tag directives.

About the Author
Jeff Starr = Web Developer. Security Specialist. WordPress Buff.

8 responses to “Taking Advantage of the X-Robots Tag”

  1. Ferodynamics 2008/06/05 7:53 pm

    I’m curious why you want granular control. You say this serves you well; what took you down this road?

  2. Perishable 2008/06/08 7:40 am

    As explained in the article, using the X-Robots-Tag has proven useful in controlling the indexing and archiving of specific file types for both Google and Yahoo. For example, many of my multimedia files are not included in search results. Using X-Robots-Tag makes this easy to do.

  3. Hello Perishable
    Thanks for doing this site, I’ve found a lot of good snippets…
    In this one, though, I believe it should be at the end instead of .

    # do not index specified file types
    Header set X-Robots-Tag "noindex"

  4. sorry, of course the code gets stripped:

    the enclosing tag should be:

    <FilesMatch "\.(doc|pdf|swf)$">
    </FilesMatch>

    not:

    <FilesMatch "\.(doc|pdf|swf)$">
    </Files>

  5. Thanks lou — good catch. The article has been updated with the correct information. Thanks for helping to improve it.

  6. Hi,

    I would like to know if it’s possible to do it for a specific page instead of file types like (doc|pdf|swf)

    It’s for a blog like this :
    http://www.example.com/ici-la

    How to do that with the example below.

    Thank you so much for this interesting article

  7. Hi Mesrine, you might want to check out robots meta tags. You can set robots directives like follow, nofollow, index, noindex, and so on for your web pages.

  8. How do I write the code if a specific set of files is to be blocked? Suppose I have to block abc.pdf, andf.pdf, wert.pdf, etc.
