Taking Advantage of the X-Robots Tag

Updated July 4, 2018 • 8 comments

Controlling the spidering, indexing and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:

<meta name="googlebot" content="index,archive,follow,noodp">
<meta name="robots" content="all,index,follow">
<meta name="msnbot" content="all,index,follow">

I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots”¹ crawl and represent my (X)HTML-based content in search results.

For other, non-(X)HTML types of content, however, using meta robots directives to control indexing and caching is not an option. An excellent example of this involves directing Google to index and cache PDF documents. The last time I checked, meta tags can’t be added to PDFs, Word documents, Excel documents, text files, and other non-(X)HTML-based content. The solution, of course, is to take advantage of the relatively new² HTTP header, X-Robots-Tag.

How it works

The X-Robots-Tag header takes the same parameters used by the meta robots tag. For example:

index — index the page
noindex — don’t index the page
follow — follow links from the page
nosnippet — don’t display descriptions or cached links
nofollow — don’t follow links from the page
noarchive — don’t cache/archive the page
all — do whatever you want, default behavior
none — ignore the page, equivalent to noindex plus nofollow

..and so on. Within meta tags, these directives make it possible to control exactly how search engines represent your (X)HTML-based web pages. And now, setting these same directives via the X-Robots-Tag header, it is possible to extend SEO-related control over virtually every other type of content as well — PDFs, Word documents, Flash, audio, and video files — you name it!

Implementation

Implementing X-Robots-Tag functionality for your own files is easy. For dynamically generated content, such as PHP files, place the following code at the very top of your page, before any output is sent:

<?php
// instruct supportive search engines to index and cache the page
header('X-Robots-Tag: index,archive');
?>

Of course, the actual robots parameters will vary, depending on whether or not the content should be indexed, archived, etc. Here is a good summary of available robots directives.

To implement X-Robots-Tag directives for non-PHP files, such as PDF, Flash, and Word documents, it is possible to set the headers via HTAccess. Customize one of the following HTAccess scripts according to your indexing needs and add it to your site’s root HTAccess file or Apache configuration file:

Include and cache all PDF, Word, and/or Flash files in search results:

# index and archive specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "index,archive"
 </FilesMatch>
</IfModule>

Do not index/include any PDF, Word Docs, and/or Flash files in search results:

# do not index specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "noindex"
 </FilesMatch>
</IfModule>

Index/include all PDF, Word, and Flash files in search results, but do not cache or include a snippet:

# no cache no snippet for specified file types
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "noarchive, nosnippet"
 </FilesMatch>
</IfModule>
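
The three examples above match only lowercase .doc, .pdf, and .swf extensions. As a rough sketch, assuming you also want to cover .docx files and uppercase extensions such as .PDF, the pattern could be loosened like this (the extension list and the noindex value are placeholders, not a recommendation):

```apache
# do not index specified file types, regardless of extension case
<IfModule mod_headers.c>
 # (?i) makes the match case-insensitive; docx? matches .doc and .docx
 <FilesMatch "(?i)\.(docx?|pdf|swf)$">
  Header set X-Robots-Tag "noindex"
 </FilesMatch>
</IfModule>
```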

To customize any of these X-Robots-Tag directives, edit the file extensions to match those of your target files and edit the various robots parameters to suit your needs. Note that Google will also obey the unavailable_after directive when delivered via X-Robots-Tag header:

# specify expiration date
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
 </FilesMatch>
</IfModule>

..and likewise, you may also combine X-Robots-Tag directives:

# expiration date with no cache and no snippet
<IfModule mod_headers.c>
 <FilesMatch "\.(doc|pdf|swf)$">
  # use one header value; a second "Header set" would replace the first
  Header set X-Robots-Tag "noarchive, nosnippet, unavailable_after: 4 Jul 2050 11:11:11 GMT"
 </FilesMatch>
</IfModule>

And one more for the road.. Here is an alternate method of implementing your own set of custom X-Robots-Tag directives via HTAccess:

# index with no cache for pdf word and flash documents
<IfModule mod_headers.c>
 <IfModule mod_setenvif.c>
  SetEnvIf Request_URI "\.pdf$" x_tag=yes
  SetEnvIf Request_URI "\.doc$" x_tag=yes
  SetEnvIf Request_URI "\.swf$" x_tag=yes
  Header set X-Robots-Tag "index, noarchive" env=x_tag
 </IfModule>
</IfModule>
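
The same approach can target individual files rather than whole file types. Here is a minimal sketch, assuming two hypothetical PDFs named annual-report.pdf and price-list.pdf. FilesMatch tests the file name only (not the full path), so the pattern can be anchored at both ends:

```apache
# do not index or archive two specific files (hypothetical names)
<IfModule mod_headers.c>
 <FilesMatch "^(annual-report|price-list)\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
 </FilesMatch>
</IfModule>
```

Swap in your own file names, separated by the pipe character, and adjust the robots parameters as needed.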

Closing thoughts..

Meta robots tags work great for controlling search-engine interaction with your (X)HTML-based web content. For other types of content, such as PDF, text, and multimedia files, use X-Robots-Tag headers instead. Currently supported by both Google and Yahoo, X-Robots-Tag headers are easily specified for any collection of file types with a few straightforward HTAccess directives. Once in place, these directives provide granular control over search-engine representation of your non-(X)HTML-based web content.

Footnotes

¹ Currently, the four major search engines are Google, Yahoo!, MSN/Live/Bing, and Ask.
² Currently, two of the four (three?) major search engines, Google and Yahoo, understand and follow X-Robots-Tag directives.

About the Author

Jeff Starr = Web Developer. Security Specialist. WordPress Buff.

8 responses to “Taking Advantage of the X-Robots Tag”

Ferodynamics 2008/06/05 7:53 pm

I’m curious why you want granular control. You say this serves you well, what took you down this road?
Perishable 2008/06/08 7:40 am • Post Author

As explained in the article, using the X-Robots-Tag has proven useful in controlling the indexing and archiving of specific file types for both Google and Yahoo. For example, many of my multimedia files are not included in search results. Using X-Robots-Tag makes this easy to do.
lou 2010/02/22 2:43 am

Hello Perishable
Thanks for doing this site, I’ve found a lot of good snippets…
In this one, though, I believe it should be at the end instead of .

# do not index specified file types
Header set X-Robots-Tag "noindex"
lou 2010/02/22 2:48 am

sorry, of course the code gets stripped:

the enclosing tag should be:

<FilesMatch "\.(doc|pdf|swf)$">
</FilesMatch>

not:

<FilesMatch "\.(doc|pdf|swf)$">
</Files>
Jeff Starr 2010/02/22 11:33 am • Post Author

Thanks lou — good catch. The article has been updated with the correct information. Thanks for helping to improve it.
Mesrine 2010/10/24 7:24 am

Hi,

I would like to know if it’s possible to do it for a special page instead of files like (doc|pdf|swf)

It’s for a blog like this :
http://www.example.com/ici-la

How to do that with the example below.

Thank you so much for this interesting article
Jeff Starr 2010/10/24 10:50 am • Post Author

Hi Mesrine, you might want to check out robots meta tags. You can set robots directives like follow, nofollow, index, noindex, and so on for your web pages.
Kamal 2012/10/26 4:38 am

How to write code if specific set of files to be blocked. Suppose I have to block abc.pdf, andf.pdf and wert.pdf etc.
