Taking Advantage of the X-Robots Tag
Controlling the spidering, indexing, and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:
<meta name="googlebot" content="index,archive,follow,noodp">
<meta name="robots" content="all,index,follow">
<meta name="msnbot" content="all,index,follow">
I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots” [1] crawl and represent my (X)HTML-based content in search results.
For other, non-(X)HTML types of content, however, using meta robots directives to control indexing and caching is not an option. An excellent example of this involves directing Google to index and cache PDF documents. The last time I checked, meta tags can’t be added to PDFs, Word documents, Excel documents, text files, and other non-(X)HTML-based content. The solution, of course, is to take advantage of the relatively new [2] HTTP header, X-Robots-Tag.
How it works
The X-Robots-Tag header takes the same parameters used by the meta robots tag. For example:
- index: index the page
- noindex: don’t index the page
- follow: follow links from the page
- nosnippet: don’t display descriptions or cached links
- nofollow: don’t follow links from the page
- noarchive: don’t cache/archive the page
- all: do whatever you want, default behavior
- none: do nothing, ignore the page
..and so on. Within meta tags, these directives make it possible to control exactly how search engines represent your (X)HTML-based web pages. And now, by setting these same directives via the X-Robots-Tag header, it is possible to extend SEO-related control over virtually every other type of content as well: PDFs, Word documents, Flash, audio, and video files, you name it!
Implementation
Implementing X-Robots-Tag functionality for your own files is easy. For dynamically generated content, such as PHP files, place the following code at the very top of your page, before any other output is sent:
<?php
// instruct supportive search engines to index and cache the page
header('X-Robots-Tag: index,archive');
?>
Of course, the actual robots parameters will vary, depending on whether or not the content should be indexed, archived, etc. Here is a good summary of available robots directives.
To implement X-Robots-Tag directives for non-PHP files, such as PDF, Flash, and Word documents, it is possible to set the headers via HTAccess. Customize one of the following HTAccess scripts according to your indexing needs and add it to your site’s root HTAccess file or Apache configuration file:
Include and cache all PDF, Word Documents, and/or Flash files in search results:
# index and archive specified file types
<IfModule mod_headers.c>
<FilesMatch "\.(doc|pdf|swf)$">
Header set X-Robots-Tag "index,archive"
</FilesMatch>
</IfModule>
Do not index/include any PDF, Word Docs, and/or Flash files in search results:
# do not index specified file types
<IfModule mod_headers.c>
<FilesMatch "\.(doc|pdf|swf)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
</IfModule>
Index/include all PDF, Word, and Flash files in search results, but do not cache or include a snippet:
# no cache no snippet for specified file types
<IfModule mod_headers.c>
<FilesMatch "\.(doc|pdf|swf)$">
Header set X-Robots-Tag "noarchive, nosnippet"
</FilesMatch>
</IfModule>
To customize any of these X-Robots-Tag directives, edit the file extensions to match those of your target files and edit the various robots parameters to suit your needs. Note that Google will also obey the unavailable_after directive when delivered via X-Robots-Tag header:
# specify expiration date
<IfModule mod_headers.c>
<FilesMatch "\.(doc|pdf|swf)$">
Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
</FilesMatch>
</IfModule>
..and likewise, you may also combine X-Robots-Tag directives:
# expiration date with no cache and no snippet
<IfModule mod_headers.c>
<FilesMatch "\.(doc|pdf|swf)$">
Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
# use "add" here: a second "set" would replace the header set above
Header add X-Robots-Tag "noarchive, nosnippet"
</FilesMatch>
</IfModule>
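Just as the meta tags at the top of this article address individual bots by name, Google's documentation describes prefixing an X-Robots-Tag value with a crawler name so that the directive applies to that crawler only. The following is a minimal sketch under that assumption, not a tested configuration: "otherbot" is a placeholder rather than a real crawler, and support among other search engines is not guaranteed.

```apache
# per-crawler directives for PDF files (sketch; "otherbot" is a placeholder name)
<IfModule mod_headers.c>
<FilesMatch "\.pdf$">
# applies only to Google's crawler
Header set X-Robots-Tag "googlebot: noarchive"
# "add" sends a second X-Robots-Tag header instead of replacing the first
Header add X-Robots-Tag "otherbot: noindex"
</FilesMatch>
</IfModule>
```

Crawlers that are not named in either header should fall back to their default behavior.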
And one more for the road.. Here is an alternate method of implementing your own set of custom X-Robots-Tag directives via HTAccess:
# index with no cache for pdf word and flash documents
<IfModule mod_headers.c>
<IfModule mod_setenvif.c>
SetEnvIf Request_URI "\.pdf$" x_tag=yes
SetEnvIf Request_URI "\.doc$" x_tag=yes
SetEnvIf Request_URI "\.swf$" x_tag=yes
Header set X-Robots-Tag "index, noarchive" env=x_tag
</IfModule>
</IfModule>
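The same FilesMatch approach can also target individual files by name rather than whole file types. The sketch below assumes three hypothetical PDF files (report.pdf, brochure.pdf, and pricing.pdf) that should stay out of search results; swap in your own file names and robots parameters.

```apache
# do not index or archive a specific set of files (hypothetical file names)
<IfModule mod_headers.c>
<FilesMatch "^(report|brochure|pricing)\.pdf$">
Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>
</IfModule>
```

Because FilesMatch tests the file name rather than the full URL path, the ^ and $ anchors restrict the match to exactly these names, in whichever directory the files live.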
Closing thoughts..
meta robots tags work great for controlling search-engine interaction with your (X)HTML-based web content. For other types of content, such as PDF, text, and multimedia files, use X-Robots-Tag headers instead. Currently supported by both Google and Yahoo, X-Robots-Tag headers are easily specified for any collection of file types with a few straightforward HTAccess directives. Once in place, these directives provide granular control over search-engine representation of your non-(X)HTML-based web content.
Footnotes
- 1 Currently, the four major search engines are Google, Yahoo!, MSN/Live/Bing, and Ask.
- 2 Currently, two of the four (three?) major search engines, Google and Yahoo, understand and follow X-Robots-Tag directives.
8 responses to “Taking Advantage of the X-Robots Tag”
I’m curious why you want granular control. You say this serves you well; what took you down this road?
As explained in the article, using the X-Robots-Tag has proven useful in controlling the indexing and archiving of specific file types for both Google and Yahoo. For example, many of my multimedia files are not included in search results. Using X-Robots-Tag makes this easy to do.
Hello Perishable
Thanks for doing this site, I’ve found a lot of good snippets…
In this one, though, I believe it should be at the end instead of .
# do not index specified file types
Header set X-Robots-Tag "noindex"
sorry, of course the code gets stripped:
the enclosing tag should be:
<FilesMatch "\.(doc|pdf|swf)$">
</FilesMatch>
not:
<FilesMatch "\.(doc|pdf|swf)$">
</Files>
Thanks lou — good catch. The article has been updated with the correct information. Thanks for helping to improve it.
Hi,
I would like to know if it’s possible to do it for a special page instead of files like (doc|pdf|swf)
It’s for a blog like this :
http://www.example.com/ici-la
How to do that with the example below.
Thank you so much for this interesting article
Hi Mesrine, you might want to check out robots meta tags. You can set robots directives like follow, nofollow, index, noindex, and so on for your web pages.
How would I write the code if a specific set of files is to be blocked? Suppose I have to block abc.pdf, andf.pdf, wert.pdf, etc.