Taking Advantage of the X-Robots Tag
Controlling the spidering, indexing, and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:
<meta name="googlebot" content="index,archive,follow,noodp">
<meta name="robots" content="all,index,follow">
<meta name="msnbot" content="all,index,follow">
I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots”1 crawl and represent my (X)HTML-based content in search results.
For other, non-(X)HTML types of content, however, using meta robots directives to control indexing and caching is not an option. An excellent example of this involves directing Google to index and cache PDF documents. The last time I checked, meta tags can’t be added to PDFs, Word documents, Excel documents, text files, and other non-(X)HTML-based content. The solution, of course, is to take advantage of the relatively new2 HTTP header, X-Robots-Tag.
How it works
The X-Robots-Tag header takes the same parameters used by meta robots tags. For example:
index: index the page
noindex: don’t index the page
follow: follow links from the page
nosnippet: don’t display descriptions or cached links
nofollow: don’t follow links from the page
noarchive: don’t cache/archive the page
all: no restrictions, default behavior
none: equivalent to noindex, nofollow (don’t index the page or follow its links)
...and so on. Within meta tags, these directives make it possible to control exactly how search engines represent your (X)HTML-based web pages. And now, by setting these same directives via the X-Robots-Tag header, it is possible to extend SEO-related control over virtually every other type of content as well: PDFs, Word documents, Flash, audio, and video files, you name it!
Implementing X-Robots-Tag functionality for your own files is easy. For dynamically generated content, such as PHP files, place the following code at the very top of your page, before any output is sent:
<?php
// instruct supportive search engines to index and cache the page
header('X-Robots-Tag: index,archive');
?>
Of course, the actual robots parameters will vary, depending on whether or not the content should be indexed, archived, etc. Here is a good summary of available robots directives.
To implement X-Robots-Tag directives for non-PHP files, such as PDF, Flash, and Word documents, set the headers via HTAccess. Customize one of the following HTAccess scripts according to your indexing needs and add it to your site’s root HTAccess file or Apache configuration file:
Include and cache all PDF, Word Documents, and/or Flash files in search results:
# index and archive specified file types
<IfModule mod_headers.c>
    <FilesMatch "\.(doc|pdf|swf)$">
        Header set X-Robots-Tag "index,archive"
    </FilesMatch>
</IfModule>
Do not index/include any PDF, Word Docs, and/or Flash files in search results:
# do not index specified file types
<IfModule mod_headers.c>
    <FilesMatch "\.(doc|pdf|swf)$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>
</IfModule>
Index/include all PDF, Word, and Flash files in search results, but do not cache or include a snippet:
# no cache no snippet for specified file types
<IfModule mod_headers.c>
    <FilesMatch "\.(doc|pdf|swf)$">
        Header set X-Robots-Tag "noarchive, nosnippet"
    </FilesMatch>
</IfModule>
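The three blocks above differ only in the FilesMatch pattern and the header value, so the same shape should extend to other cases. As a rough sketch, the following assumes a hypothetical set of extensions (xls, txt, and mp3, none of which come from the examples above) and a server whose regex engine accepts the (?i) flag, so that uppercase extensions such as .MP3 are matched as well:

```apache
# hypothetical example: keep spreadsheets, text files, and MP3s out of the index
# (the extensions are placeholders; substitute your own)
<IfModule mod_headers.c>
    # (?i) makes the match case-insensitive, so report.XLS is covered too
    <FilesMatch "(?i)\.(xls|txt|mp3)$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>
</IfModule>
```

Because the pattern is an ordinary regular expression tested against the file name, a handful of specific files could be targeted the same way, for instance with a pattern like "^(report|draft)\.pdf$" (again, made-up file names).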
To customize any of these X-Robots-Tag directives, edit the file extensions to match those of your target files and edit the various robots parameters to suit your needs. Note that Google will also obey the unavailable_after directive when delivered via X-Robots-Tag headers:
# specify expiration date
<IfModule mod_headers.c>
    <FilesMatch "\.(doc|pdf|swf)$">
        Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
    </FilesMatch>
</IfModule>
...and likewise, you may also combine X-Robots-Tag directives:
# expiration date with no cache and no snippet
<IfModule mod_headers.c>
    <FilesMatch "\.(doc|pdf|swf)$">
        Header set X-Robots-Tag "unavailable_after: 4 Jul 2050 11:11:11 GMT"
        # "add" sends a second X-Robots-Tag header; a second "set" would replace the first
        Header add X-Robots-Tag "noarchive, nosnippet"
    </FilesMatch>
</IfModule>
And one more for the road. Here is an alternate method of implementing your own set of custom X-Robots-Tag directives via HTAccess:
# index with no cache for pdf word and flash documents
<IfModule mod_headers.c>
    <IfModule mod_setenvif.c>
        SetEnvIf Request_URI "\.pdf$" x_tag=yes
        SetEnvIf Request_URI "\.doc$" x_tag=yes
        SetEnvIf Request_URI "\.swf$" x_tag=yes
        Header set X-Robots-Tag "index, noarchive" env=x_tag
    </IfModule>
</IfModule>
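One possible advantage of the environment-variable approach is that SetEnvIf tests the whole request path rather than just the file name, so it can cover a directory or a single URL. A minimal sketch, assuming a hypothetical /downloads/ directory whose entire contents should stay out of search results (the directory and the x_tag_private variable name are both made up for illustration):

```apache
# hypothetical example: noindex everything requested under /downloads/
# (the directory name is an assumption; adjust it to your own site)
<IfModule mod_headers.c>
    <IfModule mod_setenvif.c>
        # flag any request whose URI starts with /downloads/
        SetEnvIf Request_URI "^/downloads/" x_tag_private=yes
        # send the header only when the flag is set
        Header set X-Robots-Tag "noindex, nofollow" env=x_tag_private
    </IfModule>
</IfModule>
```

Keep in mind that if several Header set lines apply to the same response, the last one to run replaces the earlier values, so it is probably simplest to keep one rule per group of files.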
Meta robots tags work great for controlling search-engine interaction with your (X)HTML-based web content. For other types of content, such as PDF, text, and multimedia files, use X-Robots-Tag headers instead. Currently supported by both Google and Yahoo, X-Robots-Tag headers are easily specified for any collection of file types with a few straightforward HTAccess directives. Once in place, these directives provide granular control over search-engine representation of your non-(X)HTML-based web content.
- 1 Currently, the four major search engines are Google, Yahoo!, MSN/Live/Bing, and Ask.
- 2 Currently, two of the four (three?) major search engines, Google and Yahoo, understand and follow X-Robots-Tag directives.
I’m curious why you want granular control. You say this serves you well; what took you down this road?
As explained in the article, using the X-Robots-Tag has proven useful in controlling the indexing and archiving of specific file types for both Google and Yahoo. For example, many of my multimedia files are not included in search results. Using X-Robots-Tag makes this easy to do.
Thanks for doing this site, I’ve found a lot of good snippets…
In this one, though, I believe it should be at the end instead of .
# do not index specified file types
Header set X-Robots-Tag "noindex"
sorry, of course the code gets stripped:
the enclosing tag should be:
Thanks lou — good catch. The article has been updated with the correct information. Thanks for helping to improve it.
I would like to know if it’s possible to do it for a special page instead of files like (doc|pdf|swf)
It’s for a blog like this :
How to do that with the example below.
Thank you so much for this interesting article
Hi Mesrine, you might want to check out robots meta tags. You can set robots directives like follow, nofollow, index, noindex, and so on for your web pages.
How to write code if specific set of files to be blocked. Suppose I have to block abc.pdf, andf.pdf and wert.pdf etc.