Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.
Robots.txt in 30 seconds
Primarily, robots directives deny obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al. know where they can go when visiting your site. You can also do nifty stuff like target specific user-agents and declare sitemaps. For just a simple text file, robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.
Robots.txt and WordPress
Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. Here is a good starting point for your next WP-based robots.txt:
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Allow: /wp-content/uploads/
Sitemap: http://example.com/sitemap.xml
That’s a plug-n-play version that you can further customize to fit your specific site structure as well as your own SEO strategy. To use this code for your WordPress-powered site, just copy/paste it into a blank file named “robots.txt” located in your web-accessible root directory, for example:
http://perishablepress.com/robots.txt
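If you want to sanity-check the starter rules before uploading them, here is a minimal sketch using Python's standard-library urllib.robotparser. The sample paths and the can_crawl helper are made up for illustration. This parser may resolve overlapping Allow/Disallow rules differently than the big search engines do, so the sketch sticks to unambiguous paths.

```python
from urllib.robotparser import RobotFileParser

# The starter rules from above, pasted in as a string.
STARTER_RULES = """\
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Allow: /wp-content/uploads/
Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(STARTER_RULES.splitlines())


def can_crawl(path, agent="*"):
    """Return True if the given agent may fetch the path on example.com."""
    return parser.can_fetch(agent, "http://example.com" + path)


# Hypothetical sample paths, chosen only for illustration.
for path in ("/2013/04/hello-world/", "/feed/", "/wp-admin/", "/wp-login.php"):
    print(path, "->", "crawl" if can_crawl(path) else "blocked")

# Declared sitemaps (site_maps() needs Python 3.8 or later).
print(parser.site_maps())
```

Only the post URL should come back as crawlable. The other three paths are caught by the Disallow rules.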
If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice some additional robots directives that are used to forbid crawl access to stuff like the site’s blackhole for bad bots. Let’s have a look:
User-agent: *
Disallow: /feed/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: ?wptheme=
Disallow: /blackhole/
Disallow: /transfer/
Disallow: /tweets/
Disallow: /mint/
Allow: /tag/mint/
Allow: /tag/feed/
Allow: /wp-content/online/
Sitemap: http://perishablepress.com/sitemap-perish.xml
Sitemap: http://perishablepress.com/sitemap-press.xml
User-agent: ia_archiver
Disallow: /
Spiders don’t need to be crawling around anything in my /cgi-bin/, so that’s disallowed, as are several non-WP subdirectories such as my Twitter Archive, FTP directory, and Mint installation.
Then I add a few explicit Allow directives to unblock access to specific URLs otherwise disallowed by existing rules. I also declare both of my sitemaps (yes you can have more than one), and then finally completely disallow access to everything for the annoying ia_archiver user-agent.
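To make the Allow-versus-Disallow interplay concrete, here is a small sketch of the "most specific rule wins" precedence described in RFC 9309: the longest matching path decides, and Allow wins a tie. Not every crawler resolves conflicts this way (some simpler parsers just take the first matching line), and is_allowed is a hypothetical helper written for this example, not part of any library. The sample URLs are made up.

```python
def is_allowed(rules, path):
    """Decide whether a path may be crawled under one group of rules.

    rules is a list of (directive, prefix) pairs. The longest matching
    prefix wins, and Allow beats Disallow when both are the same length.
    """
    best = None  # (prefix length, is_allow) of the strongest match so far
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path is crawlable by default.
    return True if best is None else best[1]


# A few rules taken from the file above.
RULES = [
    ("Disallow", "/wp-content/"),
    ("Disallow", "/mint/"),
    ("Allow", "/wp-content/online/"),
]

print(is_allowed(RULES, "/wp-content/online/demo.html"))  # True
print(is_allowed(RULES, "/wp-content/themes/style.css"))  # False
print(is_allowed(RULES, "/about/"))                       # True
```

The longer Allow path for /wp-content/online/ outranks the shorter Disallow for /wp-content/, which is how a crawler following this precedence would read the file.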
Previously on robots.txt
As mentioned, my previous robots.txt file went unchanged for several years (which just vanished in the blink of an eye), but proved quite effective, especially with compliant spiders like Googlebot. Unfortunately, it contains language that only a few of the bigger search engines understand (and thus obey):
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: http://perishablepress.com/sitemap.xml
User-agent: ia_archiver
Disallow: /
Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign $) is probably not well-supported either, although Google certainly gets it.
These patterns may be better supported in the future, but going forward there is no reason to include them. As with the robots examples above, the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
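To see why a wildcard rule can silently do nothing, here is a rough sketch comparing two hypothetical matchers: prefix_match, which compares the rule literally the way a bot without wildcard support would, and wildcard_match, which reads * as "any run of characters" and a trailing $ as "the URL ends here". Both functions and the sample paths are assumptions for illustration. Real crawlers have their own implementations.

```python
import re


def prefix_match(rule, path):
    """Literal prefix comparison, as a bot without wildcard support would do."""
    return path.startswith(rule)


def wildcard_match(rule, path):
    """Match with * as any run of characters and a trailing $ as end of URL."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with ".*" where each * stood.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        return re.fullmatch(pattern, path) is not None
    return re.match(pattern, path) is not None


# The old wildcard rule only works for bots that understand wildcards.
print(wildcard_match("/*/feed/*", "/press/feed/"))  # True
print(prefix_match("/*/feed/*", "/press/feed/"))    # False

# A plain rule is understood by both kinds of bot.
print(wildcard_match("/feed/", "/feed/"))  # True
print(prefix_match("/feed/", "/feed/"))    # True

# The end-pattern symbol: .js files match, but not with a query string attached.
print(wildcard_match("/*/*.js$", "/press/wp-content/app.js"))  # True
print(wildcard_match("/*/*.js$", "/press/app.js?ver=2"))       # False
```

For a prefix-only bot, the old /*/feed/* line never matches anything, so the feed stays crawlable. The plain /feed/ rule reads the same to every bot.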
106 Responses
Rajesh – April 18, 2013
Is it advisable to block /feed url on Robots.txt File?
User-agent: *
Disallow: /feed/
If Yes, Why Would You Do That ?
Jeff Starr – April 18, 2013
You don’t want search engines to return your feeds in the search results.. you want them to return your web pages :)
Chris M. – April 18, 2013
Jeff,
In that regard, what are your thoughts on Joost’s comment on this matter:
Here’s the (short) post where he says that:
http://yoast.com/example-robots-txt-wordpress/
Jeff Starr – April 18, 2013
Sure that makes sense if you don’t have a sitemap ;) Otherwise, keeping your feed content out of search results keeps juice focused on your actual web pages.
Chris M. – April 19, 2013
@Jeff: Now that’s some sense-talk! :-)
Thanks!
Praveen – May 11, 2013
Thanks for publishing this helpful article, its very informative, keep it Up!